This should be a good tokenizer for most European-language documents.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
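To give a feel for what maintaining your own token rules involves, here is a minimal, self-contained sketch. It is a toy regex-based stand-in, not the JFlex grammar behind ClassicTokenizer, and it does not use the Lucene API. The class name SimpleGrammarTokenizer and both token rules are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of Lucene.
public class SimpleGrammarTokenizer {

    // Illustrative rules (assumptions, not ClassicTokenizer's real grammar),
    // tried in order at each position:
    //   1. an e-mail-like string: local part, '@', dotted domain
    //   2. a run of letters/digits, allowing single inner dots or apostrophes
    private static final Pattern TOKEN = Pattern.compile(
        "[\\p{L}\\p{N}._-]+@[\\p{L}\\p{N}-]+(?:\\.[\\p{L}\\p{N}-]+)+"
        + "|[\\p{L}\\p{N}]+(?:[.'][\\p{L}\\p{N}]+)*");

    // Returns the tokens found in the text, in order; everything that
    // matches neither rule (spaces, commas, trailing dots) is dropped.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Mail, bob@example.com, or, visit, lucene.apache.org, today]
        System.out.println(
            tokenize("Mail bob@example.com, or visit lucene.apache.org today."));
    }
}
```

In this sketch, changing tokenization behavior means editing the rule set in one place. A real grammar-based tokenizer would express such rules in a JFlex specification and generate the scanner from it.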