Unlike NGramTokenFilter, this class sets offsets so that the characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, "abcde" would be tokenized as (minGram=2, maxGram=3):
| Term | ab | abc | bc | bcd | cd | cde | de |
|---|---|---|---|---|---|---|---|
| Position increment | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Position length | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Offsets | [0,2[ | [0,3[ | [1,3[ | [1,4[ | [2,4[ | [2,5[ | [3,5[ |
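As a minimal sketch, the table above can be reproduced with the standard Lucene analysis attributes; the two-argument constructor shown here assumes Lucene 5+ (4.x constructors also take a Version and a Reader):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class NGramOffsetsDemo {
    public static void main(String[] args) throws Exception {
        String input = "abcde";
        try (NGramTokenizer tokenizer = new NGramTokenizer(2, 3)) { // minGram=2, maxGram=3
            tokenizer.setReader(new StringReader(input));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // The offsets point back into the original stream, so the
                // substring they delimit is identical to the term chars.
                System.out.printf("%s [%d,%d[ -> %s%n",
                        term,
                        offset.startOffset(),
                        offset.endOffset(),
                        input.substring(offset.startOffset(), offset.endOffset()));
            }
            tokenizer.end();
        }
    }
}
```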
This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion, so that input streams larger than 1024 chars (the limit of the previous implementation) are supported,
- count grams based on Unicode code points instead of Java chars (and never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream, via isTokenChar(int), before n-grams are computed (see the sketch after this list).
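The pre-tokenization hook works by overriding isTokenChar(int). A minimal sketch, where the subclass name is hypothetical and the two-argument constructor again assumes Lucene 5+:

```java
import org.apache.lucene.analysis.ngram.NGramTokenizer;

// Hypothetical subclass: only letters and digits are fed into the n-gram
// computation; any other character acts as a boundary between tokens.
public class AlphanumGramTokenizer extends NGramTokenizer {
    public AlphanumGramTokenizer(int minGram, int maxGram) {
        super(minGram, maxGram);
    }

    @Override
    protected boolean isTokenChar(int chr) {
        return Character.isLetterOrDigit(chr);
    }
}
```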
Additionally, this class no longer trims trailing whitespace, and it emits tokens in a different order: tokens are now emitted by increasing start offset, whereas they used to be emitted by increasing length (which prevented large input streams from being supported).
Although highly discouraged, it is still possible to use the old behavior through Lucene43NGramTokenizer.