A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text. This is an abstract class; concrete subclasses are Tokenizer, a TokenStream whose input is a Reader, and TokenFilter, a TokenStream whose input is another TokenStream.
The TokenStream API was introduced with Lucene 2.9. This API has moved from being Token-based to Attribute-based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a Token is to use AttributeImpls. TokenStream now extends AttributeSource, which provides access to all of the token Attributes for the TokenStream. Note that only one instance per AttributeImpl is created and reused for every token. This approach reduces object creation and allows local caching of references to the AttributeImpls. See #incrementToken() for further details.
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
2. The consumer calls TokenStream#reset().
3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4. The consumer calls #incrementToken() until it returns false, consuming the attributes after each call.
5. The consumer calls #end() so that any end-of-stream operations can be performed.
6. The consumer calls #close() to release any resources when finished using the TokenStream.

You can find some example code for the new API in the analysis package level Javadoc.
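The consumer side of this workflow can be sketched as follows (assuming Lucene on the classpath; the analyzer and field name are illustrative, and in 2.9 itself the term attribute was named TermAttribute rather than the later CharTermAttribute):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConsumerWorkflow {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    TokenStream stream =
        analyzer.tokenStream("field", new StringReader("some query text"));
    // Store a local reference to the attribute once; the same instance
    // is reused for every token.
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    try {
      stream.reset();                    // step 2: prepare the stream
      while (stream.incrementToken()) {  // step 4: advance to the next token
        System.out.println(termAtt.toString());
      }
      stream.end();                      // step 5: end-of-stream operations
    } finally {
      stream.close();                    // step 6: release resources
    }
  }
}
```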
Sometimes it is desirable to capture the current state of a TokenStream, e.g., for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case, AttributeSource#captureState and AttributeSource#restoreState can be used.
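A buffering filter along these lines might look like the following (a hypothetical sketch for illustration only, not the actual CachingTokenFilter source):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

// Hypothetical buffering filter: captures each token's full attribute
// state on the first pass, so buffered tokens can later be replayed
// into this stream's attributes via restoreState.
public final class BufferingFilter extends TokenFilter {
  private final List<AttributeSource.State> buffer = new ArrayList<>();

  public BufferingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      buffer.add(captureState()); // snapshot all attribute values
      return true;
    }
    return false;
  }

  // Restore the i-th buffered token's state into this stream's attributes.
  public void replay(int i) {
    restoreState(buffer.get(i));
  }
}
```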
The {@code TokenStream}-API in Lucene is based on the decorator pattern. Therefore all non-abstract subclasses must be final or have at least a final implementation of #incrementToken! This is checked when Java assertions are enabled.
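To illustrate the requirement, a conforming filter can simply be declared final (a hedged sketch; Lucene ships its own LowerCaseFilter, and this hypothetical variant is shown only for shape):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// The class is final, satisfying the decorator-pattern contract that is
// checked (via assertions) when a TokenStream subclass is constructed.
public final class MyLowerCaseFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public MyLowerCaseFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream reached
    }
    // Lowercase the current token's term text in place.
    char[] buf = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      buf[i] = Character.toLowerCase(buf[i]);
    }
    return true;
  }
}
```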