java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
All Implemented Interfaces:
Closeable
A TokenStream enumerates the sequence of tokens, either from
Fields of a Document or from query text.
This is an abstract class; concrete subclasses are:
Tokenizer, a TokenStream whose input is a Reader; and
TokenFilter, a TokenStream whose input is another
TokenStream.
A new TokenStream API was introduced with Lucene 2.9. This API
has moved from being Token-based to Attribute-based. While
Token still exists in 2.9 as a convenience class, the preferred way
to store the information of a Token is to use AttributeImpls.
TokenStream now extends AttributeSource, which provides
access to all of the token Attributes for the TokenStream.
Note that only one instance per AttributeImpl is created, and it is reused
for every token. This approach reduces object creation and allows local
caching of references to the AttributeImpls. See
#incrementToken() for further details.
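The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/TokenFilters which add/get
   attributes to/from the AttributeSource.
2. The consumer calls #reset().
3. The consumer retrieves attributes from the stream and stores local
   references to all attributes it wants to access.
4. The consumer calls #incrementToken() until it returns false and
   consumes the attributes after each call.
5. The consumer calls #end() so that any end-of-stream operations
   can be performed.
6. The consumer calls #close() to release any resources when finished
   using the TokenStream.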
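
The consumer side of this workflow can be sketched in plain Java. The classes below (`SketchTermAttribute`, `SketchTokenStream`, `SketchWhitespaceStream`, `WorkflowSketch`) are hypothetical stand-ins written for this illustration so that the example runs without Lucene on the classpath. They are not Lucene classes; only the call order (`reset()`, `incrementToken()` until it returns `false`, `end()`, `close()`) and the single reused attribute instance mirror the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a term attribute: one mutable instance reused for every token.
final class SketchTermAttribute {
    private String term = "";
    void setTerm(String term) { this.term = term; }
    String term() { return term; }
}

// Hypothetical stand-in for the stream contract described above (not the Lucene class).
abstract class SketchTokenStream {
    protected final SketchTermAttribute termAtt = new SketchTermAttribute();
    SketchTermAttribute termAttribute() { return termAtt; }
    abstract boolean incrementToken();
    void reset() {}
    void end() {}
    void close() {}
}

// Splits its input on whitespace and exposes each token through the shared attribute.
final class SketchWhitespaceStream extends SketchTokenStream {
    private final String[] parts;
    private int next = 0;

    SketchWhitespaceStream(String text) {
        String trimmed = text.trim();
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override boolean incrementToken() {
        if (next >= parts.length) return false;
        termAtt.setTerm(parts[next++]); // overwrite the single shared instance
        return true;
    }

    @Override void reset() { next = 0; }
}

final class WorkflowSketch {
    static List<String> consume(SketchTokenStream stream) {
        List<String> terms = new ArrayList<>();
        stream.reset();
        // Keep a local reference to the attribute instead of looking it up per token.
        SketchTermAttribute termAtt = stream.termAttribute();
        try {
            while (stream.incrementToken()) {
                // Copy the value out: the attribute is overwritten by the next call.
                terms.add(termAtt.term());
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SketchWhitespaceStream("quick brown fox")));
        // prints [quick, brown, fox]
    }
}
```

With the real API the consumer would obtain its attribute references through the AttributeSource methods listed in the summary below rather than through a dedicated accessor; the accessor here is only a simplification of the sketch.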
You can find some example code for the new API in the analysis package level Javadoc.
Sometimes it is desirable to capture the current state of a TokenStream,
e.g., for buffering purposes (see CachingTokenFilter,
TeeSinkTokenFilter). For this use case,
AttributeSource#captureState and AttributeSource#restoreState
can be used.
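
The idea behind capturing and restoring state can be illustrated without Lucene. In the sketch below, `SketchAttributes` and `CaptureRestoreSketch` are hypothetical names invented for this example, and the `captureState`/`restoreState` methods are simplified stand-ins that only imitate the purpose of the AttributeSource methods of the same name. Their signatures and return types are assumptions and do not match the real API.

```java
// Hypothetical stand-in for a set of token attributes held in one reused, mutable object.
final class SketchAttributes {
    String term = "";
    int startOffset;
    int endOffset;

    void set(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // Returns an independent snapshot of the current values.
    SketchAttributes captureState() {
        SketchAttributes copy = new SketchAttributes();
        copy.set(term, startOffset, endOffset);
        return copy;
    }

    // Copies the snapshot's values back into this (shared) instance.
    void restoreState(SketchAttributes state) {
        set(state.term, state.startOffset, state.endOffset);
    }
}

final class CaptureRestoreSketch {
    static String demo() {
        SketchAttributes atts = new SketchAttributes();
        atts.set("hello", 0, 5);
        SketchAttributes saved = atts.captureState(); // buffer the first token

        atts.set("world", 6, 11); // the stream advances and overwrites the shared instance
        String afterAdvance = atts.term;

        atts.restoreState(saved); // replay the buffered token
        return afterAdvance + " -> " + atts.term
                + "[" + atts.startOffset + "," + atts.endOffset + "]";
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints world -> hello[0,5]
    }
}
```

Because the attribute instance is reused for every token, holding a plain reference to it would not preserve the earlier token; a snapshot is needed, which is the role the capture step plays here.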
| Constructor: |
|---|
|
|
|
| Method from org.apache.lucene.analysis.TokenStream Summary: |
|---|
| close, end, incrementToken, reset |
| Methods from org.apache.lucene.util.AttributeSource: |
|---|
| addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString |
| Methods from java.lang.Object: |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Method from org.apache.lucene.analysis.TokenStream Detail: |
|---|
|
#end() is called by the consumer after the last token has been consumed,
that is, after #incrementToken() returned false
(using the new TokenStream API). Streams implementing the old API
should upgrade to use this feature.
This method can be used to perform any end-of-stream operations, such as
setting the final offset of a stream. The final offset of a stream might
differ from the offset of the last token, e.g. when one or more whitespace
characters follow the last token and a WhitespaceTokenizer was used. |
#incrementToken() advances the stream to the next token. The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use #captureState to create a copy of the current attribute state. This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid repeated calls to #addAttribute(Class) and #getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation. To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in #incrementToken(). |
#reset() resets this stream to the beginning. If the tokens of a
TokenStream are intended to be consumed more than once, it is
necessary to implement #reset(). Note that if your TokenStream
caches tokens and feeds them back again after a reset, it is imperative
that you clone the tokens when you store them away (on the first pass) as
well as when you return them (on future passes after #reset()). |
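
The remark about the final offset in the end-of-stream description can be made concrete with a small sketch. `FinalOffsetSketch` is a hypothetical whitespace tokenizer written for this illustration, not Lucene's WhitespaceTokenizer. It assumes that `end()` sets both offsets to the length of the input, which is one plausible way to record a final offset.

```java
// Hypothetical whitespace tokenizer sketch; the fields stand in for offset and term attributes.
final class FinalOffsetSketch {
    private final String text;
    private int pos = 0;
    String term = "";
    int startOffset;
    int endOffset;

    FinalOffsetSketch(String text) { this.text = text; }

    boolean incrementToken() {
        // Skip leading whitespace.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return false;
        int start = pos;
        // Consume characters up to the next whitespace.
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        term = text.substring(start, pos);
        startOffset = start;
        endOffset = pos;
        return true;
    }

    // End-of-stream operation: record the final offset (assumed here to be the input length).
    void end() {
        startOffset = text.length();
        endOffset = text.length();
    }

    // Returns { end offset of the last token, final offset after end() }.
    static int[] lastAndFinalOffsets(String text) {
        FinalOffsetSketch stream = new FinalOffsetSketch(text);
        int lastTokenEnd = 0;
        while (stream.incrementToken()) {
            lastTokenEnd = stream.endOffset;
        }
        stream.end();
        return new int[] { lastTokenEnd, stream.endOffset };
    }

    public static void main(String[] args) {
        int[] r = lastAndFinalOffsets("foo bar  ");
        System.out.println(r[0] + " " + r[1]); // prints 7 9
    }
}
```

For the input `"foo bar  "` the last token ends at offset 7, while the final offset is 9 because of the two trailing spaces; this is the difference the description refers to.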