Posted to java-commits@lucene.apache.org by ma...@apache.org on 2009/08/25 16:16:01 UTC
svn commit: r807645 -
/lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java
Author: markrmiller
Date: Tue Aug 25 14:16:00 2009
New Revision: 807645
URL: http://svn.apache.org/viewvc?rev=807645&view=rev
Log:
LUCENE-1760: javadoc improvements for TokenStream
Modified:
lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java
Modified: lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java?rev=807645&r1=807644&r2=807645&view=diff
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java (original)
+++ lucene/java/trunk/src/java/org/apache/lucene/analysis/TokenStream.java Tue Aug 25 14:16:00 2009
@@ -26,48 +26,62 @@
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;
-/** A TokenStream enumerates the sequence of tokens, either from
- fields of a document or from query text.
- <p>
- This is an abstract class. Concrete subclasses are:
- <ul>
- <li>{@link Tokenizer}, a TokenStream
- whose input is a Reader; and
- <li>{@link TokenFilter}, a TokenStream
- whose input is another TokenStream.
- </ul>
- A new TokenStream API is introduced with Lucene 2.9. While Token still
- exists in 2.9 as a convenience class, the preferred way to store
- the information of a token is to use {@link AttributeImpl}s.
- <p>
- For that reason TokenStream extends {@link AttributeSource}
- now. Note that only one instance per {@link AttributeImpl} is
- created and reused for every token. This approach reduces
- object creations and allows local caching of references to
- the {@link AttributeImpl}s. See {@link #incrementToken()} for further details.
- <p>
- <b>The workflow of the new TokenStream API is as follows:</b>
- <ol>
- <li>Instantiation of TokenStream/TokenFilters which add/get attributes
- to/from the {@link AttributeSource}.
- <li>The consumer calls {@link TokenStream#reset()}.
- <li>the consumer retrieves attributes from the
- stream and stores local references to all attributes it wants to access
- <li>The consumer calls {@link #incrementToken()} until it returns false and
- consumes the attributes after each call.
- </ol>
- To make sure that filters and consumers know which attributes are available
- the attributes must be added during instantiation. Filters and
- consumers are not required to check for availability of attributes in {@link #incrementToken()}.
- <p>
- Sometimes it is desirable to capture a current state of a
- TokenStream, e. g. for buffering purposes (see {@link CachingTokenFilter},
- {@link TeeSinkTokenFilter}). For this usecase
- {@link AttributeSource#captureState} and {@link AttributeSource#restoreState} can be used.
+/**
+ * A {@link TokenStream} enumerates the sequence of tokens, either from
+ * {@link Field}s of a {@link Document} or from query text.
+ * <p>
+ * This is an abstract class. Concrete subclasses are:
+ * <ul>
+ * <li>{@link Tokenizer}, a {@link TokenStream} whose input is a Reader; and
+ * <li>{@link TokenFilter}, a {@link TokenStream} whose input is another
+ * {@link TokenStream}.
+ * </ul>
+ * A new {@link TokenStream} API has been introduced with Lucene 2.9. This API
+ * has moved from being {@link Token} based to {@link Attribute} based. While
+ * {@link Token} still exists in 2.9 as a convenience class, the preferred way
+ * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
+ * <p>
+ * {@link TokenStream} now extends {@link AttributeSource}, which provides
+ * access to all of the token {@link Attribute}s for the {@link TokenStream}.
+ * Note that only one instance per {@link AttributeImpl} is created and reused
+ * for every token. This approach reduces object creation and allows local
+ * caching of references to the {@link AttributeImpl}s. See
+ * {@link #incrementToken()} for further details.
+ * <p>
+ * <b>The workflow of the new {@link TokenStream} API is as follows:</b>
+ * <ol>
+ * <li>Instantiation of {@link TokenStream}/{@link TokenFilter}s which add/get
+ * attributes to/from the {@link AttributeSource}.
+ * <li>The consumer calls {@link TokenStream#reset()}.
+ * <li>The consumer retrieves attributes from the stream and stores local
+ * references to all attributes it wants to access.
+ * <li>The consumer calls {@link #incrementToken()} until it returns false and
+ * consumes the attributes after each call.
+ * <li>The consumer calls {@link #end()} so that any end-of-stream operations
+ * can be performed.
+ * <li>The consumer calls {@link #close()} to release any resources when finished
+ * using the {@link TokenStream}.
+ * </ol>
+ * To make sure that filters and consumers know which attributes are available,
+ * the attributes must be added during instantiation. Filters and consumers are
+ * not required to check for availability of attributes in
+ * {@link #incrementToken()}.
+ * <p>
+ * You can find some example code for the new API in the analysis package level
+ * Javadoc.
+ * <p>
+ * Sometimes it is desirable to capture the current state of a
+ * {@link TokenStream}, e.g. for buffering purposes (see
+ * {@link CachingTokenFilter}, {@link TeeSinkTokenFilter}). For this use case
+ * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
+ * can be used.
*/
public abstract class TokenStream extends AttributeSource {
@@ -228,54 +242,67 @@
}
/**
- * For extra performance you can globally enable the new {@link #incrementToken}
- * API using {@link Attribute}s. There will be a small, but in most cases neglectible performance
- * increase by enabling this, but it only works if <b>all</b> TokenStreams and -Filters
- * use the new API and implement {@link #incrementToken}. This setting can only be enabled
+ * For extra performance you can globally enable the new
+ * {@link #incrementToken} API using {@link Attribute}s. There will be a
+ * small, but in most cases negligible performance increase by enabling this,
+ * but it only works if <b>all</b> {@link TokenStream}s use the new API and
+ * implement {@link #incrementToken}. This setting can only be enabled
* globally.
- * <P>This setting only affects TokenStreams instantiated after this call. All TokenStreams
- * already created use the other setting.
- * <P>All core analyzers are compatible with this setting, if you have own
- * TokenStreams/-Filters, that are also compatible, enable this.
- * <P>When enabled, tokenization may throw {@link UnsupportedOperationException}s,
- * if the whole tokenizer chain is not compatible.
- * <P>The default is <code>false</code>, so there is the fallback to the old API available.
- * @deprecated This setting will be <code>true</code> per default in Lucene 3.0,
- * when {@link #incrementToken} is abstract and must be always implemented.
+ * <P>
+ * This setting only affects {@link TokenStream}s instantiated after this
+ * call. All {@link TokenStream}s already created use the other setting.
+ * <P>
+ * All core {@link Analyzer}s are compatible with this setting. If you have
+ * your own {@link TokenStream}s that are also compatible, you should enable
+ * this.
+ * <P>
+ * When enabled, tokenization may throw
+ * {@link UnsupportedOperationException}s if the whole tokenizer chain is not
+ * compatible, e.g. if one of the {@link TokenStream}s does not implement the new API.
+ * <P>
+ * The default is <code>false</code>, so the fallback to the old API remains
+ * available.
+ *
+ * @deprecated This setting will no longer be needed in Lucene 3.0 as the old
+ * API will be removed.
*/
public static void setOnlyUseNewAPI(boolean onlyUseNewAPI) {
TokenStream.onlyUseNewAPI = onlyUseNewAPI;
}
- /** Returns if only the new API is used.
+ /**
+ * Returns whether only the new API is used.
+ *
* @see #setOnlyUseNewAPI
- * @deprecated This setting will be <code>true</code> per default in Lucene 3.0,
- * when {@link #incrementToken} is abstract and must be always implemented.
+ * @deprecated This setting will no longer be needed in Lucene 3.0 as
+ * the old API will be removed.
*/
public static boolean getOnlyUseNewAPI() {
return onlyUseNewAPI;
}
/**
- * Consumers (eg the indexer) use this method to advance the stream
- * to the next token. Implementing classes must implement this method
- * and update the appropriate {@link AttributeImpl}s with content of the
- * next token.
+ * Consumers (e.g. {@link IndexWriter}) use this method to advance the stream to
+ * the next token. Implementing classes must implement this method and update
+ * the appropriate {@link AttributeImpl}s with the attributes of the next
+ * token.
* <p>
* This method is called for every token of a document, so an efficient
- * implementation is crucial for good performance. To avoid calls to
- * {@link #addAttribute(Class)} and {@link #getAttribute(Class)} and
- * downcasts, references to all {@link AttributeImpl}s that this stream uses
- * should be retrieved during instantiation.
+ * implementation is crucial for good performance. To avoid calls to
+ * {@link #addAttribute(Class)} and {@link #getAttribute(Class)} or downcasts,
+ * references to all {@link AttributeImpl}s that this stream uses should be
+ * retrieved during instantiation.
* <p>
- * To make sure that filters and consumers know which attributes are available
- * the attributes must be added during instantiation. Filters and
- * consumers are not required to check for availability of attributes in {@link #incrementToken()}.
+ * To ensure that filters and consumers know which attributes are available,
+ * the attributes must be added during instantiation. Filters and consumers
+ * are not required to check for availability of attributes in
+ * {@link #incrementToken()}.
*
* @return false for end of stream; true otherwise
- *
- * <p>
- * <b>Note that this method will be defined abstract in Lucene 3.0.</b>
+ *
+ * <p>
+ * <b>Note that this method will be defined abstract in Lucene
+ * 3.0.</b>
*/
public boolean incrementToken() throws IOException {
assert !onlyUseNewAPI && tokenWrapper != null;
@@ -293,14 +320,15 @@
}
/**
- * This method is called by the consumer after the last token has been consumed,
- * ie after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
- * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
+ * This method is called by the consumer after the last token has been
+ * consumed, i.e. after {@link #incrementToken()} returned <code>false</code>
+ * (using the new {@link TokenStream} API) or after {@link #next(Token)} or
+ * {@link #next()} returned <code>null</code> (old {@link TokenStream} API).
* <p/>
- * This method can be used to perform any end-of-stream operations, such as setting the final
- * offset of a stream. The final offset of a stream might differ from the offset of the last token
- * eg in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
- * was used.
+ * This method can be used to perform any end-of-stream operations, such as
+ * setting the final offset of a stream. The final offset of a stream might
+ * differ from the offset of the last token, e.g. when whitespace follows the
+ * last token and a {@link WhitespaceTokenizer} was used.
*
* @throws IOException
*/
@@ -308,36 +336,35 @@
// do nothing by default
}
- /** Returns the next token in the stream, or null at EOS.
- * When possible, the input Token should be used as the
- * returned Token (this gives fastest tokenization
- * performance), but this is not required and a new Token
- * may be returned. Callers may re-use a single Token
- * instance for successive calls to this method.
- * <p>
- * This implicitly defines a "contract" between
- * consumers (callers of this method) and
- * producers (implementations of this method
- * that are the source for tokens):
- * <ul>
- * <li>A consumer must fully consume the previously
- * returned Token before calling this method again.</li>
- * <li>A producer must call {@link Token#clear()}
- * before setting the fields in it & returning it</li>
- * </ul>
- * Also, the producer must make no assumptions about a
- * Token after it has been returned: the caller may
- * arbitrarily change it. If the producer needs to hold
- * onto the token for subsequent calls, it must clone()
- * it before storing it.
- * Note that a {@link TokenFilter} is considered a consumer.
- * @param reusableToken a Token that may or may not be used to
- * return; this parameter should never be null (the callee
- * is not required to check for null before using it, but it is a
- * good idea to assert that it is not null.)
- * @return next token in the stream or null if end-of-stream was hit
- * @deprecated The new {@link #incrementToken()} and {@link AttributeSource}
- * APIs should be used instead.
+ /**
+ * Returns the next token in the stream, or null at EOS. When possible, the
+ * input Token should be used as the returned Token (this gives fastest
+ * tokenization performance), but this is not required and a new Token may be
+ * returned. Callers may re-use a single Token instance for successive calls
+ * to this method.
+ * <p>
+ * This implicitly defines a "contract" between consumers (callers of this
+ * method) and producers (implementations of this method that are the source
+ * for tokens):
+ * <ul>
+ * <li>A consumer must fully consume the previously returned {@link Token}
+ * before calling this method again.</li>
+ * <li>A producer must call {@link Token#clear()} before setting the fields in
+ * it and returning it</li>
+ * </ul>
+ * Also, the producer must make no assumptions about a {@link Token} after it
+ * has been returned: the caller may arbitrarily change it. If the producer
+ * needs to hold onto the {@link Token} for subsequent calls, it must clone()
+ * it before storing it. Note that a {@link TokenFilter} is considered a
+ * consumer.
+ *
+ * @param reusableToken a {@link Token} that may or may not be used to return;
+ * this parameter should never be null (the callee is not required to
+ * check for null before using it, but it is a good idea to assert that
+ * it is not null.)
+ * @return next {@link Token} in the stream or null if end-of-stream was hit
+ * @deprecated The new {@link #incrementToken()} and {@link AttributeSource}
+ * APIs should be used instead.
*/
public Token next(final Token reusableToken) throws IOException {
assert reusableToken != null;
@@ -357,12 +384,13 @@
}
}
- /** Returns the next token in the stream, or null at EOS.
- * @deprecated The returned Token is a "full private copy" (not
- * re-used across calls to next()) but will be slower
- * than calling {@link #next(Token)} or using the new
- * {@link #incrementToken()} method with the new
- * {@link AttributeSource} API.
+ /**
+ * Returns the next {@link Token} in the stream, or null at EOS.
+ *
+ * @deprecated The returned Token is a "full private copy" (not re-used across
+ * calls to {@link #next()}) but will be slower than calling
+ * {@link #next(Token)} or using the new {@link #incrementToken()}
+ * method with the new {@link AttributeSource} API.
*/
public Token next() throws IOException {
if (onlyUseNewAPI)
@@ -379,17 +407,15 @@
}
}
- /** Resets this stream to the beginning. This is an
- * optional operation, so subclasses may or may not
- * implement this method. Reset() is not needed for
- * the standard indexing process. However, if the Tokens
- * of a TokenStream are intended to be consumed more than
- * once, it is necessary to implement reset(). Note that
- * if your TokenStream caches tokens and feeds them back
- * again after a reset, it is imperative that you
- * clone the tokens when you store them away (on the
- * first pass) as well as when you return them (on future
- * passes after reset()).
+ /**
+ * Resets this stream to the beginning. This is an optional operation, so
+ * subclasses may or may not implement this method. {@link #reset()} is not needed for
+ * the standard indexing process. However, if the tokens of a
+ * {@link TokenStream} are intended to be consumed more than once, it is
+ * necessary to implement {@link #reset()}. Note that if your TokenStream
+ * caches tokens and feeds them back again after a reset, it is imperative
+ * that you clone the tokens when you store them away (on the first pass) as
+ * well as when you return them (on future passes after {@link #reset()}).
*/
public void reset() throws IOException {}
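
The consumer workflow described in the class Javadoc above (instantiate, reset(), keep local attribute references, loop on incrementToken(), then end() and close()) can be illustrated with a small dependency-free sketch. This is not Lucene code: TokenStreamWorkflowSketch, WhitespaceStream, TermAttr and OffsetAttr are hypothetical stand-ins invented for this example, and they only mimic the shape of the API so that the call order and the reuse of one attribute instance per stream are visible.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Dependency-free sketch of the consumer workflow. All class names here are
 * hypothetical stand-ins for illustration; none of them are Lucene classes.
 */
public class TokenStreamWorkflowSketch {

    /** Stand-in for a term attribute: one mutable instance reused for every token. */
    static final class TermAttr {
        String term = "";
    }

    /** Stand-in for an offset attribute, also reused for every token. */
    static final class OffsetAttr {
        int start;
        int end;
    }

    /** Minimal whitespace-splitting stream that mimics the documented call sequence. */
    static final class WhitespaceStream {
        private final String input;
        private int pos;
        boolean closed;

        // Attributes are created once, during instantiation, and never replaced.
        final TermAttr termAttr = new TermAttr();
        final OffsetAttr offsetAttr = new OffsetAttr();

        WhitespaceStream(String input) {
            this.input = input;
        }

        void reset() {
            pos = 0;
        }

        /** Advances to the next token; returns false at end of stream. */
        boolean incrementToken() {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= input.length()) {
                return false;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            // Update the shared attribute instances in place.
            termAttr.term = input.substring(start, pos);
            offsetAttr.start = start;
            offsetAttr.end = pos;
            return true;
        }

        /** End-of-stream hook: the final offset includes trailing whitespace. */
        void end() {
            offsetAttr.start = input.length();
            offsetAttr.end = input.length();
        }

        void close() {
            closed = true;
        }
    }

    /** Consumer side: follows the documented order of calls. */
    static List<String> consume(WhitespaceStream stream) {
        List<String> terms = new ArrayList<>();
        stream.reset();
        // Store a local reference once instead of looking it up per token.
        TermAttr term = stream.termAttr;
        try {
            while (stream.incrementToken()) {
                terms.add(term.term); // copy the value; the attribute is overwritten next call
            }
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        WhitespaceStream stream = new WhitespaceStream("hello world  ");
        System.out.println(consume(stream));       // prints [hello, world]
        System.out.println(stream.offsetAttr.end); // prints 13
    }
}
```

In the sketch the last token ends at offset 11 while the offset after end() is 13, which corresponds to the trailing-whitespace case mentioned in the end() Javadoc. For working code against the real API, see the analysis package level Javadoc that the commit refers to.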