Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/05/12 15:45:47 UTC

[jira] [Created] (LUCENE-3088) inconsistency of tokenstream.end() with OffsetLimitTokenFilter and LimitTokenCountFilter

inconsistency of tokenstream.end() with OffsetLimitTokenFilter and LimitTokenCountFilter
----------------------------------------------------------------------------------------

                 Key: LUCENE-3088
                 URL: https://issues.apache.org/jira/browse/LUCENE-3088
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir


In LUCENE-3064, we added some state and checks to MockTokenizer to validate that consumers
are properly using the TokenStream workflow (described here: http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/TokenStream.html).

One inconsistency concerns the following steps of that workflow:
4. The consumer calls incrementToken() until it returns false consuming the attributes after each call.
5. The consumer calls end() so that any end-of-stream operations can be performed.

In the case of these limiting filters, end() is called on the Tokenizer *before* the Tokenizer's incrementToken() has returned false. This is strange for two reasons. First, the tokenizer might not be "ready" for end(): for example, it might be coded so that end() only works correctly once the stream has been entirely consumed. Second, the finalOffset, which is the main thing end() is used for, will usually be wrong in this case, so highlighting of multi-valued fields will not work.
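
To make the problem concrete, here is a minimal toy model in plain Java. It does not use the Lucene classes: ToyTokenizer, ToyLimitFilter and consume() are hypothetical stand-ins that only mimic the incrementToken()/end() contract, under the assumption that the limiting filter stops asking its input for tokens once the limit is hit and forwards end() unconditionally.

```java
final class EarlyEndDemo {

    /** Hypothetical stand-in for a Tokenizer: splits on spaces and tracks its position. */
    static class ToyTokenizer {
        private final String text;
        private int pos = 0;                // index of the next unread character
        private boolean exhausted = false;  // true once incrementToken() has returned false
        String term;                        // text of the current token
        int finalOffset = -1;               // set by end(); -1 means end() was never called

        ToyTokenizer(String text) { this.text = text; }

        boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            if (pos >= text.length()) { exhausted = true; return false; }
            int start = pos;
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;
            term = text.substring(start, pos);
            return true;
        }

        /** Records the offset reached so far; only equals the text length once exhausted. */
        void end() { finalOffset = pos; }

        boolean isExhausted() { return exhausted; }
    }

    /** Toy limiting filter: returns false after maxTokens without draining its input. */
    static class ToyLimitFilter {
        final ToyTokenizer input;
        final int maxTokens;
        int count = 0;

        ToyLimitFilter(ToyTokenizer input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        boolean incrementToken() {
            if (count < maxTokens && input.incrementToken()) { count++; return true; }
            return false;
        }

        /** Forwards end() even if the input never returned false. */
        void end() { input.end(); }
    }

    /** Steps 4 and 5 of the workflow, as seen by the consumer. */
    static int consume(ToyLimitFilter stream) {
        while (stream.incrementToken()) { /* a real consumer would read attributes here */ }
        stream.end();
        return stream.input.finalOffset;
    }

    public static void main(String[] args) {
        String text = "one two three four";  // 18 characters
        ToyTokenizer limited = new ToyTokenizer(text);
        System.out.println(consume(new ToyLimitFilter(limited, 2)));   // prints 7
        System.out.println(limited.isExhausted());                     // prints false
        System.out.println(consume(new ToyLimitFilter(new ToyTokenizer(text), 10))); // prints 18
    }
}
```

From the consumer's point of view both runs follow the documented workflow, yet with a limit of 2 the tokenizer sees end() while it is still in the middle of the text and reports 7 instead of 18.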

We should probably figure out a way to address the inconsistency. Some ideas are:
# fixing the javadocs, perhaps documenting that end() could be called at any time, and accepting the fact that the finalOffset will be wrong.
# the limiting filters could consume the rest of the tokens in a while (incrementToken()) loop to ensure totally proper behavior.
# the limiting filters could do something tricky like override end() so that it's not invoked on the Tokenizer in a surprising state. This is still evil but perhaps less evil than calling it "out of order".
# ...
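
As a rough illustration of the second and third ideas, the sketch below uses a toy model in plain Java rather than the Lucene classes. ToyTokenizer, DrainingLimitFilter and GuardedLimitFilter are hypothetical names invented for this example; neither is a proposed patch, and the cost of draining a long stream is ignored.

```java
final class LimitFilterIdeas {

    /** Hypothetical stand-in for a Tokenizer: splits on spaces and tracks its position. */
    static class ToyTokenizer {
        private final String text;
        private int pos = 0;
        int finalOffset = -1;  // -1 means end() was never called

        ToyTokenizer(String text) { this.text = text; }

        boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            if (pos >= text.length()) return false;
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;
            return true;
        }

        void end() { finalOffset = pos; }
    }

    /** Idea 2: once the limit is hit, silently consume the rest of the input. */
    static class DrainingLimitFilter {
        final ToyTokenizer input;
        final int maxTokens;
        int count = 0;
        private boolean done = false;

        DrainingLimitFilter(ToyTokenizer input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        boolean incrementToken() {
            if (done) return false;
            if (count < maxTokens) {
                if (input.incrementToken()) { count++; return true; }
            } else {
                while (input.incrementToken()) { /* discard tokens past the limit */ }
            }
            done = true;  // in both branches the input has now returned false
            return false;
        }

        void end() { input.end(); }  // safe: the input is fully consumed
    }

    /** Idea 3: only forward end() if the input itself has returned false. */
    static class GuardedLimitFilter {
        final ToyTokenizer input;
        final int maxTokens;
        int count = 0;
        private boolean inputExhausted = false;

        GuardedLimitFilter(ToyTokenizer input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        boolean incrementToken() {
            if (count < maxTokens) {
                if (input.incrementToken()) { count++; return true; }
                inputExhausted = true;
            }
            return false;
        }

        void end() {
            if (inputExhausted) input.end();  // otherwise leave the tokenizer untouched
        }
    }

    public static void main(String[] args) {
        String text = "one two three four";  // 18 characters

        DrainingLimitFilter draining = new DrainingLimitFilter(new ToyTokenizer(text), 2);
        while (draining.incrementToken()) { }
        draining.end();
        System.out.println(draining.count + " " + draining.input.finalOffset);  // prints 2 18

        GuardedLimitFilter guarded = new GuardedLimitFilter(new ToyTokenizer(text), 2);
        while (guarded.incrementToken()) { }
        guarded.end();
        System.out.println(guarded.count + " " + guarded.input.finalOffset);    // prints 2 -1
    }
}
```

The draining variant yields a correct final offset at the price of tokenizing text that is then thrown away. The guarded variant never calls end() out of order, but a truncated stream then has no final offset at all in this toy model, so it does not by itself fix the highlighting problem.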

