Posted to dev@lucene.apache.org by "David Smiley (JIRA)" <ji...@apache.org> on 2015/03/28 05:06:52 UTC

[jira] [Created] (LUCENE-6375) Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter & WeightedSpanTermExtractor

David Smiley created LUCENE-6375:
------------------------------------

             Summary: Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter & WeightedSpanTermExtractor
                 Key: LUCENE-6375
                 URL: https://issues.apache.org/jira/browse/LUCENE-6375
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: David Smiley
            Priority: Minor


Way back in LUCENE-2939, the default/standard Highlighter's WeightedSpanTermExtractor (referenced by QueryScorer, used by Highlighter.java) gained a performance setting, maxDocCharsToAnalyze, which limits how much text is processed when looking for phrase queries and wildcards (and some other advanced query types).  Highlighter itself also has a limit by the same name.  The two are not interpreted the same way!

Highlighter loops over tokens and halts early once a token's start offset >= maxDocCharsToAnalyze.  In this light, it's almost as if the input string were truncated to this length, plus a bit beyond to the next tokenization boundary.  The PostingsHighlighter also has a configurable limit it calls "maxLength" (or contentLength) that is conceptually similar but implemented differently because it doesn't tokenize; it uses the start & end offsets from the inverted index to check whether it has reached this configured limit.  FYI: Solr's hl.maxAnalyzedChars is supplied as the configured input to both highlighters in this manner; the FastVectorHighlighter doesn't have a limit.
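Highlighter's interpretation can be sketched as a standalone loop (no Lucene API; the tokens and offsets below are illustrative, not produced by a real analyzer):

```java
// Standalone sketch (no Lucene dependency) of the Highlighter-style limit:
// iterate tokens and halt at the first one whose start offset has reached
// maxDocCharsToAnalyze. Tokens are modeled as {startOffset, endOffset} pairs.
public class HighlighterLimitSketch {

    // Returns how many tokens the Highlighter-style loop would process.
    static int tokensProcessed(int[][] tokenOffsets, int maxDocCharsToAnalyze) {
        int processed = 0;
        for (int[] t : tokenOffsets) {
            if (t[0] >= maxDocCharsToAnalyze) {
                break; // halt early: token starts at or past the limit
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // "quick"(0-5) "brown"(6-11) "fox"(12-15), limit 10:
        // "fox" starts at 12 >= 10, so only the first two tokens are analyzed.
        int[][] tokens = { {0, 5}, {6, 11}, {12, 15} };
        System.out.println(tokensProcessed(tokens, 10)); // prints 2
    }
}
```

Note that a token straddling the limit (e.g. one starting at offset 9 here) is still fully processed, which is the "a bit beyond to the next tokenization boundary" behavior described above.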

Highlighter propagates its configured maxDocCharsToAnalyze to QueryScorer, which in turn propagates it to WeightedSpanTermExtractor.  _WSTE doesn't interpret this the same way as Highlighter or PostingsHighlighter._  It uses an OffsetLimitTokenFilter, which accumulates the difference between the start & end offsets of each token it sees.  That is:
{code:java}
      int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
      offsetCount += offsetLength;
{code}
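The accumulation can be sketched in the same standalone style (no Lucene API; the posInc=0 synonym is modeled simply as two tokens sharing offsets 0-5):

```java
// Standalone sketch of the OffsetLimitTokenFilter-style accounting: before
// each token is emitted, the filter checks the running offsetCount, then adds
// (endOffset - startOffset). A position-increment-0 synonym shares its
// offsets with the original token, so its length is counted twice.
public class OffsetLimitSketch {

    // Returns how many tokens the delta-accumulating filter would emit.
    static int tokensEmitted(int[][] tokenOffsets, int offsetLimit) {
        int offsetCount = 0, emitted = 0;
        for (int[] t : tokenOffsets) {
            if (offsetCount >= offsetLimit) {
                break; // filter's incrementToken returns false here
            }
            offsetCount += t[1] - t[0]; // accumulate the token's offset length
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // "quick"(0-5), synonym "fast"(0-5, posInc=0), "fox"(6-9), limit 10:
        // the synonym double-counts chars 0-5, so the count reaches 10 after
        // two tokens and "fox" is dropped, even though it starts before 10.
        int[][] tokens = { {0, 5}, {0, 5}, {6, 9} };
        System.out.println(tokensEmitted(tokens, 10)); // prints 2
    }
}
```

A start-offset check (as in the Highlighter loop above) would have admitted all three of these tokens, since each starts before offset 10.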

So if your analysis produces a lot of posInc=0 tokens (as mine does), you will likely hit this limit earlier than Highlighter will.  Conversely, if you have very few tokens with tons of whitespace between them, WSTE will index terms that will never be highlighted.  This isn't a big deal, but it should be fixed: the filter should simply check whether the startOffset is >= a configured limit and, if so, return false from its incrementToken.
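The proposed check, and the opposite failure mode (few tokens, lots of whitespace), can be sketched side by side; the offsets are again illustrative:

```java
// Standalone sketch of the proposed fix: stop emitting once startOffset >=
// the limit, matching Highlighter's interpretation, compared against the
// current delta accumulation on a sparse, whitespace-heavy token stream.
public class StartOffsetLimitSketch {

    // Current behavior: accumulate (endOffset - startOffset) per token.
    static int emittedByAccumulation(int[][] tokenOffsets, int limit) {
        int offsetCount = 0, emitted = 0;
        for (int[] t : tokenOffsets) {
            if (offsetCount >= limit) break;
            offsetCount += t[1] - t[0];
            emitted++;
        }
        return emitted;
    }

    // Proposed behavior: incrementToken returns false once startOffset >= limit.
    static int emittedByStartOffset(int[][] tokenOffsets, int limit) {
        int emitted = 0;
        for (int[] t : tokenOffsets) {
            if (t[0] >= limit) break;
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // "foo"(0-3), then ~47 chars of whitespace, "bar"(50-53), limit 40:
        // accumulation has only counted 3 chars, so "bar" is emitted and
        // indexed even though Highlighter would never reach offset 50.
        int[][] tokens = { {0, 3}, {50, 53} };
        System.out.println(emittedByAccumulation(tokens, 40)); // prints 2
        System.out.println(emittedByStartOffset(tokens, 40));  // prints 1
    }
}
```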



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org