Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2016/04/29 20:44:12 UTC

[jira] [Created] (LUCENE-7267) Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps

Dawid Weiss created LUCENE-7267:
-----------------------------------

             Summary: Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps
                 Key: LUCENE-7267
                 URL: https://issues.apache.org/jira/browse/LUCENE-7267
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Dawid Weiss
            Priority: Minor


This took me somewhat by surprise. We have some fairly complex code that uses multivalued fields with explicit token streams (which provide their own offset data).

It was surprising to see that offsets for subsequent values were shifted by 1 compared to what was explicitly provided in the OffsetAttribute. A bit of debugging showed this code inside {{PerField.invert}}:

{code}
      if (analyzed) {
        invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
        invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
      }
{code}

A field with an explicit token stream must still be declared as tokenized, so PerField assumes the field came from an analyzer (when in fact it didn't):

{code}
      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
{code}

While the default position increment gap is 0, the default offset gap isn't -- it's 1, which causes the shift.
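
To make the arithmetic concrete, here is a minimal, Lucene-free sketch of how the base offset could accumulate across the values of a multivalued field. It is a simplified model, not the actual indexing code: the class {{OffsetGapSketch}}, the method {{indexedOffsets}} and its parameters are all hypothetical names. It also assumes that the base advances by the final offset each stream reports at end(); that detail is not quoted above, and the example sets it to 0 so that the gap is the only thing moving the base.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetGapSketch {

    // Default offset gap of 1, as described in this issue.
    static final int DEFAULT_OFFSET_GAP = 1;

    /**
     * Simplified model of offset accumulation over a multivalued field.
     *
     * explicitOffsets[v][t] = {start, end} supplied by the token stream of value v.
     * finalOffsets[v]       = final offset reported by value v's stream at end()
     *                         (an assumption of this sketch).
     */
    static List<int[]> indexedOffsets(int[][][] explicitOffsets, int[] finalOffsets,
                                      boolean analyzed, int offsetGap) {
        List<int[]> indexed = new ArrayList<>();
        int base = 0; // plays the role of invertState.offset
        for (int v = 0; v < explicitOffsets.length; v++) {
            for (int[] token : explicitOffsets[v]) {
                indexed.add(new int[] {base + token[0], base + token[1]});
            }
            base += finalOffsets[v];
            if (analyzed) {
                base += offsetGap; // mirrors the gap line quoted from PerField.invert
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        int[][][] values = {
            {{0, 3}},   // value 0
            {{10, 13}}, // value 1
            {{20, 23}}  // value 2
        };
        int[] finalOffsets = {0, 0, 0};
        // Prints 0-3, 11-14, 22-25: one extra gap per preceding value.
        for (int[] o : indexedOffsets(values, finalOffsets, true, DEFAULT_OFFSET_GAP)) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

In this model the second value ends up shifted by 1 and the third by 2, because the gap is added once after every value. Passing {{analyzed = false}} (or a gap of 0) leaves the explicit offsets untouched.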

Thoughts?


