
Posted to issues@lucene.apache.org by "David Wayne Smiley (Jira)" <ji...@apache.org> on 2019/10/15 14:46:00 UTC

[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

    [ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951982#comment-16951982 ] 

David Wayne Smiley commented on LUCENE-8509:
--------------------------------------------

[~romseygeek] why was this option not added as a new configuration flag?  This is sort of an internal implementation detail, so it's not a big deal, but if it were a flag, it'd be easier for the tests to toggle it.  It's also disappointing to see yet another constructor arg when we already have a bit field for booleans.

Also:
* the CHANGES.txt claims offset adjusting defaults to false, but it actually defaults to true.
* there was no documentation change.  At the least, the javadocs of this class, which show all the other options, should mention this one.
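
To make the bit-field point concrete, here is a minimal sketch of toggling a boolean option through an int flags field rather than an extra constructor arg. The constant names and values are assumptions made for illustration, and ADJUST_OFFSETS in particular is hypothetical; it is not an existing option of the filter.

```java
class FlagSketch {
    // Illustrative flag constants; names and values are assumptions for this sketch.
    static final int GENERATE_WORD_PARTS = 1;
    static final int PRESERVE_ORIGINAL = 1 << 1;
    // Hypothetical flag standing in for the offset-adjusting option discussed above.
    static final int ADJUST_OFFSETS = 1 << 2;

    /** Returns true if the given flag bit is set in flags. */
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    public static void main(String[] args) {
        // Options are combined with bitwise OR into a single int.
        int flags = GENERATE_WORD_PARTS | ADJUST_OFFSETS;
        System.out.println(has(flags, ADJUST_OFFSETS));

        // A test can switch one option off without touching the others.
        int toggled = flags & ~ADJUST_OFFSETS;
        System.out.println(has(toggled, ADJUST_OFFSETS));
        System.out.println(has(toggled, GENERATE_WORD_PARTS));
    }
}
```

With this shape a test can randomize or flip a single bit, which is harder to do when the option only exists as a positional constructor argument.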

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 8.0
>
>         Attachments: LUCENE-8509.patch, LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the beginning of the second token).  The WDGF takes the first token and splits it into two, adjusting the offsets of the second token, so we get "a"[0,1] and "b"[2,3].  The trim filter removes the leading space from the second token, leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space has already been stripped, WDGF sees no need to adjust offsets, and emits the token as-is, resulting in the start offsets of the tokenstream being [0, 2, 1], and the IndexWriter rejecting it.
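
The failure described above comes down to start offsets that are not non-decreasing. The following is a minimal standalone sketch of that check, applied to the start offsets quoted in the description. It does not use Lucene; the class and method names are hypothetical, and it only approximates the kind of validation that leads IndexWriter to reject the stream.

```java
class BackwardsOffsetsSketch {

    /**
     * Returns the index of the first token whose start offset is lower than
     * the previous token's start offset, or -1 if offsets never go backwards.
     */
    static int firstBackwardsOffset(int[] startOffsets) {
        int lastStart = -1;
        for (int i = 0; i < startOffsets.length; i++) {
            if (startOffsets[i] < lastStart) {
                return i;
            }
            lastStart = startOffsets[i];
        }
        return -1;
    }

    public static void main(String[] args) {
        // Start offsets from the description: "a"[0,1], "b"[2,3], "bb"[1,4]
        int[] fromDescription = {0, 2, 1};
        System.out.println(firstBackwardsOffset(fromDescription)); // prints 2

        // A well-ordered stream for comparison.
        int[] ordered = {0, 1, 2};
        System.out.println(firstBackwardsOffset(ordered)); // prints -1
    }
}
```

The third token, "bb" starting at 1, is the one flagged, because the preceding token "b" already starts at 2.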



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
