Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2019/07/15 11:35:00 UTC

[jira] [Commented] (LUCENE-8916) GraphTokenStreamFiniteStrings.FiniteStringsTokenStream does not play well with subsequent TokenFilters

    [ https://issues.apache.org/jira/browse/LUCENE-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885109#comment-16885109 ] 

Alan Woodward commented on LUCENE-8916:
---------------------------------------

Interestingly, the patch attached to LUCENE-8644 will fix this, as it makes GTSFS clone all attributes, rather than just saving terms and playing them back again in a synthetic token stream.

> GraphTokenStreamFiniteStrings.FiniteStringsTokenStream does not play well with subsequent TokenFilters
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8916
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>
> GraphTokenStreamFiniteStrings (GTSFS) provides a view over the multiple paths through a token graph, which is useful when building queries over multi-term synonyms. This view is exposed as an iterator over simple TokenStreams. However, these TokenStreams do not work correctly when further wrapped in token filters, because they do not use a CharTermAttribute.
> For an example of the issues this can cause, see https://github.com/elastic/elasticsearch/issues/43976, where Elasticsearch uses a special shingle field to speed up phrase searches. Queries are converted to shingles if they have multiple terms. However, if the query resolves into a graph due to synonyms, this conversion breaks because the FixedShingleFilter (FSF) is given a token stream built by GTSFS: terms are set using BytesTermAttribute but read using CharTermAttribute, and because these have different backing implementations, FSF ends up emitting null tokens.
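
The mismatch described above can be illustrated with a small, self-contained model. This is only a sketch and does not use the Lucene API: TokenState, replayAsBytes, replayAllAttributes and shingle are hypothetical names invented for illustration, and the two fields merely stand in for BytesTermAttribute and CharTermAttribute having separate backing storage. In this simplified model the downstream reader sees empty terms instead of the null tokens reported in the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only; none of these types are Lucene classes.
final class AttributeMismatchSketch {

    // Hypothetical stand-in for a token's attributes. Each field has its own
    // backing storage, so writing one does not populate the other.
    static final class TokenState {
        byte[] bytesTerm = new byte[0];                      // models BytesTermAttribute
        final StringBuilder charTerm = new StringBuilder();  // models CharTermAttribute
    }

    // Models a synthetic stream that replays a saved term by writing it to
    // the bytes-backed slot only.
    static TokenState replayAsBytes(String term) {
        TokenState state = new TokenState();
        state.bytesTerm = term.getBytes(StandardCharsets.UTF_8);
        return state;
    }

    // Models carrying every attribute over, so the char-backed slot is
    // populated as well (an assumption about what "clone all attributes" implies).
    static TokenState replayAllAttributes(String term) {
        TokenState state = replayAsBytes(term);
        state.charTerm.append(term);
        return state;
    }

    // Models a downstream filter that builds a shingle by reading only the
    // char-backed slot of each token.
    static String shingle(List<TokenState> tokens) {
        List<String> parts = new ArrayList<>();
        for (TokenState token : tokens) {
            parts.add(token.charTerm.toString());
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<TokenState> bytesOnly =
            Arrays.asList(replayAsBytes("new"), replayAsBytes("york"));
        List<TokenState> allAttributes =
            Arrays.asList(replayAllAttributes("new"), replayAllAttributes("york"));

        // The terms are missing here: the filter never looks at bytesTerm.
        System.out.println("[" + shingle(bytesOnly) + "]");
        // The terms survive once the char-backed slot is filled in.
        System.out.println("[" + shingle(allAttributes) + "]");
    }
}
```

The point of the model is that the producer and the consumer each behave correctly in isolation; the terms are lost only because they agree on the token but not on which attribute holds its text.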


