Posted to dev@lucene.apache.org by "Alan Woodward (JIRA)" <ji...@apache.org> on 2019/04/01 14:35:00 UTC

[jira] [Commented] (LUCENE-8717) Handle stop words that appear at articulation points

    [ https://issues.apache.org/jira/browse/LUCENE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806850#comment-16806850 ] 

Alan Woodward commented on LUCENE-8717:
---------------------------------------

{{StopFilter}} extends {{FilteringTokenFilter}}, so it will handle things in the same way.  I've gone back and forth a bit on whether we should use {{TermDeletedAttribute}} all the time or restrict it to just articulation points, and I think we should probably do the latter.  Articulation points only appear in token graphs, and there are a whole bunch of token filters that don't really make sense in that context, so with this change we only need to update a few filters.  If we extend it so that everything needs to understand {{TermDeletedAttribute}}, there are a lot more filters that we need to update (for example, all the summarizing/hashing filters: do we include deleted terms there or not?).

For synonyms, this still doesn't quite work because of the way {{SynonymMap}} builds itself, but I think cutting {{SynonymGraphFilter}} over to use {{GraphTokenFilter}} will make it a lot easier to match multi-token inputs that contain stopwords.  That would be a separate issue, though.

> Handle stop words that appear at articulation points
> ----------------------------------------------------
>
>                 Key: LUCENE-8717
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8717
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8717.patch, LUCENE-8717.patch
>
>
> Our set of TokenFilters currently cannot handle the case where a multi-term synonym starts with a stopword.  This means that, given a synonym file containing the mapping "the walking dead => twd" and a standard English stopword filter, QueryBuilder will produce incorrect queries.
> The tricky part here is that our standard way of dealing with stopwords, which is to remove them entirely from the token stream and use a larger position increment on subsequent tokens, doesn't work when the removed token also has a position length greater than 1.  There are various tricks you can use to increment the position length on the previous token, but they don't work if the stopword is the first token in the token stream, or if there are multiple stopwords in the side path.
> Instead, I'd like to propose adding a new TermDeletedAttribute, which we only use on tokens that should be removed from the stream but which hold necessary information about the structure of the token graph.  These tokens can then be removed by GraphTokenStreamFiniteStrings at query time, and by FlattenGraphFilter at index time.
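
The issue text above can be illustrated with a minimal sketch in plain Java, with no Lucene dependency. Everything in it is an assumption made for illustration: the Token class, its deleted flag, the helper names and the hand-built stream are not Lucene APIs, the stream is not necessarily what SynonymGraphFilter emits, and the flag only stands in for the proposed TermDeletedAttribute. For brevity the sketch flags every stopword, rather than only those that carry graph structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordGraphSketch {

    /** Hypothetical stand-in for a token and its attributes; not a Lucene class. */
    public static final class Token {
        public final String term;
        public final int posInc;      // position increment relative to the previous token
        public final int posLen;      // number of positions the token spans
        public final boolean deleted; // illustrative flag, in the spirit of the proposed attribute

        public Token(String term, int posInc, int posLen, boolean deleted) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.deleted = deleted;
        }
    }

    public static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the"));

    /** Hand-built graph: "twd" spans three positions alongside "the walking dead". */
    public static List<Token> sampleStream() {
        return Arrays.asList(
            new Token("twd", 1, 3, false),
            new Token("the", 0, 1, false),
            new Token("walking", 1, 1, false),
            new Token("dead", 1, 1, false));
    }

    /** Classic handling: drop the stopword and add its increment to the next kept token. */
    public static List<Token> dropStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopwords.contains(t.term)) {
                skipped += t.posInc;
            } else {
                out.add(new Token(t.term, t.posInc + skipped, t.posLen, false));
                skipped = 0;
            }
        }
        return out;
    }

    /** Flag-based handling: keep every token in place and mark stopwords as deleted. */
    public static List<Token> flagStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.term, t.posInc, t.posLen, stopwords.contains(t.term)));
        }
        return out;
    }

    /**
     * Crude graph check: every token must start at the first position of the stream
     * or where some token ends. It only makes sense for graph streams, since a linear
     * stream with an ordinary stopword hole would fail it as well.
     */
    public static boolean isConnected(List<Token> tokens) {
        List<Integer> starts = new ArrayList<>();
        Set<Integer> ends = new HashSet<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            starts.add(pos);
            ends.add(pos + t.posLen);
        }
        if (starts.isEmpty()) {
            return true;
        }
        int first = starts.get(0); // increments are never negative, so this is the minimum
        for (int start : starts) {
            if (start != first && !ends.contains(start)) {
                return false;
            }
        }
        return true;
    }

    /** What a consumer would see once it skips the flagged tokens. */
    public static List<String> visibleTerms(List<Token> tokens) {
        List<String> terms = new ArrayList<>();
        for (Token t : tokens) {
            if (!t.deleted) {
                terms.add(t.term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<Token> stream = sampleStream();
        System.out.println(isConnected(dropStopwords(stream, STOPWORDS)));  // false
        System.out.println(isConnected(flagStopwords(stream, STOPWORDS)));  // true
        System.out.println(visibleTerms(flagStopwords(stream, STOPWORDS))); // [twd, walking, dead]
    }
}
```

In this sketch, dropping "the" leaves "walking" starting at a position that no remaining token leads into, so the side path is cut off from the rest of the graph. Keeping the token and flagging it preserves the structure, and the consumer discards the flagged term only when it reads the stream.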
