Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/11/02 01:02:36 UTC

[jira] [Commented] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

    [ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627246#comment-15627246 ] 

Steve Rowe commented on LUCENE-6664:
------------------------------------

[~mikemccand], I think you're repurposing posincr/poslen on this issue (as node ids) to enable non-lossy query parser interpretation of token streams, so that e.g. tokens from overlapping phrases aren't inappropriately interleaved in generated queries, as in your wtf example on [LUCENE-6582|https://issues.apache.org/jira/browse/LUCENE-6582?focusedCommentId=14592501&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14592501]:

{quote}
 if I have these synonyms:
{noformat}
wtf --> what the fudge
wtf --> wow that's funny
{noformat}
And then I'm tokenizing this:
{noformat}
wtf happened
{noformat}
Before this change (today) I get this crazy sausage incorrectly
matching phrases like "wtf the fudge" and "wow happened funny":
!https://issues.apache.org/jira/secure/attachment/12740491/12740491_before.png!
But after this change, the expanded synonyms become separate paths in
the graph right? So it will look like this?:
!https://issues.apache.org/jira/secure/attachment/12740492/12740492_after.png!
{quote}
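
To make the "sausage" problem concrete, here is a minimal sketch in plain Java (no Lucene classes). It assumes the flattened stream puts wtf/what/wow at position 0, happened/the/that's at position 1, and fudge/funny at position 2; I inferred that layout from the phrases quoted above rather than from the attached image. {{SausageSketch}}, {{flattenedWtf}} and {{matchesPhrase}} are hypothetical names, and the matcher is a toy exact-phrase check, not PhraseQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SausageSketch {
    /** Assumed flattened ("sausage") layout: one set of terms per position. */
    static List<Set<String>> flattenedWtf() {
        List<Set<String>> positions = new ArrayList<Set<String>>();
        positions.add(new HashSet<String>(Arrays.asList("wtf", "what", "wow")));
        positions.add(new HashSet<String>(Arrays.asList("happened", "the", "that's")));
        positions.add(new HashSet<String>(Arrays.asList("fudge", "funny")));
        return positions;
    }

    /** Toy exact-phrase check: word i must occur at position start + i. */
    static boolean matchesPhrase(List<Set<String>> positions, String... words) {
        for (int start = 0; start + words.length <= positions.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < words.length; i++) {
                if (!positions.get(start + i).contains(words[i])) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> sausage = flattenedWtf();
        // Intended match: a whole synonym expansion.
        System.out.println(matchesPhrase(sausage, "what", "the", "fudge"));
        // Unintended matches: terms interleaved across different paths.
        System.out.println(matchesPhrase(sausage, "wtf", "the", "fudge"));
        System.out.println(matchesPhrase(sausage, "wow", "happened", "funny"));
    }
}
```

With only positions to go on, all three checks succeed, which is the interleaving the quote describes.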

An alternative implementation idea I had, which would not change posincr/poslen semantics, is to add a new attribute encoding an entity ID.  Graph-aware producers would assign the same entity ID to all tokens that should be treated as a sequence, and graph-aware consumers would use the entity ID to losslessly interpret the resulting graph.  Here's the wtf example using this scheme:

||token||posInc||posLen||entityID||
|wtf|1|3|0|
|what|0|1|1|
|wow|0|1|2|
|the|1|1|1|
|that's|0|1|2|
|fudge|1|1|1|
|funny|0|1|2|
|happened|1|1|3|

No flattening stage is required.  Non-graph-aware components aren't affected (I think).  And handling QueryParser.autoGeneratePhraseQueries() properly (see LUCENE-7533) would be easy: if more than one token has the same entityID, then those tokens should form a phrase when autoGeneratePhraseQueries=true.
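
As a rough illustration of the consumer side, here is a minimal sketch in plain Java, with no Lucene classes. The {{Token}} holder, the {{entityId}} field, and the {{groupByEntity}} / {{phrases}} helpers are all hypothetical; no such attribute exists yet. It just replays the table above and groups terms by entity ID:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EntityIdSketch {
    /** Hypothetical token carrying the proposed entity ID next to posInc/posLen. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;
        final int entityId;

        Token(String term, int posInc, int posLen, int entityId) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.entityId = entityId;
        }
    }

    /** The wtf table from this comment, in token stream order. */
    static List<Token> wtfExample() {
        return Arrays.asList(
            new Token("wtf", 1, 3, 0),
            new Token("what", 0, 1, 1),
            new Token("wow", 0, 1, 2),
            new Token("the", 1, 1, 1),
            new Token("that's", 0, 1, 2),
            new Token("fudge", 1, 1, 1),
            new Token("funny", 0, 1, 2),
            new Token("happened", 1, 1, 3));
    }

    /** Groups terms by entity ID, keeping stream order within each entity. */
    static Map<Integer, List<String>> groupByEntity(List<Token> tokens) {
        Map<Integer, List<String>> groups = new LinkedHashMap<Integer, List<String>>();
        for (Token t : tokens) {
            List<String> terms = groups.get(t.entityId);
            if (terms == null) {
                terms = new ArrayList<String>();
                groups.put(t.entityId, terms);
            }
            terms.add(t.term);
        }
        return groups;
    }

    /** Entities with more than one token are the phrase candidates. */
    static List<String> phrases(List<Token> tokens) {
        List<String> out = new ArrayList<String>();
        for (List<String> terms : groupByEntity(tokens).values()) {
            if (terms.size() > 1) {
                out.add(String.join(" ", terms));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints {0=[wtf], 1=[what, the, fudge], 2=[wow, that's, funny], 3=[happened]}
        System.out.println(groupByEntity(wtfExample()));
        // prints [what the fudge, wow that's funny]
        System.out.println(phrases(wtfExample()));
    }
}
```

The sketch ignores posInc/posLen when grouping, which is the point: the entity ID alone is enough to recover each sequence, so the interleaved order in the stream doesn't matter.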

I haven't written any code yet, so I'm not sure this idea is feasible.

Thoughts?

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.


