Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/11/02 00:08:58 UTC
[jira] [Updated] (LUCENE-7533) Classic query parser: autoGeneratePhraseQueries=true doesn't work when splitOnWhitespace=false

     [ https://issues.apache.org/jira/browse/LUCENE-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-7533:
-------------------------------
    Attachment: LUCENE-7533.patch

Patch that addresses some of this issue, with some failing tests and nocommits.

The existing autoGeneratePhraseQueries=true approach generates queries exactly as if the query had contained quotation marks, but as I mentioned above, this is inappropriate when splitOnWhitespace=false and the query text contains spaces.

The approach in the patch is to add a new QueryBuilder method to handle the autoGeneratePhraseQueries=true case.  The query text is split on whitespace and these tokens' offsets are compared to those produced by the configured analyzer.  When multiple non-overlapping tokens have offsets within the bounds of a single whitespace-separated token, a phrase query is created.  If the original token is present as a token overlapping with the first split token, then a disjunction query is created with the original token and the phrase query of the split tokens.
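To make the offset comparison concrete, here is a minimal, Lucene-free sketch of the idea. It is not the code in the attached patch: the class `AutoPhraseSketch`, the `Tok` type, and the string output format are all hypothetical stand-ins, and the sketch assumes the analyzed tokens arrive sorted by start offset and simply ignores overlapping split tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoPhraseSketch {
    /** Hypothetical stand-in for an analyzed token: its text plus character offsets into the query text. */
    public static final class Tok {
        final String text;
        final int start;
        final int end;

        public Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns a readable rendering of the query that would be built (assumed format, for illustration only). */
    public static String build(String queryText, List<Tok> analyzed) {
        List<String> clauses = new ArrayList<>();
        int n = queryText.length();
        int i = 0;
        while (i < n) {
            if (Character.isWhitespace(queryText.charAt(i))) {
                i++;
                continue;
            }
            // Bounds [start, end) of one whitespace-separated token of the raw query text.
            int start = i;
            while (i < n && !Character.isWhitespace(queryText.charAt(i))) {
                i++;
            }
            int end = i;

            String whole = null;                    // analyzed token covering the entire chunk, if any
            List<String> parts = new ArrayList<>(); // non-overlapping split tokens inside the chunk
            int lastPartEnd = start;
            for (Tok t : analyzed) {
                if (t.start < start || t.end > end) {
                    continue; // token lies outside this chunk
                }
                if (t.start == start && t.end == end) {
                    whole = t.text;
                } else if (t.start >= lastPartEnd) {
                    parts.add(t.text);
                    lastPartEnd = t.end;
                }
            }

            if (parts.size() >= 2) {
                // Several split tokens inside one chunk: auto-generate a phrase.
                String phrase = "\"" + String.join(" ", parts) + "\"";
                // If the original token was also kept, OR it with the phrase.
                clauses.add(whole == null ? phrase : "(" + whole + " OR " + phrase + ")");
            } else if (whole != null) {
                clauses.add(whole);
            } else if (parts.size() == 1) {
                clauses.add(parts.get(0));
            }
        }
        return String.join(" ", clauses);
    }
}
```

With tokens `wi-fi[0,5]`, `wi[0,2]`, `fi[3,5]`, `router[6,12]` for the query text `wi-fi router`, this sketch yields `(wi-fi OR "wi fi") router`; with `some[0,4]`, `words[5,10]` for `some words`, it yields two plain terms and no phrase.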

I've added a couple of tests that show posincr/poslength/offset output from SynonymFilter and WordDelimiterFilter (likely the two most frequently used analysis components that can create split tokens), and both create corrupt token graphs of various kinds (e.g. LUCENE-6582, LUCENE-5051), so solving this problem in a complete way just isn't possible right now.

So I'm not happy with the approach in the patch.  It only covers a subset of possible token graphs (e.g. more than one overlapping multi-term synonym doesn't work).  And it's a lot of new code solving a problem that AFAIK no user has reported (does anybody even use autoGeneratePhraseQueries=true with classic QP?).

I'd be much happier if we could somehow get TermAutomatonQuery hooked into the query parsers, and then rewrite to simpler queries if possible: LUCENE-6824.  The first step, though, is fixing SynonymFilter and friends so that they produce non-broken token graphs.  Attempts to do this for SynonymFilter have stalled: LUCENE-6664.  (I have a germ of an idea that might break the logjam - I'll post over there.)

For this issue, maybe instead of my patch, for now, we just disallow autoGeneratePhraseQueries=true when splitOnWhitespace=false.
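One way the "disallow" option could look is sketched below, assuming the check lives in the option setters. `ParserOptionsSketch` is a hypothetical stand-alone class, not the classic QueryParser; the default values and exception messages are likewise assumptions made for the illustration.

```java
public class ParserOptionsSketch {
    // Defaults chosen for this sketch only.
    private boolean splitOnWhitespace = true;
    private boolean autoGeneratePhraseQueries = false;

    public void setSplitOnWhitespace(boolean value) {
        // Reject the combination whichever option is set second.
        if (!value && autoGeneratePhraseQueries) {
            throw new IllegalArgumentException(
                "splitOnWhitespace=false is disallowed when autoGeneratePhraseQueries=true");
        }
        this.splitOnWhitespace = value;
    }

    public void setAutoGeneratePhraseQueries(boolean value) {
        if (value && !splitOnWhitespace) {
            throw new IllegalArgumentException(
                "autoGeneratePhraseQueries=true is disallowed when splitOnWhitespace=false");
        }
        this.autoGeneratePhraseQueries = value;
    }

    public boolean getSplitOnWhitespace() {
        return splitOnWhitespace;
    }

    public boolean getAutoGeneratePhraseQueries() {
        return autoGeneratePhraseQueries;
    }
}
```

Checking in both setters means the failure shows up at configuration time, regardless of the order in which the two options are set, rather than as a surprising query at parse time.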

Thoughts?

> Classic query parser: autoGeneratePhraseQueries=true doesn't work when splitOnWhitespace=false
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7533
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7533
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 6.2, 6.3, 6.2.1
>            Reporter: Steve Rowe
>         Attachments: LUCENE-7533.patch
>
>
> LUCENE-2605 introduced the classic query parser option to not split on whitespace prior to performing analysis.
> When splitOnWhitespace=false, the output from analysis can now come from multiple whitespace-separated tokens, which breaks code assumptions when autoGeneratePhraseQueries=true: for this combination of options, it's not appropriate to auto-quote multiple non-overlapping tokens produced by analysis.  E.g. simple whitespace tokenization over the query "some words" will produce the token sequence ("some", "words"), and even when autoGeneratePhraseQueries=true, we should not be creating a phrase query here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
