Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2017/03/16 23:31:41 UTC
[jira] [Comment Edited] (SOLR-9185) Solr's edismax and "Lucene"/standard query parsers should not split on whitespace before sending terms to analysis

    [ https://issues.apache.org/jira/browse/SOLR-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929124#comment-15929124 ] 

Steve Rowe edited comment on SOLR-9185 at 3/16/17 11:30 PM:
------------------------------------------------------------

Patch addressing the remaining issues.  Precommit and all Solr tests pass.  I plan on committing this shortly so that it will make the 6.5 release.

Both edismax and the standard query parser are covered.  I did not add this feature to the dismax parser (or to any other Solr query parsers); if people want this feature added elsewhere, we can do that under another issue.

Some implementation notes:

* As noted in previous comments on this issue, the feature is activated by specifying request param {{sow=false}}.  By default, {{sow=true}}; there is no behavior change at all if the {{sow}} param is not specified.
* I ran {{TestSolrQueryParser.testParsingPerformance()}} under three conditions: a) unpatched; b) patched using the default behavior (same as {{sow=true}}); and c) patched with {{sow=false}} to activate the don't-split-on-whitespace code.  The best-of-ten results run in a bash loop on my Linux box show all three within about 0.5% of each other's QPS (likely noise): between 91K and 92K QPS.  Average-of-ten puts the two patched conditions at roughly 2% slower (88K vs. 90K QPS).  I think this is acceptable.
* When per-field query structures differ, e.g. when one field's analyzer removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery structure when {{sow=false}} differs from that produced when {{sow=true}}.  Briefly, {{sow=true}} produces a boolean query containing one dismax query per query term, while {{sow=false}} produces a dismax query containing one boolean query per field. Min-should-match processing does (what I think is) the right thing here. See {{TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis()}} for some examples of this. *Note*: when {{sow=false}} and all queried fields' query structure is the same, edismax does what it has always done: produce a boolean query containing one dismax query per term.
* There is a new test suite {{TestMultiWordSynonyms}} that shows multi-term source synonyms matching at query-time.
* In order to deal with the set query changes introduced by SOLR-9786, I extended {{SolrQueryParserBase.RawQuery}} to hold an array of terms, to enable their later consumption as either a concatenated string (for tokenized fields) or individually (for non-tokenized fields).
* As noted on LUCENE-7533 for Lucene's classic query parser (and equally applicable to the Solr standard and edismax query parsers), {{split-on-whitespace=false}} and {{autoGeneratePhraseQueries=true}} don't play well together at present.  I've introduced a new exception {{QueryParserConfigurationException}} that will be thrown if any queried field is configured with {{autoGeneratePhraseQueries=true}} when the {{sow=false}} request param is specified.  For edismax, this is a departure: it's supposed to never throw exceptions.  I think that's okay for now though, since this is an optional/experimental feature.  Maybe when {{sow=false}} becomes the default (later, under another issue - see below), edismax should just log a warning and produce a query that excludes fields with this problem?
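
To make the edismax structure note above concrete, here is a toy Python model of the two query shapes. It is only a sketch, not Solr code: the field names ({{title}}, {{body}}), the stopword set, and the string rendering of the queries are all assumptions made for illustration. It assumes one field's analyzer drops the stopword "the" and the other's keeps it.

```python
# Toy model of the two edismax query shapes; NOT Solr's real query classes.
# Field names and the stopword sets are hypothetical.
STOPWORDS = {"title": {"the"}, "body": set()}  # "title" removes stopwords, "body" does not


def analyze(field, text):
    """Whitespace-tokenize and drop the given field's stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS[field]]


def sow_true(query, fields):
    """Split first: a boolean query holding one dismax clause per query term."""
    clauses = []
    for term in query.split():
        per_field = [f"{f}:{t}" for f in fields for t in analyze(f, term)]
        if per_field:
            clauses.append("dismax(" + " | ".join(per_field) + ")")
    return "bool(" + " ".join(clauses) + ")"


def sow_false(query, fields):
    """Analyze the whole string per field: a dismax holding one boolean query per field."""
    per_field = []
    for f in fields:
        terms = [f"{f}:{t}" for t in analyze(f, query)]
        if terms:
            per_field.append("bool(" + " ".join(terms) + ")")
    return "dismax(" + " | ".join(per_field) + ")"


print(sow_true("the cat", ["title", "body"]))
print(sow_false("the cat", ["title", "body"]))
```

In this model the split-first path yields one dismax clause per term, and the clause for "the" only mentions {{body}}. The whole-string path yields one boolean clause per field, so the two fields can carry different numbers of terms.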

After this has been committed, I'll make a new issue to switch the default behavior on 7.0/master to {{sow=false}}.


> Solr's edismax and "Lucene"/standard query parsers should not split on whitespace before sending terms to analysis
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9185
>                 URL: https://issues.apache.org/jira/browse/SOLR-9185
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>         Attachments: SOLR-9185.patch, SOLR-9185.patch, SOLR-9185.patch, SOLR-9185.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across whitespace boundaries:
> n-gram analysis
> shingles
> synonyms (especially multi-word for whitespace-separated languages)
> languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but
> in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
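
The multi-word synonym case in the description above can be sketched with a toy analyzer. This is an illustration only: the synonym rule ("new york" => "ny") and every function name are hypothetical, and the analyzer is a stand-in for a real analysis chain rather than anything from Lucene or Solr.

```python
# Toy analyzer with one multi-word synonym rule; the rule and names are hypothetical.
SYNONYMS = {("new", "york"): "ny"}


def analyze(text):
    """Tokenize on whitespace, then add a synonym for any matching two-token sequence."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            out.extend([*pair, SYNONYMS[pair]])  # keep the originals and add the synonym
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_split_on_whitespace(query):
    """Each whitespace-separated chunk gets its own independent analysis pass."""
    return [tok for chunk in query.split() for tok in analyze(chunk)]


def parse_whole_string(query):
    """The full query text goes through analysis in one pass."""
    return analyze(query)


print(parse_split_on_whitespace("New York pizza"))
print(parse_whole_string("New York pizza"))
```

When the parser splits first, the analyzer only ever sees "new" and "york" one at a time, so the two-token rule never fires. When the whole string is analyzed together, the synonym is added.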


