Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/07/20 17:06:20 UTC
[jira] [Updated] (LUCENE-7315) Flexible "standard" query parser parses on whitespace

     [ https://issues.apache.org/jira/browse/LUCENE-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-7315:
-------------------------------
    Attachment: LUCENE-7315.patch

WIP patch against master, generated files not included ({{ant javacc-flexible}} in {{lucene/queryparser/}} will generate them), still has nocommits and failing tests.

In addition to making it possible to skip splitting on whitespace before text analysis, the patch includes the following changes:

* Changed {{TermQueryNode}}'s {{positionIncrement}} name to {{position}}, since that's what it really holds.
* {{SynonymQueryNode}}/{{Builder}} now produces a {{SynonymQuery}} instead of a boolean query.
* Refactored {{AnalyzerQueryNodeProcessor.postProcessNode()}} into shorter methods and made it simpler and easier to follow.
* Moved split-on-whitespace tests to the shared {{QueryParserTestBase}}.

Some challenges remain:

* Unlike the classic QP, the flexible standard QP appears to remove a top-level MUST boolean query, e.g. {{+(word)}} -> {{word}}.  Some of the split-on-whitespace shared tests will need to be specialized for each parser.
* When not splitting on whitespace, text containing whitespace produces a boolean query, and there's no simple way to collapse that query's children into their ancestor boolean query (if there is one), so some of the shared split-on-whitespace tests are failing.
** The patch includes a {{FlattenQueryNodeProcessor}} meant to address this issue, but it's not working and I haven't figured out why yet.
* Recent master-only changes will likely make the branch_6x backport non-trivial, e.g. LUCENE-7347.
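
To illustrate the flattening challenge above, here is a minimal sketch of collapsing a nested boolean node into its ancestor. The {{Node}} type and {{flatten()}} method are hypothetical and use no Lucene classes; this is not the patch's {{FlattenQueryNodeProcessor}}. It also assumes every boolean node may be merged freely, whereas a real processor would have to check that modifiers and operators are compatible first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenSketch {

    // Hypothetical minimal node type; Lucene's real QueryNode API is richer.
    static final class Node {
        final String term;          // non-null for leaf terms, null for boolean nodes
        final List<Node> children;  // populated only for boolean nodes

        private Node(String term, List<Node> children) {
            this.term = term;
            this.children = children;
        }

        static Node term(String text) {
            return new Node(text, new ArrayList<>());
        }

        static Node bool(Node... kids) {
            return new Node(null, new ArrayList<>(Arrays.asList(kids)));
        }

        boolean isBoolean() {
            return term == null;
        }

        @Override
        public String toString() {
            if (!isBoolean()) {
                return term;
            }
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(children.get(i));
            }
            return sb.append(')').toString();
        }
    }

    // Returns a copy of the tree in which each nested boolean node is
    // replaced by its own (already flattened) children.
    // Assumption: all boolean nodes are mergeable; modifiers are ignored.
    static Node flatten(Node node) {
        if (!node.isBoolean()) {
            return node;
        }
        List<Node> flat = new ArrayList<>();
        for (Node child : node.children) {
            Node flattened = flatten(child);
            if (flattened.isBoolean()) {
                flat.addAll(flattened.children);
            } else {
                flat.add(flattened);
            }
        }
        return new Node(null, flat);
    }

    public static void main(String[] args) {
        // "b c" stands for text that was analyzed as one unit and produced
        // its own boolean node inside the surrounding query.
        Node parsed = Node.bool(Node.term("a"),
                Node.bool(Node.term("b"), Node.term("c")),
                Node.term("d"));
        System.out.println(parsed);          // (a (b c) d)
        System.out.println(flatten(parsed)); // (a b c d)
    }
}
```

The sketch builds a new tree rather than mutating in place, which keeps the recursion easy to reason about; whether that suits the flexible parser's processor pipeline is an open question.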

> Flexible "standard" query parser parses on whitespace
> -----------------------------------------------------
>
>                 Key: LUCENE-7315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7315
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/queryparser
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>         Attachments: LUCENE-7315.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each whitespace-separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across whitespace boundaries:
> n-gram analysis
> shingles
> synonyms (especially multi-word for whitespace-separated languages)
> languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
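
The effect described in the issue can be sketched without any Lucene classes. The toy {{analyze()}} method below is a hypothetical stand-in for an analysis chain that emits unigrams plus two-word shingles; {{splitThenAnalyze()}} mimics a parser that splits on whitespace before analysis. Both names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WhitespaceSplitSketch {

    // Toy stand-in for an analysis chain (not a Lucene API): lowercases,
    // tokenizes on whitespace, and emits unigrams plus two-word shingles.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    // Mimics the behaviour in the issue description: the parser splits on
    // whitespace first, so each chunk gets its own independent analysis.
    static List<String> splitThenAnalyze(String queryText) {
        List<String> out = new ArrayList<>();
        for (String chunk : queryText.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "New York City";
        // Each chunk is a single word, so no shingle can ever form.
        System.out.println(splitThenAnalyze(query)); // [new, york, city]
        // Analyzing the whole text lets shingles span the whitespace.
        System.out.println(analyze(query)); // [new, new york, york, york city, city]
    }
}
```

The same reasoning applies to multi-word synonyms and n-grams: any filter that needs to see adjacent words has nothing to work with once the text has been pre-split.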


