
Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2016/07/01 01:40:11 UTC

[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace

     [ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe updated LUCENE-2605:
-------------------------------
    Attachment: LUCENE-2605.patch

Okay, really final patch.  On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked.  The patch brings this grammar simplification back here too: the rules mention whitespace far less often, and fewer (and less complicated) lookaheads are required.

I've included the generated files in the patch.

No tests changed from the last patch.

All Lucene tests pass, and precommit passes.

> queryparser parses on whitespace
> --------------------------------
>
>                 Key: LUCENE-2605
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2605
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>            Reporter: Robert Muir
>            Assignee: Steve Rowe
>         Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch
>
>
> The queryparser parses input on whitespace, and sends each whitespace separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across whitespace boundaries:
> * n-gram analysis
> * shingles 
> * synonyms (especially multi-word for whitespace-separated languages)
> * languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their charfilters/tokenizers/tokenfilters will do the same thing at index and query time, but
> in many cases they can't. Instead, preferably the queryparser would parse around only real 'operators'.
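
To illustrate the problem described in the issue, here is a minimal Python sketch. It is not Lucene code: `shingles`, `analyze`, `parse_splitting_on_whitespace` and `parse_whole_text` are hypothetical names invented for this example, and the "analyzer" is a toy that lowercases, splits on whitespace and adds two-token shingles. It only shows why an analysis chain that needs to see adjacent tokens behaves differently when each whitespace-separated chunk is analyzed on its own.

```python
def shingles(tokens, size=2):
    """Toy shingle filter: keep each token and add every run of `size` adjacent tokens."""
    out = list(tokens)
    out += [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return out


def analyze(text):
    """Toy analysis chain (an assumption, not a Lucene Analyzer): lowercase, split, shingle."""
    return shingles(text.lower().split())


def parse_splitting_on_whitespace(query):
    """Behaviour described in the issue: each whitespace-separated chunk
    is sent to its own independent analysis pass."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


def parse_whole_text(query):
    """Desired behaviour: the analysis chain sees the whole run of text at once."""
    return analyze(query)


if __name__ == "__main__":
    query = "New York City"
    print(parse_splitting_on_whitespace(query))
    print(parse_whole_text(query))
```

With the per-chunk approach the shingle step never sees two tokens together, so no shingles are produced at query time even though the same chain would produce them at index time. The same reasoning applies to multi-word synonyms and n-grams that span whitespace.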


