Posted to dev@lucene.apache.org by "Itamar Syn-Hershko (JIRA)" <ji...@apache.org> on 2010/05/15 21:29:42 UTC

[jira] Created: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

QueryParser should ignore double-quotes if mid-word
---------------------------------------------------

                 Key: LUCENE-2465
                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
             Project: Lucene - Java
          Issue Type: Bug
          Components: QueryParser
    Affects Versions: 3.0.1, 3.0, 2.9.2, 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.3, Flex Branch, 3.0.2, 3.1, 4.0
            Reporter: Itamar Syn-Hershko


The current implementation of Lucene's QueryParser identifies a phrase in the query when it hits a double-quote character, even if it is mid-word. For example, the string ' Foo"bar test" ' will produce a BooleanQuery holding one term and one PhraseQuery ("bar test"). This behavior is somewhat flawed: a phrase is a group of words surrounded by double quotes, as defined by http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does it say that double quotes will also tokenize the input. Arguably, a phrase should only be identified as such when it is also surrounded by whitespace.

Beyond being logically incorrect, this behavior makes parsing Hebrew acronyms impossible. Hebrew acronyms contain one double-quote character in the middle of a word (for example, MNK"L). The QP treats it as a phrase opening, essentially splitting the acronym in two, and then throws a syntax exception because it expects another double quote to close the phrase.

The solution to this is pretty simple: change the JavaCC syntax to check whether whitespace precedes the double quote when a phrase opening is expected, or peek to see whether whitespace follows the double quote when a phrase closing is expected.

This will both eliminate a logically incorrect behavior that shouldn't be relied on anyway and allow Hebrew queries to be parsed correctly even when they contain acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867997#action_12867997 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

{quote}
But instead of looking for whitespace, a quote would only be considered an opening quote if it wasn't preceded by a letter or number or backslash.
{quote}

Great, now you break phrases for complex scripts, because "letter" or "number" doesn't apply (e.g. Hindi/Thai have non-spacing vowels).

This is why I say the only solution is to follow Unicode. Adding hacks like this will only break other languages.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868003#action_12868003 ] 

Michael McCandless commented on LUCENE-2465:
--------------------------------------------

OK that's a good point -- saying the quote must be preceded by "whitespace" is in fact wrong for non-whitespace languages.  Meaning, if we made this change we'd actually break String -> PhraseQuery conversions that the QP does correctly today.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Itamar Syn-Hershko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868183#action_12868183 ] 

Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

bq. This is why i say, the only solution is to follow unicode. Adding hacks like this will only break other languages.

Problem is, Hebrew parsing has been broken for a long time now, and this still needs fixing. I don't think you should be forcing extra pre-handling for Hebrew or Bengali (or other) queries, just to keep CJK parsing working out of the box. Escaping those cases by the caller is a much more complex operation than a normal escape you'd do on your queries.

For languages where a colon is being used as a character, if indeed the use case is the same as mid-word gershayim (i.e. there's no key for that letter and it is more of a letter than a punctuation char), the issue with the QP is the same.

If the solution I proposed initially hadn't caused other issues with CJK phrases, I'd insist on it. You are obviously right that this change would break functionality for those languages, but you are wrong in claiming it is not up to the query parser to resolve. As Shai has already pointed out, the QP should parse based on syntax with the smallest hassle to the consumer.

Obviously, a solution has to be provided, and it is agreed it should not affect the variety of supported languages. How about creating this functionality and leaving it as optional? For CJK you'd leave it off, while for all other languages (English and European) you could turn it on and feel no difference in the worst-case scenario.

Or, you could have this setting accessible from your Analyzer. Analyzers define the core's behavior per language, and as such it would make sense to make the QP check with the analyzer which cases are a syntax error and which aren't.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868006#action_12868006 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

Also, by the way there are foldings and processing in contrib/ICU to help with this use case specifically.

# ICUTokenizer contains a tailored Hebrew grammar, that *only* applies to Hebrew script. It adds the double quote " to MidLetter and the single quote ' to Extend only for Hebrew context. Of course, geresh and gershayim work the same way.
# ICUFoldingFilter then contains logic to cause geresh and ', gershayim and " to conflate.

So, I recommend this approach:

All you have to do is disambiguate at query time whether the double quote should start a phrase query or be part of an acronym, and in the latter case map it to the correct Unicode character (gershayim). You can use a regex or dictionary-based approach, or user feedback, or whatever approach you feel is best.

At index and query-time, if you use these components (or use similar mappings/handling in your own tokenstreams), all forms will conflate to the same terms and match during your search. You are just disambiguating to the correct unicode characters to cause the parse that you want.
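
A minimal sketch of the regex-based variant of this disambiguation, assuming that a double quote sitting directly between two Hebrew letters (U+05D0 to U+05EA) is always an acronym mark. The class and method names are hypothetical and are not part of Lucene or contrib/ICU; real queries may well need a dictionary check on top of this.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene API: rewrites ASCII double quotes that
// look like Hebrew acronym marks into U+05F4 HEBREW PUNCTUATION GERSHAYIM
// before the string is handed to the query parser.
final class GershayimMapper {
    // Assumption: a quote with a Hebrew letter on both sides is mid-word.
    private static final Pattern MID_WORD_QUOTE =
        Pattern.compile("(?<=[\\u05D0-\\u05EA])\"(?=[\\u05D0-\\u05EA])");

    static String mapAcronymQuotes(String query) {
        return MID_WORD_QUOTE.matcher(query).replaceAll("\u05F4");
    }

    public static void main(String[] args) {
        // The acronym from the issue description, written in Hebrew letters.
        System.out.println(mapAcronymQuotes("\u05DE\u05E0\u05DB\"\u05DC"));
        // A quoted phrase: its quotes touch a letter on one side only,
        // so they are left alone.
        System.out.println(mapAcronymQuotes("\"\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD\""));
    }
}
```

The lookbehind and lookahead keep the surrounding letters out of the match, so only the quote itself is replaced. Quotes that open or close a phrase at a word edge have a Hebrew letter on at most one side and pass through unchanged.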



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867906#action_12867906 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

As I stated before, this is not a bug.

The double quote is not a MidLetter according to Unicode, so this is not a Unicode violation.

You should use QueryParser.escape(), or use the correct unicode character:
U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM

Which is defined as MidLetter and will work correctly in StandardTokenizer (LUCENE-2167) also.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867994#action_12867994 ] 

Yonik Seeley commented on LUCENE-2465:
--------------------------------------

The Lucene query parser isn't really meant to directly handle casual user queries, but is rather meant to have a more rigid syntax.  Even if the QP was modified to ignore a quote in the middle of a word, it seems one would still need to make a pass on the string to escape other potentially invalid syntax elements?

Still, for the expert user directly entering Lucene syntax, the proposed change could perhaps make life a little easier, so I'm not against it.  But instead of looking for whitespace, a quote would only be considered an opening quote if it wasn't preceded by a letter or number or backslash.

We can't use whitespace because we need to support syntax like
{code}
("hi")
+"hi"
field:"hi"
{code}
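
The rule described above could be sketched roughly as follows. This is only an illustration written outside the JavaCC grammar: the class and method are hypothetical, and "letter or number" is approximated with Character.isLetterOrDigit, which is an assumption about what the rule would mean in practice.

```java
// Hypothetical illustration of the "not preceded by a letter, number or
// backslash" rule; this is not code from the Lucene QueryParser.
final class QuoteHeuristic {
    /**
     * Returns true if the character at index i is a double quote that the
     * heuristic would accept as a phrase opening.
     */
    static boolean isOpeningQuote(String query, int i) {
        if (query.charAt(i) != '"') {
            return false;
        }
        if (i == 0) {
            return true; // nothing precedes the quote
        }
        // codePointBefore also handles characters outside the BMP.
        int prev = query.codePointBefore(i);
        return !(Character.isLetterOrDigit(prev) || prev == '\\');
    }

    public static void main(String[] args) {
        String[] samples = { "(\"hi\")", "+\"hi\"", "field:\"hi\"", "MNK\"L" };
        for (String s : samples) {
            System.out.println(s + " -> " + isOpeningQuote(s, s.indexOf('"')));
        }
    }
}
```

With this check the three syntax examples still open a phrase, while the quote in MNK"L does not. Note that Character.isLetterOrDigit returns false for combining marks, so a quote that follows a word ending in a non-spacing vowel sign would be classified as an opening quote; that is the kind of script-specific breakage objected to elsewhere in this thread.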


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Itamar Syn-Hershko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867982#action_12867982 ] 

Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

Using QueryParser.escape() is not an option, since by doing that I practically prevent the QP from ever returning a PhraseQuery for user queries (it just escapes every occurrence of a QP syntax char).

Your other suggestion of using the "correct" Unicode char GERSHAYIM is not doable, because we are talking about user-typed queries here, and no user has such a character on his keyboard. In 99.9% of Hebrew text files, old and new, the double quote is used as GERSHAYIM. The only exceptions are when an automated program has converted the mid-word instance of a double quote into U+05F4. This is pretty much like asking the Lucene community to type U+201C and U+201D (left / right double quotation marks) around phrases or they won't be recognized as such. Because no one has those characters easily accessible from their keyboard (to the best of my knowledge), and it doesn't really matter anyway what you type, this thought never crossed anyone's mind. Exactly the same goes for Hebrew.

The only doable workaround is to go through the query string before sending it to the QP, and resolve this by either escaping mid-word double-quotes or replacing them with U+05F4. Since most Hebrew dictionaries work with double-quotes for acronyms anyway, escaping it seems much better, but then I ask again - why bother with a double-pass on the query string if a simple change to the QP can resolve that? The effect the behavior has on non-Hebrew scripts is flawed anyway, and there's no reason to require such a pass for Hebrew consumers only (imagine what it'd be like to write a multi-lingual search interface with this issue in mind).
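
For illustration, the escaping variant of that pre-pass might look like the sketch below. It assumes "mid-word" means a letter or digit on both sides of the quote, which is a guess and not a safe definition for scripts that do not separate words with spaces. MidWordQuoteEscaper is a hypothetical helper, not a Lucene API.

```java
// Hypothetical pre-pass run on the raw query string before it reaches the
// query parser: backslash-escapes double quotes that sit inside a word.
final class MidWordQuoteEscaper {
    static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // Assumption: a quote is mid-word when both neighbours are
            // letters or digits. An already escaped quote is preceded by a
            // backslash, so it is not escaped a second time.
            boolean midWord = c == '"'
                && i > 0 && i + 1 < query.length()
                && Character.isLetterOrDigit(query.charAt(i - 1))
                && Character.isLetterOrDigit(query.charAt(i + 1));
            if (midWord) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeMidWordQuotes("MNK\"L"));        // acronym quote gets escaped
        System.out.println(escapeMidWordQuotes("\"bar test\""));  // phrase quotes are kept
    }
}
```

Unlike QueryParser.escape(), this leaves quotes at word edges untouched, so ordinary quoted phrases still reach the parser as phrases.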

As a reference, see how Google and Wikipedia treat Hebrew acronyms:
http://www.google.com/#hl=en&source=hp&q=%D7%9E%D7%A0%D7%9B%22%D7%9C&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=d059ab474882bfe2
http://he.wikipedia.org/wiki/%D7%9E%D7%A0%D7%9B%22%D7%9C

Google recognizes both double-quotes and GERSHAYIM as correct forms of Hebrew acronyms, while Wikipedia only uses the former in all acronyms.

Robert, I hear what you are saying, but this just ain't right when it comes to usability, when the resolution is so simple and doesn't break anything.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Earwin Burrfoot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868019#action_12868019 ] 

Earwin Burrfoot commented on LUCENE-2465:
-----------------------------------------

My special use case requires accepting a whole slew of special characters as a part of query terms, ":" included. Let's cover this too? :)

I've been thinking recently that the "parse the query, then process words within each sub-query" idea is broken. You should first analyze the whole line as text, doing all kinds of substitutions, synonymizing, collocation detection, whatever you fancy, and only then try to build a query out of it.


[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868001#action_12868001 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

bq. Also, GERSHAYIM is simply not a valid argument - users cannot type Unicode, they type text.

I am suggesting we follow the rules of Unicode, for a few reasons.
# This is not unique to the Hebrew gershayim. The same problem is found in numerous other languages, where query parser syntax overlaps with "incorrect Unicode" text in those languages. I have this same issue with the conflation of : and Bengali ঃ, and in some other charsets there is only one glyph for both.
# Adding some heuristic that does not obey the rules of Unicode risks breaking other languages. While it might seem perfectly harmless, we risk doing harmful things to other languages. This is like what happens to Chinese text today.
# Disambiguating when a " should be a gershayim is really app-dependent, just like disambiguating when : should be ঃ. It's a subproblem of character set conversion (which is not always lossless and exact), and charset conversion doesn't belong in the query parser.

So, adding some of the heuristics I see here will change phrase queries, for example for languages that don't use spaces between words, like Thai. Trying to base it on Unicode properties is very risky; ultimately it will probably break some language, because words aren't just sequences of letters separated by whitespace in all languages.

Furthermore, by following Unicode, we keep QP simpler, and it won't unintentionally or unknowingly break for any existent or future languages (such as ones not even in Unicode yet).



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867998#action_12867998 ] 

Michael McCandless commented on LUCENE-2465:
--------------------------------------------

How about if the quote is preceded by nothing (i.e., start of query), or by one of QP's "allowed" punctuation chars -- (, +, :, -, etc.?

I think it is best if QP is as non-aggressive as it can be when assuming a given character is in fact meant to be query syntax.  It makes QP as "language neutral" as it can realistically be...

So, e.g., what does QP do today if a + shows up mid-term, as in foo+bar?

Or how does QP handle xxx: with no text after the :?




[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868000#action_12868000 ] 

Shai Erera commented on LUCENE-2465:
------------------------------------

Actually, I think this is a bug, or at least inconsistent behavior. QP returns c:foo+bar for the query "foo+bar" (no quotes), when the default field is "c". Yet for "foo\"bar" it throws an exception ...

We handle such quotes properly, following rules like the ones Mike suggests. We don't declare invalid characters, but rather "valid syntax". So a \" starts a phrase only if it follows whitespace, +/-, field:, ( etc. In all other cases, it is just returned with the word/token. If the app does anything with it, then good; otherwise it won't find matches, and one can decide if that was a mistake in the query, or perhaps there isn't such a token (as in Hebrew, where \" is permitted mid-word).
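
A minimal sketch of the rule described above, in plain Java with no Lucene dependency. The class name PhraseQuoteRule, the method opensPhrase, and the exact set of opener characters are assumptions made for illustration; this is not QueryParser's grammar, only one way such a "valid syntax" check could look:

```java
/** Sketch of a "valid syntax" rule for double quotes; not Lucene's actual grammar. */
final class PhraseQuoteRule {

    // Characters after which a quote is assumed to open a phrase.
    // This set is an assumption for the sketch, taken from the list in the comment.
    private static final String OPENERS = "(+-:";

    /** Returns true if the double quote at index i should be treated as opening a phrase. */
    static boolean opensPhrase(String query, int i) {
        if (query.charAt(i) != '"') {
            return false;
        }
        if (i == 0) {
            return true; // start of query
        }
        char prev = query.charAt(i - 1);
        return Character.isWhitespace(prev) || OPENERS.indexOf(prev) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(opensPhrase("foo \"bar baz\"", 4)); // quote after a space: true
        System.out.println(opensPhrase("US\"A", 2));           // quote mid-word: false
        System.out.println(opensPhrase("title:\"bar\"", 6));   // quote after field: true
    }
}
```

A closing quote could be checked symmetrically by looking at the character that follows it.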

Also, GERSHAYIM is simply not a valid argument - users cannot type Unicode, they type text. It's like asking one to differentiate ' and ` - they are visually the same ... and if a character cannot be typed ...

So I think this should be fixed, and the above test case be added to QP.



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868017#action_12868017 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

{quote}
It's not that I disagree with what you say Robert, but I think we're arguing two different things. Please correct me if I'm wrong, but ঃ does not denote a field:value delimiter by the QueryParser, right? I've tried the following query "fooঃbar" and it was still parsed to c:fooঃbar.
{quote}

Yes, you are wrong: the problem is that the colon : is often substituted for this character. So must we change the queryparser syntax to try to disambiguate when : is really visarga, versus when : delimits a field name? No, we shouldn't, just like we shouldn't change the query parser to try to disambiguate when " is really gershayim.

It's not just Hebrew and Bengali either; the problem exists in other languages, and if you try you can probably find some natural use of a queryparser operator in some language. It's just an example to show that the problem is not unique to Hebrew, and that the disambiguation/charset conversion doesn't belong in the queryparser, but instead is up to you.



[jira] Resolved: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2465.
---------------------------------

    Resolution: Won't Fix

This isn't a bug. As mentioned, *you* need to use the correct Unicode character; it does not matter
if it's on your users' keyboards or not.

It's your responsibility to disambiguate (with whatever logic you want) when U+0022 should really be U+05F4;
then it will work correctly with Lucene (including StandardTokenizer).
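
One possible shape for that disambiguation step, as a sketch only. The heuristic (a U+0022 with a Hebrew letter on both sides becomes U+05F4) and the names GershayimNormalizer and normalize are assumptions for illustration, not anything Lucene provides; an application may well need different logic:

```java
/** Sketch of an application-side pre-processing step; the heuristic is an assumption. */
final class GershayimNormalizer {

    private static final char QUOTE = '"';          // U+0022 QUOTATION MARK
    private static final char GERSHAYIM = '\u05F4'; // HEBREW PUNCTUATION GERSHAYIM

    /** True for the Hebrew letters alef through tav (U+05D0 to U+05EA). */
    static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    /** Replaces a double quote by gershayim when it sits between two Hebrew letters. */
    static String normalize(String query) {
        StringBuilder out = new StringBuilder(query.length());
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean midWord = c == QUOTE
                    && i > 0 && i + 1 < query.length()
                    && isHebrewLetter(query.charAt(i - 1))
                    && isHebrewLetter(query.charAt(i + 1));
            out.append(midWord ? GERSHAYIM : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Four Hebrew letters with a double quote before the last one.
        String acronym = "\u05DE\u05E0\u05DB\"\u05DC";
        System.out.println(normalize(acronym).indexOf(GERSHAYIM)); // prints 3
    }
}
```

The normalized string would then be handed to the query parser in place of the raw user input; quotes that delimit real phrases are left untouched because they are not flanked by Hebrew letters on both sides.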




[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Itamar Syn-Hershko (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867993#action_12867993 ] 

Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

My point exactly - no one uses that character, and it will *always* require a double pass on the string. I have pretty much rested my case already, and it would have been clearer to you if you could read the language. Isn't Google treating those chars the same, or Wikipedia using just double-quotes, proof enough of my argument that double-quotes are allowed to be mid-word, that 99.9% of the time they are used that way, and that this isn't incorrect behavior?

For Hebrew or other multi-lingual systems this will require always preparing the string before calling parse(), and this is definitely unwanted behavior. Since the solution is *that* simple and non-breaking, I don't see why not just fix it - bug or not.

Any other opinions on the matter?



[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868013#action_12868013 ] 

Shai Erera commented on LUCENE-2465:
------------------------------------

It's not that I disagree with what you say Robert, but I think we're arguing two different things. Please correct me if I'm wrong, but ঃ does not denote a field:value delimiter by the QueryParser, right? I've tried the following query "fooঃbar" and it was still parsed to c:fooঃbar.

So my point is - the query syntax declares some characters that, if present in the query string, are parsed accordingly, but only if they also adhere to some format (e.g. foo+bar does not tokenize on '+'). And we don't declare a Unicode standard for the query syntax - only \u0022 is considered a quote for a phrase, and not all the other double quote forms, like Gershayim.

Therefore, I don't think the examples of Chinese et al. are relevant, because it is not the job of the QP to parse them properly, but the job of the Analyzer. The QP needs to tokenize the terms properly, and needs to build the query tree properly. For that, it declares several characters, ' ' (space) being one of them, and if the user wants to write proper queries, he should use them. The QP itself does not follow any Unicode standard, right?

A Hebrew example to describe the quotes problem (and I write it in English 'cause you don't read Hebrew ... yet :)) is US"A, which is the acronym of United States of America. That's a valid word. The user, following the current guidelines, should write it as US\"A if he wants the quote to be retained, and the question is whether we can do something to relax that requirement. The same would follow for any other reserved query syntax character that is used outside its proper syntactic position ...
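
For illustration, the escaping could also be done by the application before parsing, so the user does not have to type the backslash. The sketch below rests on assumptions: the names MidWordQuoteEscaper and escapeMidWordQuotes are hypothetical, and "mid-word" is taken to mean a letter or digit on both sides of the quote, which will not hold for every language:

```java
/** Sketch of escaping mid-word double quotes before handing a query to a parser. */
final class MidWordQuoteEscaper {

    /** Inserts a backslash before each double quote that has a letter or digit on both sides. */
    static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"' && i > 0 && i + 1 < query.length()
                    && Character.isLetterOrDigit(query.charAt(i - 1))
                    && Character.isLetterOrDigit(query.charAt(i + 1))) {
                out.append('\\'); // escape so the quote stays part of the term
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeMidWordQuotes("US\"A"));         // prints US\"A
        System.out.println(escapeMidWordQuotes("\"bar test\""));  // prints "bar test"
    }
}
```

A quote that is already escaped is left alone, since the character before it is then a backslash rather than a letter or digit.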

I do agree though that such a change is problematic for backwards compatibility, because previously failing queries may suddenly succeed. Specifically, the parser won't throw ParseException if the user makes a mistake, in languages in which " is not a valid mid-word character. But I also feel that the current behavior is wrong, or at least too restrictive. And the reserved characters behave inconsistently. BTW, FWIW "foo:" (with : and no value) also throws ParseException ...
