
Posted to dev@lucene.apache.org by Robert Muir <rc...@gmail.com> on 2010/05/12 15:25:25 UTC

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

On Wed, May 12, 2010 at 6:05 AM, Itamar Syn-Hershko <it...@code972.com> wrote:
> The QueryParser also fails to correctly parse Hebrew acronyms; although not
> being an integral part of the current discussion, I thought this would be
> the best place to bring that up.
>

Just as I don't think Analysis should do QueryParsing, I don't think
QueryParsing should do Analysis either.
Similar problems to this exist in other languages (I have to escape :
for some, because Lucene wants to interpret it as a field name).

But this can be easily remedied on the application side: it's
documented and understood that the double-quote is a special
character, and there is an escape mechanism so you can escape the ones
you think are acronyms.

This issue is about a buggy implementation: it's not documented
and is only internal to how the queryparser determines what is a phrase
query or not (and, contrary to what you would believe from the
documentation, the choice of whether or not to make a PhraseQuery is
not based on syntax one bit!)

-- 
Robert Muir
rcmuir@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Mark Miller <ma...@gmail.com>.

On 5/12/10 11:24 AM, Robert Muir wrote:
> On Wed, May 12, 2010 at 11:16 AM, Mark Miller <ma...@gmail.com> wrote:
>>
>> That's a major exaggeration - quoting text plays a large role in whether or
>> not you will get a phrase query.
>>
>
> No, it has nothing to do with it in the implementation. The quoting only
> "escapes the whitespace" and is then discarded. This is clear from looking
> at the grammar.
>
> The logic that then determines whether you get a phrase query is the huge
> mess of code in getFieldQuery, but it's not based on the double quotes at
> all.
>
> For example, a list of Chinese or Thai words gets a phrase query, only
> because they don't use whitespace between words.
> But a similar list of English words gets a boolean query.
>

Quotes play a part; otherwise quoting something would simply not create a
phrase query. Quoting something ensures that it hits the analyzer as
one chunk, rather than getting meta-parsed by the grammar and fed to the
analyzer a token at a time. This ensures that multiple tokens hit the
funky logic to create a phrase query. The grammar specifically looks for
quoted chunks.

-- 
- Mark

http://www.lucidimagination.com


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Robert Muir <rc...@gmail.com>.

On Wed, May 12, 2010 at 11:16 AM, Mark Miller <ma...@gmail.com> wrote:
>
> That's a major exaggeration - quoting text plays a large role in whether or
> not you will get a phrase query.
>

No, it has nothing to do with it in the implementation. The quoting only
"escapes the whitespace" and is then discarded. This is clear from looking
at the grammar.

The logic that then determines whether you get a phrase query is the huge
mess of code in getFieldQuery, but it's not based on the double quotes at
all.

For example, a list of Chinese or Thai words gets a phrase query, only
because they don't use whitespace between words.
But a similar list of English words gets a boolean query.
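
To make the point about term count concrete, here is a rough Java sketch.
It is not Lucene's getFieldQuery: the class, the toy analyze() method (which
assumes each Han character becomes its own token) and the string labels are
hypothetical stand-ins, and synonym or multi-phrase cases are left out. It
only models the two steps discussed here: unquoted text is split on
whitespace, and the query type is then picked from how many tokens the
analyzer returns for each chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class TermCountSketch {

    /** Toy analyzer (assumption): one token per Han character, one token per run of other letters/digits. */
    public static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < chunk.length(); ) {
            int cp = chunk.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || !Character.isLetterOrDigit(cp)) {
                // A Han character or a separator ends the current run.
                if (run.length() > 0) {
                    tokens.add(run.toString());
                    run.setLength(0);
                }
                if (han) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            } else {
                run.appendCodePoint(cp);
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    /** The choice depends only on the number of tokens, never on quote syntax. */
    public static String chooseQuery(List<String> tokens) {
        if (tokens.isEmpty()) {
            return "none";
        }
        return tokens.size() == 1 ? "TermQuery" : "PhraseQuery";
    }

    /** Models the grammar splitting unquoted text on whitespace before analysis. */
    public static List<String> parseUnquoted(String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            clauses.add(chooseQuery(analyze(chunk)));
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Two English words: two chunks, one token each.
        System.out.println(parseUnquoted("big dog"));            // [TermQuery, TermQuery]
        // Three Han characters, no whitespace: one chunk, three tokens.
        System.out.println(parseUnquoted("\u4e2d\u56fd\u4eba")); // [PhraseQuery]
    }
}
```

Neither input contains a double quote, yet only the second one ends up as a
phrase, which is the behavior this issue is about.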

-- 
Robert Muir
rcmuir@gmail.com


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Mark Miller <ma...@gmail.com>.

On 5/12/10 9:25 AM, Robert Muir wrote:
> (and, contrary to what you would believe from the
> documentation, the choice of whether or not to make a PhraseQuery is
> not based on syntax one bit!)
>

That's a major exaggeration - quoting text plays a large role in whether
or not you will get a phrase query.


-- 
- Mark

http://www.lucidimagination.com


RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Itamar Syn-Hershko <it...@code972.com>.

Again, this is not a hack, and that was exactly my point. As I said:

> resolving this is very simple: just apply the correct logic
> (ignore double-quotes followed by a char), which isn't enforced today;
> once it is, it won't cause any cases of unexpected behavior.

It is just as valid for English queries as it is for Hebrew ones to ignore
a double-quote in mid-word, instead of tokenizing on it, when it is not
followed by whitespace.

Itamar. 


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Robert Muir <rc...@gmail.com>.

Internationalization doesn't work by just piling hacks for language X,
language Y, and language Z on top of each other.

Just like I want the English hack removed, I strongly recommend
against adding any Hebrew hack.

-- 
Robert Muir
rcmuir@gmail.com


RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Itamar Syn-Hershko <it...@code972.com>.

I think we understand each other perfectly well. I still think resolving
this is very simple: just apply the correct logic (ignore double-quotes
followed by a char), which isn't enforced today; once it is, it won't
cause any cases of unexpected behavior. This isn't an analysis-related task,
and I'm not sure what makes you insist so strongly. I will open a
dedicated JIRA ticket for this discussion if it doesn't become part of the
current one.

Itamar.


Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Robert Muir <rc...@gmail.com>.

On Wed, May 12, 2010 at 6:30 PM, Itamar Syn-Hershko <it...@code972.com> wrote:
> Never did I request the QP to do Analysis. I simply mentioned this bug -
> what this definitely is -

It's definitely not a bug for Hebrew: there is a Unicode character for
gershayim (U+05F4), so technically this should be used according to
Unicode.

It's arguably your responsibility to convert your data to Unicode
before passing it through Lucene, and that includes disambiguating when a
double quote should be gershayim.
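
A minimal sketch of that kind of application-side conversion, in plain Java
with no Lucene dependency. The class and method names are made up for
illustration, and the heuristic (an ASCII double quote with a Hebrew letter
directly on both sides is treated as gershayim) is an assumption; real
input may need a more careful rule.

```java
public class GershayimSketch {

    /** HEBREW PUNCTUATION GERSHAYIM, U+05F4. */
    private static final char GERSHAYIM = '\u05F4';

    /** True for the Hebrew letters alef through tav (U+05D0 to U+05EA). */
    public static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    /** Replaces an ASCII double quote with gershayim when it sits between two Hebrew letters. */
    public static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean midWord = c == '"'
                    && i > 0
                    && i + 1 < input.length()
                    && isHebrewLetter(input.charAt(i - 1))
                    && isHebrewLetter(input.charAt(i + 1));
            out.append(midWord ? GERSHAYIM : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Acronym spelled tsadi, he, double quote, lamed.
        String acronym = "\u05E6\u05D4\"\u05DC";
        String fixed = normalize(acronym);
        // The third character is now U+05F4 instead of the ASCII quote.
        System.out.println(Integer.toHexString(fixed.charAt(2))); // 5f4
        // Quotes at the edges of a phrase are left alone.
        System.out.println(normalize("\"abc\"").equals("\"abc\"")); // true
    }
}
```

Run before the text reaches the query parser, a step like this leaves the
parser's treatment of the ASCII double quote unchanged.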

-- 
Robert Muir
rcmuir@gmail.com


RE: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count

Posted by Itamar Syn-Hershko <it...@code972.com>.

Never did I request the QP to do Analysis. I simply mentioned this bug -
what this definitely is - so you could tackle it while you're at it. This is
definitely relevant to a discussion about re-making how the QP determines
what is a legit PhraseQuery and what is not.

The fix is quite easy I believe - just make sure you don't identify a
double-quote as a trigger for starting or ending a phrase unless it is
followed by a white-space (or another non-char). An English query like
'Foo"bar"' (with no enclosing quotes...) is invalid anyway (although it is
not handled as such at the moment).
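
One possible reading of this rule, sketched in Java. This is not how the
QueryParser grammar works today, and the names are hypothetical: a double
quote counts as a phrase delimiter unless it has a letter or digit directly
on both sides, in which case it is treated as part of the word.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteRuleSketch {

    /** A quote with a letter or digit on both sides is treated as part of a word, not as a delimiter. */
    public static boolean isPhraseDelimiter(String query, int i) {
        if (query.charAt(i) != '"') {
            return false;
        }
        boolean letterBefore = i > 0 && Character.isLetterOrDigit(query.charAt(i - 1));
        boolean letterAfter = i + 1 < query.length()
                && Character.isLetterOrDigit(query.charAt(i + 1));
        return !(letterBefore && letterAfter);
    }

    /** Returns the indexes of all quotes that would start or end a phrase under this rule. */
    public static List<Integer> delimiterPositions(String query) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < query.length(); i++) {
            if (isPhraseDelimiter(query, i)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // An ordinary quoted phrase: both quotes are delimiters.
        System.out.println(delimiterPositions("say \"big dog\""));        // [4, 12]
        // Foo"bar": the mid-word quote is ignored, the trailing one is left unbalanced.
        System.out.println(delimiterPositions("Foo\"bar\""));             // [7]
        // Hebrew acronym with a mid-word quote: no delimiters at all.
        System.out.println(delimiterPositions("\u05E6\u05D4\"\u05DC"));   // []
    }
}
```

The second case shows why a query like Foo"bar" would still need explicit
handling: under this reading it ends with a single unmatched delimiter.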

I cannot handle this on the application side, simply because there the
double-quote char is NOT a special character. As I mentioned, for Hebrew it
is part of the word, pretty much like Niqqud is. If the user has entered a
textual query with an acronym, there's no point in my parsing it once just
to escape what I suspect are acronyms and then sending it to the core QP, or
in just creating the queries myself. All of this holds in light of the
second paragraph of this message: the fix is easy and also correct for the
basic, non-Hebrew implementation.

Itamar.
