Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2011/08/17 23:40:45 UTC

Strange change to query parser behaviour in recent versions

Hi all.

Suppose I am searching for - 限定

In 3.0, QueryParser would parse this as a phrase query.  In 3.3, it
parses it as a boolean query, but offers an option to treat it like a
phrase.  Why would the default be not to do this?  Surely you would
always want it to become a phrase query.

The new parser (StandardQueryParser) parses it as a boolean query
also, and this is where I actually noticed the change (I noticed the
change in QueryParser when I tried to make a code example to show the
difference between the two.)  Is there an equivalent setting to make
it generate a phrase query instead?  Currently I am working around this
by inserting a QueryNodeProcessor which converts all unquoted field
queries to quoted queries.
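
To illustrate the shape of the workaround described above, here is a
minimal, self-contained sketch of a tree rewrite that turns every
unquoted field node into a quoted one.  The Node class, quoteAll and
render are hypothetical names invented for this illustration; they are
not Lucene's QueryNode or QueryNodeProcessor API, and a real processor
would have to follow whatever the flexible query parser's pipeline
expects.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteRewriteSketch {
    // Hypothetical minimal node type; NOT one of Lucene's QueryNode classes.
    public static class Node {
        public final String field;    // null marks a container (boolean) node
        public final String text;     // term text, unused for containers
        public final boolean quoted;  // true if the user (or a rewrite) quoted it
        public final List<Node> children = new ArrayList<>();

        public Node(String field, String text, boolean quoted) {
            this.field = field;
            this.text = text;
            this.quoted = quoted;
        }

        // Convenience factory for a container node holding the given children.
        public static Node bool(Node... kids) {
            Node n = new Node(null, null, false);
            for (Node k : kids) {
                n.children.add(k);
            }
            return n;
        }
    }

    // Returns a copy of the tree in which every field node is marked as quoted.
    public static Node quoteAll(Node node) {
        if (node.field != null) {
            return new Node(node.field, node.text, true);
        }
        Node copy = new Node(null, null, false);
        for (Node child : node.children) {
            copy.children.add(quoteAll(child));
        }
        return copy;
    }

    // Renders the tree in a made-up textual form, for inspection only.
    public static String render(Node node) {
        if (node.field != null) {
            return node.quoted
                    ? node.field + ":\"" + node.text + "\""
                    : node.field + ":" + node.text;
        }
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) {
            parts.add(render(child));
        }
        return "(" + String.join(" ", parts) + ")";
    }

    public static void main(String[] args) {
        Node tree = Node.bool(new Node("body", "限定", false));
        System.out.println(render(quoteAll(tree)));
    }
}
```

Quoting the node before analysis means a downstream step that builds
phrase queries from quoted text would see 限定 as a phrase, which is the
effect the workaround is after.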

Since we claim to support multiple languages, if there is a good reason
for this *not* to be a phrase query, maybe I shouldn't be doing this
workaround?

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Strange change to query parser behaviour in recent versions

Posted by Trejkaz <tr...@trypticon.org>.

On Sat, Aug 20, 2011 at 7:00 PM, Robert Muir <rc...@gmail.com> wrote:
> On Sat, Aug 20, 2011 at 3:34 AM, Trejkaz <tr...@trypticon.org> wrote:
>
>>
>> As an aside, Google's behaviour seems to follow the "old" way.  For
>> instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns
>> 772,000,000.  (Interestingly, [[ "限定" ]] returns 643,000,000 hits.
>> Slightly more than you might expect.)
>>
>
> No, it doesn't. Try a query on 北京医科大学.
>
> You are confusing tokenization with query-generation itself: if you
> want 限定 to be treated as a compound then use a tokenizer that does
> this.

Nope.  I'm not confusing the two; I just haven't seen the source code
for Google, so I can't say at which level it was doing it.  For my
example it seemed pretty opaque.

That's a good example, though.

TX

Re: Strange change to query parser behaviour in recent versions

Posted by Robert Muir <rc...@gmail.com>.

On Sat, Aug 20, 2011 at 3:34 AM, Trejkaz <tr...@trypticon.org> wrote:

>
> As an aside, Google's behaviour seems to follow the "old" way.  For
> instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns
> 772,000,000.  (Interestingly, [[ "限定" ]] returns 643,000,000 hits.
> Slightly more than you might expect.)
>

No, it doesn't. Try a query on 北京医科大学.

You are confusing tokenization with query-generation itself: if you
want 限定 to be treated as a compound then use a tokenizer that does
this.

-- 
lucidimagination.com

Re: Strange change to query parser behaviour in recent versions

Posted by Trejkaz <tr...@trypticon.org>.

On Fri, Aug 19, 2011 at 11:05 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> See LUCENE-2458 for the backstory.
>
> The argument was that while phrase queries were historically generated by
> the query parser when a single (whitespace-delimited) "chunk" of query
> parser input produced multiple tokens, that logic didn't make sense in
> CJK-type languages, where whitespace is not semantically meaningful for
> separating "terms".
>
> As I understand it: both [[ 限 定 ]] and [[ 限定 ]] should be treated
> equivalently in Asian languages, so they *both* become BooleanQueries for
> those two words (using the default query operator).

It's odd.  I thought that automatically generating phrase queries was
actually useful specifically *for* CJK languages, as it essentially
allows searching for a "word" as if it is really being tokenised as
one (which of course it isn't.  Not with StandardTokenizer, anyway.)

Since the Javadoc said it wasn't good for all, I assumed it had to be
something more obscure than CJK.  But now I'll have to ask our users
in those countries to see whether the old behaviour is actually
inconvenient for them.  If it is, we'll probably just adopt the new
way and remove our hack.

> I don't necessarily agree with the fact that the default was changed, but
> (unless I'm completely missing something) it was changed in a way that
> should be back-compatible if you use a consistent Version param on your
> QueryParser instance.

This is true.  QueryParser itself is fine (default aside), it's
StandardQueryParser which currently offers no choice, which is where I
first encountered this surprising behaviour.  In fact, the reason I
discovered it was because we had unit tests parsing Japanese queries
and confirming that they did come back as phrases.  :)

As an aside, Google's behaviour seems to follow the "old" way.  For
instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns
772,000,000.  (Interestingly, [[ "限定" ]] returns 643,000,000 hits.
Slightly more than you might expect.)

TX

Re: Strange change to query parser behaviour in recent versions

Posted by Chris Hostetter <ho...@fucit.org>.

See LUCENE-2458 for the backstory.

The argument was that while phrase queries were historically generated by
the query parser when a single (whitespace-delimited) "chunk" of query
parser input produced multiple tokens, that logic didn't make sense in
CJK-type languages, where whitespace is not semantically meaningful for
separating "terms".

As I understand it: both [[ 限 定 ]] and [[ 限定 ]] should be treated
equivalently in Asian languages, so they *both* become BooleanQueries for
those two words (using the default query operator).
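
The behaviour described above can be modelled with a toy example.  This
is only a sketch under stated assumptions: tokenize stands in for an
analyzer that emits one token per character (roughly what the thread
reports for StandardTokenizer on this input), and build, its
autoGeneratePhrase flag and the string output format are made up for
illustration rather than taken from Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkQuerySketch {
    // Toy analyzer: one token per non-whitespace code point.
    // Only a reasonable stand-in for CJK input such as 限定.
    public static List<String> tokenize(String chunk) {
        List<String> tokens = new ArrayList<>();
        chunk.codePoints().forEach(cp -> {
            if (!Character.isWhitespace(cp)) {
                tokens.add(new String(Character.toChars(cp)));
            }
        });
        return tokens;
    }

    // Builds a textual query for ONE whitespace-delimited chunk of input.
    // autoGeneratePhrase = true models the old behaviour (phrase query);
    // false models the new one (one clause per token, joined by the
    // default operator).
    public static String build(String field, String chunk,
                               boolean autoGeneratePhrase, String defaultOperator) {
        List<String> tokens = tokenize(chunk);
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        if (autoGeneratePhrase) {
            return field + ":\"" + String.join(" ", tokens) + "\"";
        }
        List<String> clauses = new ArrayList<>();
        for (String token : tokens) {
            clauses.add(field + ":" + token);
        }
        return String.join(" " + defaultOperator + " ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("body", "限定", true, "OR"));   // body:"限 定"
        System.out.println(build("body", "限定", false, "OR"));  // body:限 OR body:定
    }
}
```

With the flag off, 限定 and 限 定 end up as the same pair of clauses,
which is the equivalence the argument relies on.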

I don't necessarily agree with the fact that the default was changed, but
(unless I'm completely missing something) it was changed in a way that
should be back-compatible if you use a consistent Version param on your
QueryParser instance.

https://issues.apache.org/jira/browse/LUCENE-2458


-Hoss