Posted to solr-user@lucene.apache.org by Michael Jakl <ja...@gmail.com> on 2012/01/23 18:07:58 UTC

edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Hi, I've been wondering why some of my queries did not return the
results I expected. A debugQuery resulted in the following:

<str name="querystring">
"java"^0.0 OR "haskell"^0.0 OR "python"^0.0 OR ("ruby"^0.0) AND
(("programming"^0.0)) OR "programming language"^0.0 OR "code
coding"^0.0 OR -"mobile"^0.0 OR -"android"^0.0 OR -"microsoft"^0.0 OR
-"windows"^0.0
</str>
<str name="parsedquery">
+(DisjunctionMaxQuery((stemmedText:java))
DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:haskell))
DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:python))
DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:ruby))
+DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:program))
DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:"program language"))
DisjunctionMaxQuery((stemmedText:0.0))
DisjunctionMaxQuery((stemmedText:"code code"))
DisjunctionMaxQuery((stemmedText:0.0))
-DisjunctionMaxQuery((stemmedText:mobile))
DisjunctionMaxQuery((stemmedText:0.0))
-DisjunctionMaxQuery((stemmedText:android))
DisjunctionMaxQuery((stemmedText:0.0))
-DisjunctionMaxQuery((stemmedText:microsoft))
DisjunctionMaxQuery((stemmedText:0.0))
-DisjunctionMaxQuery((stemmedText:window))
DisjunctionMaxQuery((stemmedText:0.0))) ()
</str>

Why is the "java" part marked mandatory (using the + notation)? These
rewritings seem to happen when the queries get quite long. Is there a
way to prevent Solr from assuming I wanted "java" to be a mandatory
term, or from deducing any mandatory fields at all?

I've tried it with the ExtendedDismaxQParser and the DismaxQParser,
both yield the same parsedquery.

The LuceneQParser yielded the following:
<str name="querystring">
"java"^0.0 OR "haskell"^0.0 OR "python"^0.0 OR ("ruby"^0.0) AND
(("programming"^0.0)) OR "programming language"^0.0 OR "code
coding"^0.0 OR -"mobile"^0.0 OR -"android"^0.0 OR -"microsoft"^0.0 OR
-"windows"^0.0
</str>
<str name="parsedquery">
stemmedText:java^0.0 stemmedText:haskell^0.0 stemmedText:python^0.0
+stemmedText:ruby^0.0 +stemmedText:program^0.0
PhraseQuery(stemmedText:"program language"^0.0)
PhraseQuery(stemmedText:"code code"^0.0) -stemmedText:mobile^0.0
-stemmedText:android^0.0 -stemmedText:microsoft^0.0
-stemmedText:window^0.0
</str>

Now Solr thinks I want "ruby" (and "program", the stemmed version of
"programming") to be mandatory.

I'm running Solr 3.5 on 64-bit Linux.

Any suggestions would be greatly appreciated,
Michael

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Michael Jakl <ja...@gmail.com>.
On Tue, Jan 24, 2012 at 06:27, Erick Erickson <er...@gmail.com> wrote:
> Well, at root the Lucene query parser makes no claim of
> enforcing boolean logic. Think in terms of MUST, SHOULD
> and NOT instead.
>
> Here's a good writeup...
>
> http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/

Hi, thank you! This explanation (and your clarification) is exactly
what I was searching for.
Cheers,
Michael

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Erick Erickson <er...@gmail.com>.
Well, at root the Lucene query parser makes no claim of
enforcing boolean logic. Think in terms of MUST, SHOULD
and NOT instead.

Here's a good writeup...

http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
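
The MUST / SHOULD / NOT view can be made concrete with a small model. This is a simplified sketch, not the parser's actual code: it assumes a flat, already tokenized query with OR as the default operator, and the function name assign_occur is made up for illustration. The rule it models is that AND marks the clauses on both of its sides as MUST, OR leaves a clause as SHOULD, and a leading "-" makes it MUST_NOT.

```python
def assign_occur(tokens):
    """Assign MUST / SHOULD / MUST_NOT to a flat token list (default operator OR)."""
    clauses = []   # list of [occur, term]
    conj = None    # operator seen since the previous term, if any
    for tok in tokens:
        if tok in ("AND", "OR"):
            conj = tok
            continue
        prohibited = tok.startswith("-")
        term = tok[1:] if prohibited else tok
        # AND also promotes the clause on its left, unless that clause is prohibited.
        if conj == "AND" and clauses and clauses[-1][0] != "MUST_NOT":
            clauses[-1][0] = "MUST"
        if prohibited:
            occur = "MUST_NOT"
        elif conj == "AND":
            occur = "MUST"
        else:
            occur = "SHOULD"
        clauses.append([occur, term])
        conj = None
    return [tuple(c) for c in clauses]


# Hypothetical, shortened version of the query from this thread.
query = "java OR python OR ruby AND program OR -mobile".split()
for occur, term in assign_occur(query):
    print(occur, term)
```

Under this model "ruby" and "program" come out as MUST while "java" and "python" stay SHOULD, which is the same shape as the parsedquery output quoted in this thread.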

Best
Erick

On Mon, Jan 23, 2012 at 2:43 PM, Michael Jakl <ja...@gmail.com> wrote:
> On Mon, Jan 23, 2012 at 22:05, Erick Erickson <er...@gmail.com> wrote:
>> Right. Essentially, the precedence is given to AND, so this is parsed
>> as though it were python OR (ruby AND programming) OR "programming language"
>
> That's exactly what I'd expect, but the problem is that "ruby" is
> marked as mandatory, that is, I don't get any articles not containing
> ruby, whereas the query, as I'd interpret it, should allow articles
> containing only python as well.
>
> Maybe another example illustrates my problem.
> If I search for "awordthatdoesnotexistinmyindex AND java OR python"
> (assuming that java and python occur in my index), I won't get any
> articles because awordthatdoesnotexistinmyindex isn't to be found
> anywhere.
>
> The query parser outputs:
> +(+DisjunctionMaxQuery((stemmedText:awordthatdoesnotexistinmyindex))
> +DisjunctionMaxQuery((stemmedText:java))
> DisjunctionMaxQuery((stemmedText:python)))
>
> Is this not boolean logic as one might expect? Are clauses containing
> AND always mandatory? I'm sorry to insist here, but it seems so
> counterintuitive to me.
>
> Thanks for your patience,
> Michael

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Michael Jakl <ja...@gmail.com>.
On Mon, Jan 23, 2012 at 22:05, Erick Erickson <er...@gmail.com> wrote:
> Right. Essentially, the precedence is given to AND, so this is parsed
> as though it were python OR (ruby AND programming) OR "programming language"

That's exactly what I'd expect, but the problem is that "ruby" is
marked as mandatory, that is, I don't get any articles not containing
ruby, whereas the query, as I'd interpret it, should allow articles
containing only python as well.

Maybe another example illustrates my problem.
If I search for "awordthatdoesnotexistinmyindex AND java OR python"
(assuming that java and python occur in my index), I won't get any
articles because awordthatdoesnotexistinmyindex isn't to be found
anywhere.

The query parser outputs:
+(+DisjunctionMaxQuery((stemmedText:awordthatdoesnotexistinmyindex))
+DisjunctionMaxQuery((stemmedText:java))
DisjunctionMaxQuery((stemmedText:python)))
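
Why this parsed query returns nothing can be checked with a toy matcher. The sketch below is an assumption-laden model of how a flat list of MUST / SHOULD / MUST_NOT clauses is evaluated (with no minimum-should-match setting); the function name matches and the sample documents are invented for illustration, and documents are reduced to plain sets of terms.

```python
def matches(clauses, doc_terms):
    """Evaluate a flat list of (occur, term) clauses against a set of terms."""
    must = [t for occur, t in clauses if occur == "MUST"]
    should = [t for occur, t in clauses if occur == "SHOULD"]
    must_not = [t for occur, t in clauses if occur == "MUST_NOT"]

    # Any prohibited term rules the document out.
    if any(t in doc_terms for t in must_not):
        return False
    # If there are MUST clauses, all of them are required and SHOULD
    # clauses only influence the score, not whether the document matches.
    if must:
        return all(t in doc_terms for t in must)
    # With no MUST clause, at least one SHOULD clause has to match.
    return any(t in doc_terms for t in should)


# The clauses from the parser output above.
clauses = [
    ("MUST", "awordthatdoesnotexistinmyindex"),
    ("MUST", "java"),
    ("SHOULD", "python"),
]

print(matches(clauses, {"python"}))          # False
print(matches(clauses, {"java", "python"}))  # False
```

Because one MUST term occurs in no document, no document can satisfy the query, and the optional "python" clause never gets a chance to rescue it.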

Is this not boolean logic as one might expect? Are clauses containing
AND always mandatory? I'm sorry to insist here, but it seems so
counterintuitive to me.

Thanks for your patience,
Michael

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Erick Erickson <er...@gmail.com>.
Right. Essentially, the precedence is given to AND, so this is parsed
as though it were python OR (ruby AND programming) OR "programming language"
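
The difference between a flat clause list and an explicitly parenthesized query can be illustrated with a small recursive matcher. This is only a sketch under stated assumptions: it assumes that explicit parentheses in the query string produce a nested sub-query, it ignores scoring, and the name matches and the tuple-based query representation are hypothetical.

```python
def matches(query, doc_terms):
    """query is either a term (str) or a list of (occur, sub_query) pairs."""
    if isinstance(query, str):
        return query in doc_terms
    must = [q for occur, q in query if occur == "MUST"]
    should = [q for occur, q in query if occur == "SHOULD"]
    must_not = [q for occur, q in query if occur == "MUST_NOT"]
    if any(matches(q, doc_terms) for q in must_not):
        return False
    if must:
        return all(matches(q, doc_terms) for q in must)
    return any(matches(q, doc_terms) for q in should)


# Flat form: python OR ruby AND program
flat = [("SHOULD", "python"), ("MUST", "ruby"), ("MUST", "program")]

# Explicitly grouped form: python OR (ruby AND program)
nested = [
    ("SHOULD", "python"),
    ("SHOULD", [("MUST", "ruby"), ("MUST", "program")]),
]

doc = {"python"}
print(matches(flat, doc))    # False
print(matches(nested, doc))  # True
```

In the nested form the two MUST clauses are confined to their own group, so a document containing only "python" still matches through the top-level SHOULD clause.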

Best
Erick

On Mon, Jan 23, 2012 at 10:55 AM, Michael Jakl <ja...@gmail.com> wrote:
> Hi!
>
> On Mon, Jan 23, 2012 at 18:42, Erick Erickson <er...@gmail.com> wrote:
>> Count your parentheses (anyone here speak Lisp?) I think that +
>> is outside the entire clause, meaning it's saying that there is
>> a single mandatory clause, and it's the whole thing....
>
> You're right, in that case it's the whole query. Pardon me, I chose a
> bad example. Using your input concerning the boost values, here is
> another (cleaner) example (edited for readability):
>
>
> <str name="querystring">
> "java"
>  OR "haskell"
>  OR "python"
>  OR "ruby"
>  AND "programming"
>  OR "programming language"
>  OR "code coding"
>  OR -"mobile"
>  OR -"android"
>  OR -"microsoft"
>  OR -"windows"
> </str>
> <str name="parsedquery">
> +(
>  DisjunctionMaxQuery((stemmedText:java))
>  DisjunctionMaxQuery((stemmedText:haskell))
>  DisjunctionMaxQuery((stemmedText:python))
>  +DisjunctionMaxQuery((stemmedText:ruby))
>  +DisjunctionMaxQuery((stemmedText:program))
>  DisjunctionMaxQuery((stemmedText:"program language"))
>  DisjunctionMaxQuery((stemmedText:"code code"))
>  -DisjunctionMaxQuery((stemmedText:mobile))
>  -DisjunctionMaxQuery((stemmedText:android))
>  -DisjunctionMaxQuery((stemmedText:microsoft))
>  -DisjunctionMaxQuery((stemmedText:window))
> )
> </str>
>
> I've tried this using the three mentioned query parsers, all promote
> "ruby" and "program" to be mandatory. I was hoping for a
> "dontBeTooSmart=true" switch or something.
>
>> But boosting by 0.0 is probably a really bad thing. This may be
>> dropping all the scores to 0, which means "no match". The
>> default boost is 1.0 since it's multiplied to influence the score,
>> not added. So I'd try either not boosting or making
>> it something other than 0.
>
> Thank you very much for spotting this. The FAQ[1] is a bit confusing
> on that matter: if a boost of 0.0001 is still a boost, then 0.0 must
> be no boost at all. At least that was my logic.
>
> Cheers,
> Michael
>
>  1: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Michael Jakl <ja...@gmail.com>.
Hi!

On Mon, Jan 23, 2012 at 18:42, Erick Erickson <er...@gmail.com> wrote:
> Count your parentheses (anyone here speak Lisp?) I think that +
> is outside the entire clause, meaning it's saying that there is
> a single mandatory clause, and it's the whole thing....

You're right, in that case it's the whole query. Pardon me, I chose a
bad example. Using your input concerning the boost values, here is
another (cleaner) example (edited for readability):


<str name="querystring">
"java"
 OR "haskell"
 OR "python"
 OR "ruby"
 AND "programming"
 OR "programming language"
 OR "code coding"
 OR -"mobile"
 OR -"android"
 OR -"microsoft"
 OR -"windows"
</str>
<str name="parsedquery">
+(
 DisjunctionMaxQuery((stemmedText:java))
 DisjunctionMaxQuery((stemmedText:haskell))
 DisjunctionMaxQuery((stemmedText:python))
 +DisjunctionMaxQuery((stemmedText:ruby))
 +DisjunctionMaxQuery((stemmedText:program))
 DisjunctionMaxQuery((stemmedText:"program language"))
 DisjunctionMaxQuery((stemmedText:"code code"))
 -DisjunctionMaxQuery((stemmedText:mobile))
 -DisjunctionMaxQuery((stemmedText:android))
 -DisjunctionMaxQuery((stemmedText:microsoft))
 -DisjunctionMaxQuery((stemmedText:window))
)
</str>

I've tried this using the three mentioned query parsers, all promote
"ruby" and "program" to be mandatory. I was hoping for a
"dontBeTooSmart=true" switch or something.

> But boosting by 0.0 is probably a really bad thing. This may be
> dropping all the scores to 0, which means "no match". The
> default boost is 1.0 since it's multiplied to influence the score,
> not added. So I'd try either not boosting or making
> it something other than 0.

Thank you very much for spotting this. The FAQ[1] is a bit confusing
on that matter: if a boost of 0.0001 is still a boost, then 0.0 must
be no boost at all. At least that was my logic.

Cheers,
Michael

 1: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F

Re: edismax/dismax/Lucene Query Parser converts some fields to be "mandatory"

Posted by Erick Erickson <er...@gmail.com>.
Count your parentheses (anyone here speak Lisp?) I think that +
is outside the entire clause, meaning it's saying that there is
a single mandatory clause, and it's the whole thing....

But boosting by 0.0 is probably a really bad thing. This may be
dropping all the scores to 0, which means "no match". The
default boost is 1.0 since it's multiplied to influence the score,
not added. So I'd try either not boosting or making
it something other than 0.
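
The multiplicative nature of a boost can be shown with a toy calculation. This is a deliberately simplified sketch, not Lucene's real scoring formula: the function boosted_score and the raw score values are made up, and the only assumption carried over from the explanation above is that the boost is a factor with a default of 1.0.

```python
def boosted_score(raw_score, boost=1.0):
    """Toy model: a clause's contribution is its raw score times its boost."""
    return raw_score * boost


raw_scores = {"doc_a": 2.5, "doc_b": 0.8}  # invented raw scores for one clause

for boost in (1.0, 0.0001, 0.0):
    contributions = {doc: boosted_score(s, boost) for doc, s in raw_scores.items()}
    print(boost, contributions)
```

With a boost of 1.0 the raw scores pass through unchanged, a tiny boost such as 0.0001 shrinks them but keeps their relative order, and a boost of 0.0 flattens every contribution to zero.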

Best
Erick
