Posted to java-user@lucene.apache.org by Em <ma...@yahoo.de> on 2011/10/05 09:42:57 UTC

How is Number of Boolean Clauses calculated - Minimum Should Match?

Hello list,

how does BooleanQuery calculate the number of its clauses? Is
this number based on the analyzed query or on the raw query string?

Imagine a StopFilter or a SynonymFilter is applied to a
BooleanQuery during analysis - the number of clauses could shrink or
grow.

I recall that in connection with the MinimumShouldMatch param there may
be problems if you query some fields with an applied StopFilter and some
fields without.

I tried to answer a question on the mailing lists and noticed that I am
relatively unsure about how MM is calculated in general and especially
in Solr (I did a code review, but it left me a little confused).

Thank you!

Regards,
Em

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Chris Hostetter <ho...@fucit.org>.
: From my understanding this could be also dangerous for queries that
: reduce the number of tokens.
: Imagine: Search Engine => SE (reduced to SE).
: This should have the same impact on the min should match as a stopword, no?

Not really ... assuming you mean *query* based synonyms, then a multiword
synonym used in the query string isn't going to be respected unless it's
explicitly quoted, because each "chunk" of query parser input is analyzed
independently.  (remember: the QueryParser parses according to its own
meta-characters -- including whitespace -- before passing any parts of the
input to the individual analyzers)

Even if it is quoted, and it reduces to one term in fieldA, but remains 
two terms in fieldB, the number of clauses isn't affected because the end 
result for each chunk is what's used to create the DisjunctionMaxQuery 
objects that are used as the clauses in the top level BooleanQuery.
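
A minimal sketch of the chunking described above, in plain Python rather than Lucene or Solr code. The field names, the two toy analyzers and the helper names are all hypothetical; the only point is that the number of top-level clauses follows the number of chunks, not the number of terms each field's analyzer produces.

```python
import re

def split_chunks(query):
    """Split on whitespace, keeping double-quoted phrases together (simplified parser)."""
    return [c.strip('"') for c in re.findall(r'"[^"]*"|\S+', query)]

def analyze_with_synonym(chunk):
    """Hypothetical analyzer for fieldA: collapses 'search engine' to 'se'."""
    text = chunk.lower()
    if text == "search engine":
        return ["se"]
    return text.split()

def analyze_plain(chunk):
    """Hypothetical analyzer for fieldB: lowercases and splits, nothing else."""
    return chunk.lower().split()

FIELDS = {"fieldA": analyze_with_synonym, "fieldB": analyze_plain}

def build_top_level_clauses(query):
    """One clause per chunk; each clause holds the per-field analysis results."""
    clauses = []
    for chunk in split_chunks(query):
        per_field = {}
        for name, analyze in FIELDS.items():
            tokens = analyze(chunk)
            if tokens:
                per_field[name] = tokens
        if per_field:  # a chunk that yields no terms in any field adds no clause
            clauses.append(per_field)
    return clauses

quoted = build_top_level_clauses('"Search Engine" tutorial')
unquoted = build_top_level_clauses('Search Engine tutorial')
print(len(quoted), quoted)
print(len(unquoted), unquoted)
```

In the quoted case the phrase becomes one term in fieldA and two in fieldB, yet it still contributes a single top-level clause. In the unquoted case the synonym never fires, because fieldA's analyzer only ever sees one word at a time.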

: What if I remove a stopword but add another token when synonyms come in?

try it ... you'll see what i mean.

(when it comes to query parsing, no amount of textual description can
substitute for first hand experience and experimentation -- i've written
documentation, blogs, emails ... i've even done training classes where i've
discussed this specific thing for ~1 hour -- nothing makes it hit home
like having people sit down and actually play with the config and see the
output)



-Hoss



Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Em <ma...@yahoo.de>.
Hoss,

did you have a look at the responses you got?
The first one is really interesting. It asks about the behaviour when
synonyms come into play.

From my understanding this could be also dangerous for queries that
reduce the number of tokens.
Imagine: Search Engine => SE (reduced to SE).
This should have the same impact on the min should match as a stopword, no?

What if I remove a stopword but add another token when synonyms come in?

Just some thoughts :).

Regards,
Em

On 07.10.2011 19:37, Em wrote:
> Hi Hoss,
> 
> I read your article.
> 
> I have to review the solr-code but with the help of your pseudo-code I
> think I understand what goes on now.
> 
> Thank you!
> 
> Regards,
> Em


Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Em <ma...@yahoo.de>.
Hi Hoss,

I read your article.

I have to review the solr-code but with the help of your pseudo-code I
think I understand what goes on now.

Thank you!

Regards,
Em

On 05.10.2011 20:19, Chris Hostetter wrote:
> 
> : > Presumably this query would fail, since you've only got three clauses.
> : >  Easy to verify.
> : 
> : Seems like different behaviour compared to Solr. Probably Solr is
> : intelligent enough to reduce the parameter to the maximum value if it is
> : too large.
> 
> correct, the dismax parser in solr is smart enough not to calculate an 
> illegal value for minNrShouldMatch using the mm param.
> 
> : >> If so, what is the problem in Solr with Stopwords and the Dismax-Parser?
> 
> the problem people sometimes have understanding the interaction of the dismax
> parser and stopwords comes from using stopwords in the analyzers
> for *some* fields they are querying but not others, and then being
> surprised that the stopwords are still part of their overall query (in the
> fields where they didn't use them in their analyzer)...
>
> https://wiki.apache.org/solr/DisMax
> http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
>
> ...note in particular the "Where people tend to get tripped up..." para in
> that blog post
> 
> 
> -Hoss


Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Chris Hostetter <ho...@fucit.org>.
: > Presumably this query would fail, since you've only got three clauses.
: >  Easy to verify.
: 
: Seems like different behaviour compared to Solr. Probably Solr is
: intelligent enough to reduce the parameter to the maximum value if it is
: too large.

correct, the dismax parser in solr is smart enough not to calculate an 
illegal value for minNrShouldMatch using the mm param.
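
The clamping idea can be sketched as follows. This is an illustration in plain Python, not Solr's implementation: the function name is hypothetical, and the real mm parameter supports a richer syntax than the single integer assumed here.

```python
def effective_min_should_match(requested, optional_clause_count):
    """Clamp a requested minimum to the range [0, number of optional clauses].

    Hypothetical helper: it only shows that a value larger than the clause
    count is reduced to 'all clauses' instead of producing an unmatchable query.
    """
    return max(0, min(requested, optional_clause_count))

# Three clauses survive analysis, but the caller asked for four.
print(effective_min_should_match(4, 3))
# A request within range is left untouched.
print(effective_min_should_match(2, 3))
```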

: >> If so, what is the problem in Solr with Stopwords and the Dismax-Parser?

the problem people sometimes have understanding the interaction of the dismax
parser and stopwords comes from using stopwords in the analyzers
for *some* fields they are querying but not others, and then being
surprised that the stopwords are still part of their overall query (in the
fields where they didn't use them in their analyzer)...

https://wiki.apache.org/solr/DisMax
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

...note in particular the "Where people tend to get tripped up..." para in
that blog post
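
The stopword situation can be reproduced with a toy model. The sketch below is plain Python under simplifying assumptions: the field names (title, author), the stopword list and all helper names are made up, and matching is reduced to set membership. It is not how Solr scores or executes the query.

```python
STOPWORDS = {"the", "of"}

def analyze_stopped(word):
    """Hypothetical analyzer with a stop filter."""
    w = word.lower()
    return [] if w in STOPWORDS else [w]

def analyze_raw(word):
    """Hypothetical analyzer without a stop filter."""
    return [word.lower()]

def dismax_clauses(query, fields):
    """One clause per query word, containing only the fields that produced terms."""
    clauses = []
    for word in query.split():
        per_field = {}
        for name, analyze in fields.items():
            tokens = analyze(word)
            if tokens:
                per_field[name] = tokens
        if per_field:
            clauses.append(per_field)
    return clauses

def clause_matches(clause, doc):
    """A clause matches if any of its field/term pairs occurs in the document."""
    return any(tok in doc.get(field, ())
               for field, toks in clause.items() for tok in toks)

def matches(query, fields, doc, mm):
    """Require min(mm, number of clauses) clauses to match."""
    clauses = dismax_clauses(query, fields)
    hits = sum(clause_matches(c, doc) for c in clauses)
    return hits >= min(mm, len(clauses))

doc = {"title": {"president"}, "author": {"smith"}}
mixed = {"title": analyze_stopped, "author": analyze_raw}
all_stopped = {"title": analyze_stopped, "author": analyze_stopped}

print(dismax_clauses("the president", mixed))
print(matches("the president", mixed, doc, mm=2))
print(matches("the president", all_stopped, doc, mm=2))
```

With the mixed configuration, "the" survives as a clause that can only match in the author field, so requiring both clauses rejects a document whose title matches "president". When every queried field removes stopwords, the clause disappears and the same document matches.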


-Hoss



Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Em <ma...@yahoo.de>.
Hi Ian,

thanks for the fast feedback.
>> If the MM was set to 4 (too many), then this means all queries have to
>> match?
>
> Presumably this query would fail, since you've only got three clauses.
>  Easy to verify.

Seems like different behaviour compared to Solr. Probably Solr is
intelligent enough to reduce the parameter to the maximum value if it is
too large.

I'll wait a little bit, before reposting my question on the Solr list.

Regards,
Em

On 05.10.2011 12:51, Ian Lea wrote:
> Sorry - you did say StopFilter or SynonymFilter but I started talking
> about oal.search.Filter instead.
> 
>> So if an Analyzer contains a StopFilter and the parser uses this
>> Analyzer, then the following will happen:
>>
>> Original:
>> "To be or not to be said Shakespeare"
>>
>> Stopwords: To, be, or
>>
>> Resulting BooleanClauses:
>> - not
>> - said
>> - Shakespeare
>>
>> Is this right?
> 
> Yes.
> 
>> If the MM was set to 4 (too many), then this means all queries have to
>> match?
> 
> Presumably this query would fail, since you've only got three clauses.
>  Easy to verify.
> 
>> If so, what is the problem in Solr with Stopwords and the Dismax-Parser?
> 
> That sounds like a different question, maybe one for the solr list.
> 
> 
> --
> Ian.


Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Ian Lea <ia...@gmail.com>.
Sorry - you did say StopFilter or SynonymFilter but I started talking
about oal.search.Filter instead.

> So if an Analyzer contains a StopFilter and the parser uses this
> Analyzer, then the following will happen:
>
> Original:
> "To be or not to be said Shakespeare"
>
> Stopwords: To, be, or
>
> Resulting BooleanClauses:
> - not
> - said
> - Shakespeare
>
> Is this right?

Yes.
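
The example above can be checked with a few lines of plain Python. This is only a sketch: the helper name is hypothetical, and it assumes whitespace tokenization with case-insensitive stopword matching, which is simpler than a real analyzer chain.

```python
STOPWORDS = {"to", "be", "or"}

def remaining_clauses(text):
    """Return the tokens that survive stopword removal, in original order."""
    return [tok for tok in text.split() if tok.lower() not in STOPWORDS]

print(remaining_clauses("To be or not to be said Shakespeare"))
```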

> If the MM was set to 4 (too many), then this means all queries have to
> match?

Presumably this query would fail, since you've only got three clauses.
 Easy to verify.
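
A sketch of why such a query cannot match, assuming the minimum is applied as given and not adjusted to the clause count. The function is a plain-Python stand-in with a hypothetical name, not the Lucene API.

```python
def boolean_matches(doc_terms, optional_clauses, min_should_match):
    """A document matches if at least min_should_match optional clauses hit."""
    matched = sum(1 for clause in optional_clauses if clause in doc_terms)
    return matched >= min_should_match

clauses = ["not", "said", "shakespeare"]
doc = {"not", "said", "shakespeare", "hamlet"}

# Even a document containing every clause reaches only 3 of the required 4.
print(boolean_matches(doc, clauses, 4))
print(boolean_matches(doc, clauses, 3))
```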

> If so, what is the problem in Solr with Stopwords and the Dismax-Parser?

That sounds like a different question, maybe one for the solr list.


--
Ian.



Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Em <ma...@yahoo.de>.
Hi,

thank you Uwe and Ian!

So if an Analyzer contains a StopFilter and the parser uses this
Analyzer, then the following will happen:

Original:
"To be or not to be said Shakespeare"

Stopwords: To, be, or

Resulting BooleanClauses:
- not
- said
- Shakespeare

Is this right?

If the MM was set to 4 (too many), then this means all queries have to
match?

If so, what is the problem in Solr with Stopwords and the Dismax-Parser?

Regards,
Em

On 05.10.2011 11:39, Uwe Schindler wrote:
> Hi,
> 
> The TooManyClauses exception is thrown by BooleanQuery.add(BooleanClause). Because
> of this, it can only count clauses actually added to the BooleanQuery -
> terms thrown away by QueryParser beforehand are not counted as they will not be
> in the final query. If a token in the query parser expands to multiple
> synonyms, multiple clauses are added and count against the limit.
> 
> Uwe


RE: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The TooManyClauses exception is thrown by BooleanQuery.add(BooleanClause). Because
of this, it can only count clauses actually added to the BooleanQuery -
terms thrown away by QueryParser beforehand are not counted as they will not be
in the final query. If a token in the query parser expands to multiple
synonyms, multiple clauses are added and count against the limit.
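
The counting rule can be sketched like this. It is a plain-Python model with hypothetical names (ClauseList, ClauseLimitExceeded, build_query) and a deliberately tiny limit; it stands in for the behaviour described above and does not use Lucene classes.

```python
class ClauseLimitExceeded(Exception):
    """Raised when one clause too many is added (stand-in for Lucene's exception)."""

class ClauseList:
    def __init__(self, max_clauses):
        self.max_clauses = max_clauses
        self.clauses = []

    def add(self, clause):
        # The check happens at add time, so only clauses that are
        # actually added can ever count against the limit.
        if len(self.clauses) >= self.max_clauses:
            raise ClauseLimitExceeded(f"more than {self.max_clauses} clauses")
        self.clauses.append(clause)

STOPWORDS = {"the"}
SYNONYMS = {"fast": ["fast", "quick", "rapid"]}

def build_query(text, max_clauses):
    """Drop stopwords before adding; expand synonyms into one clause each."""
    query = ClauseList(max_clauses)
    for token in text.lower().split():
        if token in STOPWORDS:
            continue  # never reaches add(), so it is not counted
        for term in SYNONYMS.get(token, [token]):
            query.add(term)
    return query

print(build_query("the fast car", max_clauses=4).clauses)
try:
    build_query("the fast car", max_clauses=3)
except ClauseLimitExceeded as exc:
    print("limit hit:", exc)
```

Three input words produce four clauses here: the stopword contributes none and the synonym expansion contributes three.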

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Ian Lea [mailto:ian.lea@gmail.com]
> Sent: Wednesday, October 05, 2011 11:32 AM
> To: java-user@lucene.apache.org
> Subject: Re: How is Number of Boolean Clauses calculated - Minimum Should
> Match?
> 
> It will work on the query, whether produced by a query parser or
> constructed in code.  I don't see that the number of clauses will
> change if you are applying filters.  Filters are not query clauses,
> although it can get confusing if you start using stuff like
> FilteredQuery or QueryWrapperFilter.
> 
> 
> --
> Ian.


Re: How is Number of Boolean Clauses calculated - Minimum Should Match?

Posted by Ian Lea <ia...@gmail.com>.
It will work on the query, whether produced by a query parser or
constructed in code.  I don't see that the number of clauses will
change if you are applying filters.  Filters are not query clauses,
although it can get confusing if you start using stuff like
FilteredQuery or QueryWrapperFilter.


--
Ian.


On Wed, Oct 5, 2011 at 8:42 AM, Em <ma...@yahoo.de> wrote:
> Hello list,
>
> how does BooleanQuery calculate the number of its clauses? Is
> this number based on the analyzed query or on the raw query string?
>
> Imagine a StopFilter or a SynonymFilter is applied to a
> BooleanQuery during analysis - the number of clauses could shrink or
> grow.
>
> I recall that in connection with the MinimumShouldMatch param there may
> be problems if you query some fields with an applied StopFilter and some
> fields without.
>
> I tried to answer a question on the mailing lists and noticed that I am
> relatively unsure about how MM is calculated in general and especially
> in Solr (I did a code review, but it left me a little confused).
>
> Thank you!
>
> Regards,
> Em