Posted to java-user@lucene.apache.org by Jan Kurella <ja...@nokia.com> on 2010/11/26 14:39:53 UTC

DisMaxQuery calculating too high sumOfSquaredWeights?

Hi there,

I was composing a query on my own, like the Solr DisMaxQueryHandler
would, because I needed a different tokenizing strategy for
non-whitespace-separated languages, among other things. I took the
concept from
http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

Assume now the following:
Documents have two fields, "title" and "tag". User input can match any 
field, but almost all of it must be found.
Document <title:blue star> <tag:have fun>

Query: "blue star fun"

My query parser produces the following query:

BooleanQuery (
     DisjunctionMaxQuery (
         SpanTermQuery(title:blue),
         SpanTermQuery(tag:blue)
     ),
     DisjunctionMaxQuery (
         SpanTermQuery(title:star),
         SpanTermQuery(tag:star)
     ),
     DisjunctionMaxQuery (
         SpanTermQuery(title:fun),
         SpanTermQuery(tag:fun)
     ),
     minShouldMatch = 2
)

Obviously this is a "full match", meaning all three terms are found. 
From a subjective user perspective, its score should not differ much 
from that of a pure OR query "blue star fun" with all tokens in the same 
field. But surprisingly, the score from the DMQuery is extremely low!

Looking into it, it turns out that the queryNorm multiplied into the 
queryWeight of each SpanTermQuery is very small (0.16). The BooleanQuery 
calculates it from the sum of the sumOfSquaredWeights() of each DMQuery. 
And here is the problem: the idf of the STQuery (or a TermQuery) used to 
compute the weight is very high for a term that is not present (that is 
on purpose). Unfortunately, the DMQuery takes the highest idf (assuming 
tie=0.0) from all its clauses.
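
To make the effect concrete, here is a minimal standalone sketch of the
arithmetic. It assumes the DefaultSimilarity formulas
idf = 1 + ln(numDocs / (docFreq + 1)) and
queryNorm = 1 / sqrt(sumOfSquaredWeights). The class name, the index size
and the document frequencies are made up for illustration, and all boosts
are taken as 1.0.

```java
// Hypothetical, Lucene-free sketch of how an absent term shrinks the queryNorm.
class QueryNormSketch {

    // idf as computed by DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // queryNorm as computed by DefaultSimilarity: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        int numDocs = 1000;                    // assumed index size
        float idfPresent = idf(100, numDocs);  // term found in 100 documents
        float idfAbsent = idf(0, numDocs);     // term not in the field at all

        // With tie=0.0 each of the three DMQueries contributes only its
        // largest squared clause weight to the sum.
        float normIfAbsentWins = queryNorm(3 * idfAbsent * idfAbsent);
        float normIfPresentWins = queryNorm(3 * idfPresent * idfPresent);

        System.out.println("idf(present) = " + idfPresent);
        System.out.println("idf(absent)  = " + idfAbsent);
        System.out.println("queryNorm, absent clause dominates:  " + normIfAbsentWins);
        System.out.println("queryNorm, present clause dominates: " + normIfPresentWins);
    }
}
```

The absent term has the larger idf, so it determines the sum and pulls
the queryNorm down, even though it can never contribute to a hit.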

By design, for the dismax query as a whole, the chance that some term 
will not be found in a given DMQuery is near 100%, especially if you 
search across many fields. Thus, the idf of a DMQuery is almost always 
equal to that of a TermQuery whose term will not be found. But for 
scoring, only the clause of the DMQuery that hit is taken into account. 
This leads to scores that are too small!

What I think would be the correct idf for a DMQuery consisting purely 
of TermQueries is something like

if any term matches
     take the highest (plus tiestuff) idf from these clauses,
else
     take the highest idf
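
The rule above could be sketched as follows, outside of Lucene. This
assumes that "matches" means the term occurs in the index at all
(docFreq > 0), ignores the tie-breaker, and reuses the DefaultSimilarity
idf formula; the class and method names are hypothetical.

```java
// Hypothetical sketch of the idf selection rule for one DMQuery.
class DisMaxIdfSketch {

    // idf as computed by DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // docFreqs holds the document frequency of each clause's term.
    static float effectiveIdf(int[] docFreqs, int numDocs) {
        float bestMatching = -1.0f; // highest idf among terms with docFreq > 0
        float bestOverall = 0.0f;   // highest idf among all terms
        for (int docFreq : docFreqs) {
            float value = idf(docFreq, numDocs);
            bestOverall = Math.max(bestOverall, value);
            if (docFreq > 0) {
                bestMatching = Math.max(bestMatching, value);
            }
        }
        // Fall back to the overall maximum only if no term is present.
        return bestMatching >= 0.0f ? bestMatching : bestOverall;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // assumed index size
        // "blue" occurs in title (100 docs) but never in tag.
        System.out.println(effectiveIdf(new int[] {100, 0}, numDocs));
        // Term occurs in neither field.
        System.out.println(effectiveIdf(new int[] {0, 0}, numDocs));
    }
}
```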

Unfortunately, by the time sumOfSquaredWeights() is calculated, the idf 
has already been computed in a generally correct way, and I do not see a 
way to know in 
DisjunctionMaxQuery.DisjunctionMaxWeight.sumOfSquaredWeights() 
whether a returned currentWeight.sumOfSquaredWeights() comes from a 
TermQuery whose only term has a df of 0.

How can this be solved to get a "better" sumOfSquaredWeights() from 
DisMaxQuery? The current value does not reflect the intention of this query.

Jan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: DisMaxQuery calculating too high sumOfSquaredWeights?

Posted by Jan Kurella <ja...@nokia.com>.
On 26.11.2010 14:50, ext Jan Kurella wrote:
> What about this?
>
> public float sumOfSquaredWeights() throws IOException {
>     float min = Float.MAX_VALUE, sum = 0.0f;
>     for (Weight currentWeight : weights) {
>         float sub = currentWeight.sumOfSquaredWeights();
>         sum += sub;
>         min = Math.min(min, sub);
>     }
>     if (min == Float.MAX_VALUE) min=0.0f;
>     float boost = getBoost();
>     return (((sum - min) * tieBreakerMultiplier * tieBreakerMultiplier) + min) * boost * boost;
> }
>
I tried it with this little wrapper class:

import java.io.IOException;
import java.util.Collection;

import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Weight;

public class MinDisMaxQuery extends DisjunctionMaxQuery {

    // Own copy of the tie-breaker for use in the weight below.
    private final float _tieBreakerMultiplier;

    protected class MinDisMaxWeight extends DisjunctionMaxWeight {

        public MinDisMaxWeight(Searcher pSearcher) throws IOException {
            super(pSearcher);
        }

        // Same as the original, but built on the smallest clause weight
        // instead of the largest one.
        @Override
        public float sumOfSquaredWeights() throws IOException {
            float min = Float.MAX_VALUE, sum = 0.0f;
            for (Weight currentWeight : weights) {
                float sub = currentWeight.sumOfSquaredWeights();
                sum += sub;
                min = Math.min(min, sub);
            }
            // No clauses at all: nothing to normalize.
            if (min == Float.MAX_VALUE) {
                min = 0.0f;
            }
            float boost = getBoost();
            return (((sum - min) * _tieBreakerMultiplier * _tieBreakerMultiplier) + min) * boost * boost;
        }

    }

    public MinDisMaxQuery(float pTieBreakerMultiplier) {
        super(pTieBreakerMultiplier);
        _tieBreakerMultiplier = pTieBreakerMultiplier;
    }

    public MinDisMaxQuery(Collection<Query> pDisjuncts, float pTieBreakerMultiplier) {
        super(pDisjuncts, pTieBreakerMultiplier);
        _tieBreakerMultiplier = pTieBreakerMultiplier;
    }

    // Hand out the min-based weight instead of the default one.
    @Override
    public Weight createWeight(Searcher pSearcher) throws IOException {
        return new MinDisMaxWeight(pSearcher);
    }

}


Might this be the better approach for DisMax? At least the scores it 
produces now look very promising!




Re: DisMaxQuery calculating too high sumOfSquaredWeights?

Posted by Jan Kurella <ja...@nokia.com>.
What about this?

public float sumOfSquaredWeights() throws IOException {
     float min = Float.MAX_VALUE, sum = 0.0f;
     for (Weight currentWeight : weights) {
         float sub = currentWeight.sumOfSquaredWeights();
         sum += sub;
         min = Math.min(min, sub);
     }
     if (min == Float.MAX_VALUE) min=0.0f;
     float boost = getBoost();
     return (((sum - min) * tieBreakerMultiplier * tieBreakerMultiplier) + min) * boost * boost;
}
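
For comparison, here is a Lucene-free sketch of the two ways of combining
the clause values: the max-based combination described in the first mail
and the min-based one proposed here. The class name, method names and
sample numbers are made up for illustration; the inputs stand in for what
each currentWeight.sumOfSquaredWeights() would return.

```java
// Hypothetical side-by-side of the max-based and min-based combination.
class CombineSketch {

    // Keeps the largest clause value in full, scales the rest by tie^2.
    static float combineMax(float[] subs, float tie, float boost) {
        float max = 0.0f, sum = 0.0f;
        for (float sub : subs) {
            sum += sub;
            max = Math.max(max, sub);
        }
        return (((sum - max) * tie * tie) + max) * boost * boost;
    }

    // Keeps the smallest clause value in full, scales the rest by tie^2.
    static float combineMin(float[] subs, float tie, float boost) {
        float min = Float.MAX_VALUE, sum = 0.0f;
        for (float sub : subs) {
            sum += sub;
            min = Math.min(min, sub);
        }
        if (min == Float.MAX_VALUE) {
            min = 0.0f; // no clauses
        }
        return (((sum - min) * tie * tie) + min) * boost * boost;
    }

    public static void main(String[] args) {
        // Assumed squared weights: a present term and an absent term.
        float[] subs = {10.8f, 62.5f};
        System.out.println("max-based: " + combineMax(subs, 0.0f, 1.0f));
        System.out.println("min-based: " + combineMin(subs, 0.0f, 1.0f));
    }
}
```

With tie=0.0 the max-based variant is driven entirely by the absent term,
while the min-based variant is driven by the most frequent one. With
tie=1.0 both reduce to the plain sum of all clauses.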




