Posted to solr-user@lucene.apache.org by David Ryan <he...@gmail.com> on 2011/10/05 02:52:01 UTC

Scoring of DisMax in Solr

Hi,


When I examine the score calculation of DisMax in Solr, it looks to me
that DisMax is using tf x idf^2 instead of tf x idf.
Does anyone have insight into why tf x idf is not used here?

Here is the score contribution from one field:

score(q,c) =  queryWeight x fieldWeight
               = tf x idf x idf x queryNorm x fieldNorm

Here is the example that I used to derive the formula above. Clearly, idf is
multiplied twice in the score calculation.
http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=on&debugQuery=true&fl=id,score

    <str name="6H500F0">
0.18314168 = (MATCH) sum of:
  0.18314168 = (MATCH) weight(text:gb in 1), product of:
    0.35845062 = queryWeight(text:gb), product of:
      2.3121865 = idf(docFreq=6, numDocs=26)
      0.15502669 = queryNorm
    0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
      1.4142135 = tf(termFreq(text:gb)=2)
      2.3121865 = idf(docFreq=6, numDocs=26)
      0.15625 = fieldNorm(field=text, doc=1)
</str>


Thanks!

Re: Scoring of DisMax in Solr

Posted by David Ryan <he...@gmail.com>.
Ok, here is the calculation of the score:

0.18314168 = 2.3121865 * 0.15502669 * 1.4142135 * 2.3121865 * 0.15625

The idf value 2.3121865 is multiplied in twice here. That is what I mean when
I say tf x idf^2 is used instead of tf x idf.
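
As a quick sanity check, the product above can be reproduced in a few lines
of Python. This is only a sketch: the variable names are made up for
illustration, and the numbers are copied from the explain output in this
thread.

```python
# Factors copied from the debugQuery explain output quoted in this thread.
idf = 2.3121865          # idf(docFreq=6, numDocs=26)
query_norm = 0.15502669  # queryNorm
tf = 1.4142135           # tf(termFreq(text:gb)=2)
field_norm = 0.15625     # fieldNorm(field=text, doc=1)

# idf enters once through queryWeight and once through fieldWeight.
score = idf * query_norm * tf * idf * field_norm

# The same product regrouped to make the tf x idf^2 shape explicit.
regrouped = tf * idf ** 2 * query_norm * field_norm

print(round(score, 8), round(regrouped, 8))
```

Both groupings agree with the reported 0.18314168 to within rounding of the
printed digits.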



On Wed, Oct 5, 2011 at 10:42 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> Hi,
>
> I don't see 2.3121865 * 2 anywhere in your debug output or something that
> looks like that.
>
>
> > Hi Markus,
> >
> > The idf calculation itself is correct.
> > What I am trying to understand here is  why idf value is multiplied twice
> > in the final score calculation. Essentially,  tf x idf^2 is used instead
> > of tf x idf.
> > I'd like to understand the rationale behind that.
> >
> > On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma
> <ma...@openindex.io>wrote:
> > > In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
> > > 1 + ln(26 / 7) =~ 2.3121865
> > >
> > > I don't see a problem.
> > >
> > > > Hi,
> > > >
> > > >
> > > > When I examine the score calculation of DisMax in Solr,   it looks to
> > > > me that DisMax is using  tf x idf^2 instead of tf x idf.
> > > > Does anyone have insight why tf x idf is not used here?
> > > >
> > > > Here is the score contribution from one field:
> > > >
> > > > score(q,c) =  queryWeight x fieldWeight
> > > >
> > > >                = tf x idf x idf x queryNorm x fieldNorm
> > > >
> > > > Here is the example that I used to derive the formula above. Clearly,
> > > > idf is multiplied twice in the score calculation.
> > > > *
> > >
> > >
> http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&inden
> > > t=
> > >
> > > > on&debugQuery=true&fl=id,score *
> > > >
> > > >     <str name="6H500F0">
> > > >
> > > > 0.18314168 = (MATCH) sum of:
> > > >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
> > > >     0.35845062 = queryWeight(text:gb), product of:
> > > >       2.3121865 = idf(docFreq=6, numDocs=26)
> > > >       0.15502669 = queryNorm
> > > >
> > > >     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> > > >       1.4142135 = tf(termFreq(text:gb)=2)
> > > >       2.3121865 = idf(docFreq=6, numDocs=26)
> > > >       0.15625 = fieldNorm(field=text, doc=1)
> > > >
> > > > </str>
> > > >
> > > >
> > > > Thanks!
>

Re: Scoring of DisMax in Solr

Posted by Bill Bell <bi...@gmail.com>.
Markus,

The calculation is correct.

Look at your output.

Result = queryWeight(text:gb) * fieldWeight(text:gb in 1)

Result = (idf(docFreq=6, numDocs=26) * queryNorm) *
(tf(termFreq(text:gb)=2) * idf(docFreq=6, numDocs=26) *
fieldNorm(field=text, doc=1))

You should notice that idf(docFreq=6, numDocs=26) appears twice.

This is just how weight() is calculated.
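
The grouping above can be written out as a small Python sketch. The helper
names query_weight and field_weight are hypothetical (they mirror the labels
in the explain output, not any Lucene API), and the inputs are the values
from the quoted output.

```python
def query_weight(idf, query_norm):
    """Query-side factor as shown in the explain output: idf * queryNorm."""
    return idf * query_norm


def field_weight(tf, idf, field_norm):
    """Document-side factor as shown in the explain output: tf * idf * fieldNorm."""
    return tf * idf * field_norm


idf = 2.3121865
qw = query_weight(idf, 0.15502669)
fw = field_weight(1.4142135, idf, 0.15625)

# idf is a factor of both qw and fw, so the product contains idf squared.
weight = qw * fw
print(round(qw, 7), round(fw, 7), round(weight, 7))
```

The two intermediate values line up with the 0.35845062 and 0.5109258 lines
of the explain tree, which is where each copy of idf comes from.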




> > 0.18314168 = (MATCH) sum of:
> >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
> >     0.35845062 = queryWeight(text:gb), product of:
> >       2.3121865 = idf(docFreq=6, numDocs=26)
> >       0.15502669 = queryNorm
> >    
> >     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> >       1.4142135 = tf(termFreq(text:gb)=2)
> >       2.3121865 = idf(docFreq=6, numDocs=26)
> >       0.15625 = fieldNorm(field=text, doc=1)





On 10/5/11 11:42 AM, "Markus Jelsma" <ma...@openindex.io> wrote:

>Hi,
>
>I don't see 2.3121865 * 2 anywhere in your debug output or something that
>looks like that.
>
>
>> Hi Markus,
>> 
>> The idf calculation itself is correct.
>> What I am trying to understand here is  why idf value is multiplied
>>twice
>> in the final score calculation. Essentially,  tf x idf^2 is used instead
>> of tf x idf.
>> I'd like to understand the rationale behind that.
>> 
>> On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma
><ma...@openindex.io>wrote:
>> > In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
>> > 1 + ln(26 / 7) =~ 2.3121865
>> > 
>> > I don't see a problem.
>> > 
>> > > Hi,
>> > > 
>> > > 
>> > > When I examine the score calculation of DisMax in Solr,   it looks
>>to
>> > > me that DisMax is using  tf x idf^2 instead of tf x idf.
>> > > Does anyone have insight why tf x idf is not used here?
>> > > 
>> > > Here is the score contribution from one field:
>> > > 
>> > > score(q,c) =  queryWeight x fieldWeight
>> > > 
>> > >                = tf x idf x idf x queryNorm x fieldNorm
>> > > 
>> > > Here is the example that I used to derive the formula above.
>>Clearly,
>> > > idf is multiplied twice in the score calculation.
>> > > *
>> > 
>> > 
>>http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&inden
>> > t=
>> > 
>> > > on&debugQuery=true&fl=id,score *
>> > > 
>> > >     <str name="6H500F0">
>> > > 
>> > > 0.18314168 = (MATCH) sum of:
>> > >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
>> > >     0.35845062 = queryWeight(text:gb), product of:
>> > >       2.3121865 = idf(docFreq=6, numDocs=26)
>> > >       0.15502669 = queryNorm
>> > >     
>> > >     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
>> > >       1.4142135 = tf(termFreq(text:gb)=2)
>> > >       2.3121865 = idf(docFreq=6, numDocs=26)
>> > >       0.15625 = fieldNorm(field=text, doc=1)
>> > > 
>> > > </str>
>> > > 
>> > > 
>> > > Thanks!



Re: Scoring of DisMax in Solr

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

I don't see 2.3121865 * 2 anywhere in your debug output or something that 
looks like that.


> Hi Markus,
> 
> The idf calculation itself is correct.
> What I am trying to understand here is  why idf value is multiplied twice
> in the final score calculation. Essentially,  tf x idf^2 is used instead
> of tf x idf.
> I'd like to understand the rationale behind that.
> 
> On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma 
<ma...@openindex.io>wrote:
> > In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
> > 1 + ln(26 / 7) =~ 2.3121865
> > 
> > I don't see a problem.
> > 
> > > Hi,
> > > 
> > > 
> > > When I examine the score calculation of DisMax in Solr,   it looks to
> > > me that DisMax is using  tf x idf^2 instead of tf x idf.
> > > Does anyone have insight why tf x idf is not used here?
> > > 
> > > Here is the score contribution from one field:
> > > 
> > > score(q,c) =  queryWeight x fieldWeight
> > > 
> > >                = tf x idf x idf x queryNorm x fieldNorm
> > > 
> > > Here is the example that I used to derive the formula above. Clearly,
> > > idf is multiplied twice in the score calculation.
> > > *
> > 
> > http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&inden
> > t=
> > 
> > > on&debugQuery=true&fl=id,score *
> > > 
> > >     <str name="6H500F0">
> > > 
> > > 0.18314168 = (MATCH) sum of:
> > >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
> > >     0.35845062 = queryWeight(text:gb), product of:
> > >       2.3121865 = idf(docFreq=6, numDocs=26)
> > >       0.15502669 = queryNorm
> > >     
> > >     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> > >       1.4142135 = tf(termFreq(text:gb)=2)
> > >       2.3121865 = idf(docFreq=6, numDocs=26)
> > >       0.15625 = fieldNorm(field=text, doc=1)
> > > 
> > > </str>
> > > 
> > > 
> > > Thanks!

Re: Scoring of DisMax in Solr

Posted by David Ryan <he...@gmail.com>.
Hi Markus,

The idf calculation itself is correct.
What I am trying to understand here is why the idf value is multiplied twice
in the final score calculation. Essentially, tf x idf^2 is used instead of
tf x idf.
I'd like to understand the rationale behind that.





On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma <ma...@openindex.io>wrote:

> In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
> 1 + ln(26 / 7) =~ 2.3121865
>
> I don't see a problem.
>
> > Hi,
> >
> >
> > When I examine the score calculation of DisMax in Solr,   it looks to me
> > that DisMax is using  tf x idf^2 instead of tf x idf.
> > Does anyone have insight why tf x idf is not used here?
> >
> > Here is the score contribution from one field:
> >
> > score(q,c) =  queryWeight x fieldWeight
> >                = tf x idf x idf x queryNorm x fieldNorm
> >
> > Here is the example that I used to derive the formula above. Clearly, idf
> > is multiplied twice in the score calculation.
> > *
> >
> http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=
> > on&debugQuery=true&fl=id,score *
> >
> >     <str name="6H500F0">
> > 0.18314168 = (MATCH) sum of:
> >   0.18314168 = (MATCH) weight(text:gb in 1), product of:
> >     0.35845062 = queryWeight(text:gb), product of:
> >       2.3121865 = idf(docFreq=6, numDocs=26)
> >       0.15502669 = queryNorm
> >     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> >       1.4142135 = tf(termFreq(text:gb)=2)
> >       2.3121865 = idf(docFreq=6, numDocs=26)
> >       0.15625 = fieldNorm(field=text, doc=1)
> > </str>
> >
> >
> > Thanks!
>

Re: Scoring of DisMax in Solr

Posted by Markus Jelsma <ma...@openindex.io>.
In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
1 + ln(26 / 7) =~ 2.3121865

I don't see a problem.
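
The idf value can be checked with a short Python sketch. It assumes the
Lucene 3.x DefaultSimilarity definitions idf = 1 + ln(numDocs / (docFreq + 1))
and tf = sqrt(termFreq); the function names are illustrative only. Lucene
computes these in single-precision floats, so the last printed digit may
differ slightly.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(term_freq):
    """Term frequency factor: square root of the raw term count."""
    return math.sqrt(term_freq)


# Values from the explain output: docFreq=6, numDocs=26, termFreq=2.
print(round(idf(6, 26), 6))
print(round(tf(2), 6))
```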

> Hi,
> 
> 
> When I examine the score calculation of DisMax in Solr,   it looks to me
> that DisMax is using  tf x idf^2 instead of tf x idf.
> Does anyone have insight why tf x idf is not used here?
> 
> Here is the score contribution from one field:
> 
> score(q,c) =  queryWeight x fieldWeight
>                = tf x idf x idf x queryNorm x fieldNorm
> 
> Here is the example that I used to derive the formula above. Clearly, idf
> is multiplied twice in the score calculation.
> *
> http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=
> on&debugQuery=true&fl=id,score *
> 
>     <str name="6H500F0">
> 0.18314168 = (MATCH) sum of:
>   0.18314168 = (MATCH) weight(text:gb in 1), product of:
>     0.35845062 = queryWeight(text:gb), product of:
>       2.3121865 = idf(docFreq=6, numDocs=26)
>       0.15502669 = queryNorm
>     0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
>       1.4142135 = tf(termFreq(text:gb)=2)
>       2.3121865 = idf(docFreq=6, numDocs=26)
>       0.15625 = fieldNorm(field=text, doc=1)
> </str>
> 
> 
> Thanks!

Re: Scoring of DisMax in Solr

Posted by David Ryan <he...@gmail.com>.
True, the example does not show the evidence, but we do use eDisMax for
scoring in Solr.

The following is from solrconfig.xml:

<str name="defType">edismax</str>


Here is a short snippet of the explained result, where 0.1 is the Tie
breaker in DisMax/eDisMax.

6.446447 = (MATCH) max plus 0.1 times others of:

    0.63826215 = (MATCH) weight(description:sony^0.25 in 802), product of:

   .....
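
For reference, the "max plus 0.1 times others" line corresponds to the
following combination rule, sketched here in Python. The function name and
the sub-scores in the example are hypothetical; they are not the elided
values from the output above.

```python
def dismax_score(sub_scores, tie):
    """Best sub-score plus `tie` times the sum of the remaining sub-scores."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    others = sum(sub_scores) - best
    return best + tie * others


# Hypothetical per-field scores for one document, with tie = 0.1.
example = dismax_score([1.0, 0.5, 0.25], tie=0.1)
print(example)
```

With tie = 0 only the best-matching field counts; with tie = 1 the result is
a plain sum over the fields. Each sub-score is itself an ordinary per-field
weight, so the tie breaker does not change how idf enters each one.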


I noticed that DefaultSimilarity uses tf x idf^2 instead of tf x idf, as
stated in your link.


I am wondering if anyone has insight into why DisMax/eDisMax adopts the same
approach of using tf x idf^2.

I will try java-user@lucene mailing list as well.




On Wed, Oct 5, 2011 at 11:30 AM, Chris Hostetter
<ho...@fucit.org>wrote:

>
> : Thanks! What's the procedure to report this if it's a bug?
> : EDisMax has similar behavior.
>
> what you are seeing isn't specific to dismax & edismax (in fact: there's
> no evidence in your example that dismax is even being used)
>
> what you are seeing is the basic scoring of a TermQuery using the
> DefaultSimilarity in lucene...
>
>
> https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/Similarity.html
>
> ...if you have specific questions about how/why this scoring formula is
> used, i would suggest posting them to the java-user@lucene mailing list.
>
>
> -Hoss
>

Re: Scoring of DisMax in Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks! What's the procedure to report this if it's a bug?
: EDisMax has similar behavior.

what you are seeing isn't specific to dismax & edismax (in fact: there's
no evidence in your example that dismax is even being used)

what you are seeing is the basic scoring of a TermQuery using the  
DefaultSimilarity in lucene...

https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/Similarity.html

...if you have specific questions about how/why this scoring formula is
used, i would suggest posting them to the java-user@lucene mailing list.
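
On the query side, the Similarity javadoc linked above describes queryNorm
as 1 / sqrt(sumOfSquaredWeights), where each term contributes
(idf * boost)^2. A minimal Python sketch of that normalization follows; the
function names and sample idf values are illustrative assumptions, and the
0.15502669 from the explain output is not reproduced because the full parsed
query is not shown in this thread.

```python
import math


def sum_of_squared_weights(term_idfs, term_boosts=None, query_boost=1.0):
    """query_boost^2 times the sum over terms of (idf * term_boost)^2."""
    if term_boosts is None:
        term_boosts = [1.0] * len(term_idfs)
    total = sum((i * b) ** 2 for i, b in zip(term_idfs, term_boosts))
    return query_boost ** 2 * total


def query_norm(ssw):
    """Normalization factor: 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(ssw)


# Illustrative idf values only, for a hypothetical two-term query.
norm = query_norm(sum_of_squared_weights([3.0, 4.0]))
print(norm)
```

queryNorm is the same for every document matched by a given query, so it
does not affect ranking within one result list; the query-side idf is what
lets rarer terms carry more weight when a query has several clauses.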


-Hoss

Re: Scoring of DisMax in Solr

Posted by David Ryan <he...@gmail.com>.
Thanks! What's the procedure to report this if it's a bug?
EDisMax has similar behavior.

On Tue, Oct 4, 2011 at 11:24 PM, Bill Bell <bi...@gmail.com> wrote:

> This seems like a bug to me.
>
> On 10/4/11 6:52 PM, "David Ryan" <he...@gmail.com> wrote:
>
> >Hi,
> >
> >
> >When I examine the score calculation of DisMax in Solr,   it looks to me
> >that DisMax is using  tf x idf^2 instead of tf x idf.
> >Does anyone have insight why tf x idf is not used here?
> >
> >Here is the score contribution from one field:
> >
> >score(q,c) =  queryWeight x fieldWeight
> >               = tf x idf x idf x queryNorm x fieldNorm
> >
> >Here is the example that I used to derive the formula above. Clearly, idf
> >is
> >multiplied twice in the score calculation.
> >*
> >
> http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent
> >=on&debugQuery=true&fl=id,score
> >*
> >
> >    <str name="6H500F0">
> >0.18314168 = (MATCH) sum of:
> >  0.18314168 = (MATCH) weight(text:gb in 1), product of:
> >    0.35845062 = queryWeight(text:gb), product of:
> >      2.3121865 = idf(docFreq=6, numDocs=26)
> >      0.15502669 = queryNorm
> >    0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> >      1.4142135 = tf(termFreq(text:gb)=2)
> >      2.3121865 = idf(docFreq=6, numDocs=26)
> >      0.15625 = fieldNorm(field=text, doc=1)
> ></str>
> >
> >
> >Thanks!
>
>
>

Re: Scoring of DisMax in Solr

Posted by Bill Bell <bi...@gmail.com>.
This seems like a bug to me.

On 10/4/11 6:52 PM, "David Ryan" <he...@gmail.com> wrote:

>Hi,
>
>
>When I examine the score calculation of DisMax in Solr,   it looks to me
>that DisMax is using  tf x idf^2 instead of tf x idf.
>Does anyone have insight why tf x idf is not used here?
>
>Here is the score contribution from one field:
>
>score(q,c) =  queryWeight x fieldWeight
>               = tf x idf x idf x queryNorm x fieldNorm
>
>Here is the example that I used to derive the formula above. Clearly, idf
>is
>multiplied twice in the score calculation.
>*
>http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent
>=on&debugQuery=true&fl=id,score
>*
>
>    <str name="6H500F0">
>0.18314168 = (MATCH) sum of:
>  0.18314168 = (MATCH) weight(text:gb in 1), product of:
>    0.35845062 = queryWeight(text:gb), product of:
>      2.3121865 = idf(docFreq=6, numDocs=26)
>      0.15502669 = queryNorm
>    0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
>      1.4142135 = tf(termFreq(text:gb)=2)
>      2.3121865 = idf(docFreq=6, numDocs=26)
>      0.15625 = fieldNorm(field=text, doc=1)
></str>
>
>
>Thanks!