Posted to solr-user@lucene.apache.org by David Ryan <he...@gmail.com> on 2011/10/05 02:52:01 UTC
Scoring of DisMax in Solr
Hi,
When I examine the score calculation of DisMax in Solr, it looks to me
as if DisMax is using tf x idf^2 instead of tf x idf.
Does anyone have insight into why tf x idf is not used here?
Here is the score contribution from one field:
score(q,c) = queryWeight x fieldWeight
= tf x idf x idf x queryNorm x fieldNorm
Here is the example that I used to derive the formula above. Clearly, idf is
multiplied twice in the score calculation.
http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=on&debugQuery=true&fl=id,score
<str name="6H500F0">
0.18314168 = (MATCH) sum of:
  0.18314168 = (MATCH) weight(text:gb in 1), product of:
    0.35845062 = queryWeight(text:gb), product of:
      2.3121865 = idf(docFreq=6, numDocs=26)
      0.15502669 = queryNorm
    0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
      1.4142135 = tf(termFreq(text:gb)=2)
      2.3121865 = idf(docFreq=6, numDocs=26)
      0.15625 = fieldNorm(field=text, doc=1)
</str>
Thanks!
Re: Scoring of DisMax in Solr
Posted by David Ryan <he...@gmail.com>.
OK, here is the calculation of the score:
0.18314168 = 2.3121865 * 0.15502669 * 1.4142135 * 2.3121865 * 0.15625
2.3121865 is multiplied twice here. That is what I mean when I say that
tf x idf^2 is used instead of tf x idf.
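The arithmetic can be checked with a few lines of Python. This is only a sketch that multiplies the factors copied from the debugQuery output; the variable names are illustrative and are not Lucene identifiers. Lucene computes scores with 32-bit floats, so the last printed digit may differ slightly from the explain output.

```python
# Factors copied from the debugQuery output quoted in this thread.
idf = 2.3121865          # idf(docFreq=6, numDocs=26)
query_norm = 0.15502669  # queryNorm
tf = 1.4142135           # tf(termFreq(text:gb)=2)
field_norm = 0.15625     # fieldNorm(field=text, doc=1)

# idf enters once on the query side ...
query_weight = idf * query_norm
# ... and once more on the document side.
field_weight = tf * idf * field_norm

# The product therefore contains idf squared.
score = query_weight * field_weight

print(f"queryWeight = {query_weight:.7f}")
print(f"fieldWeight = {field_weight:.7f}")
print(f"score       = {score:.7f}")
```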
On Wed, Oct 5, 2011 at 10:42 AM, Markus Jelsma <ma...@openindex.io> wrote:
> Hi,
>
> I don't see 2.3121865 * 2 anywhere in your debug output or something that
> looks like that.
Re: Scoring of DisMax in Solr
Posted by Bill Bell <bi...@gmail.com>.
Markus,
The calculation is correct.
Look at your output.
Result = queryWeight(text:gb) * fieldWeight(text:gb in 1)
Result = (idf(docFreq=6, numDocs=26) * queryNorm) *
(tf(termFreq(text:gb)=2) * idf(docFreq=6, numDocs=26) *
fieldNorm(field=text, doc=1))
You should notice that idf(docFreq=6, numDocs=26) appears twice.
This is just how weight() is calculated.
> > 0.18314168 = (MATCH) sum of:
> > 0.18314168 = (MATCH) weight(text:gb in 1), product of:
> > 0.35845062 = queryWeight(text:gb), product of:
> > 2.3121865 = idf(docFreq=6, numDocs=26)
> > 0.15502669 = queryNorm
> >
> > 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> > 1.4142135 = tf(termFreq(text:gb)=2)
> > 2.3121865 = idf(docFreq=6, numDocs=26)
> > 0.15625 = fieldNorm(field=text, doc=1)
On 10/5/11 11:42 AM, "Markus Jelsma" <ma...@openindex.io> wrote:
>Hi,
>
>I don't see 2.3121865 * 2 anywhere in your debug output or something that
>looks like that.
Re: Scoring of DisMax in Solr
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
I don't see 2.3121865 * 2 anywhere in your debug output or something that
looks like that.
> Hi Markus,
>
> The idf calculation itself is correct.
> What I am trying to understand here is why idf value is multiplied twice
> in the final score calculation. Essentially, tf x idf^2 is used instead
> of tf x idf.
> I'd like to understand the rationale behind that.
Re: Scoring of DisMax in Solr
Posted by David Ryan <he...@gmail.com>.
Hi Markus,
The idf calculation itself is correct.
What I am trying to understand here is why the idf value is multiplied twice in
the final score calculation. Essentially, tf x idf^2 is used instead of tf
x idf.
I'd like to understand the rationale behind that.
On Wed, Oct 5, 2011 at 9:43 AM, Markus Jelsma <ma...@openindex.io> wrote:
> In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
> 1 + ln(26 / 7) =~ 2.3121865
>
> I don't see a problem.
Re: Scoring of DisMax in Solr
Posted by Markus Jelsma <ma...@openindex.io>.
In Lucene's default similarity idf = 1 + ln(numDocs / (df + 1)).
With docFreq=6 and numDocs=26: 1 + ln(26 / 7) =~ 2.3121865
I don't see a problem.
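The formula is easy to misread without parentheses, so here is a minimal Python sketch of the reading that reproduces the value in the debug output, 1 + ln(numDocs / (docFreq + 1)). The function names are made up for illustration and are not Lucene API names.

```python
import math


def default_idf(doc_freq: int, num_docs: int) -> float:
    """idf read as 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def misread_idf(doc_freq: int, num_docs: int) -> float:
    """The other possible reading, 1 + ln(numDocs / docFreq + 1), for contrast."""
    return 1.0 + math.log(num_docs / doc_freq + 1)


# Statistics from the explain output: docFreq=6, numDocs=26.
print(default_idf(6, 26))   # agrees with 2.3121865 up to float rounding
print(misread_idf(6, 26))   # a noticeably different value
```

Only the first reading matches idf(docFreq=6, numDocs=26) as reported by debugQuery.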
> Hi,
>
>
> When I examine the score calculation of DisMax in Solr, it looks to me
> that DisMax is using tf x idf^2 instead of tf x idf.
> Does anyone have insight why tf x idf is not used here?
>
> Here is the score contribution from one field:
>
> score(q,c) = queryWeight x fieldWeight
> = tf x idf x idf x queryNorm x fieldNorm
>
> Here is the example that I used to derive the formula above. Clearly, idf
> is multiplied twice in the score calculation.
> http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=on&debugQuery=true&fl=id,score
>
> <str name="6H500F0">
> 0.18314168 = (MATCH) sum of:
> 0.18314168 = (MATCH) weight(text:gb in 1), product of:
> 0.35845062 = queryWeight(text:gb), product of:
> 2.3121865 = idf(docFreq=6, numDocs=26)
> 0.15502669 = queryNorm
> 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> 1.4142135 = tf(termFreq(text:gb)=2)
> 2.3121865 = idf(docFreq=6, numDocs=26)
> 0.15625 = fieldNorm(field=text, doc=1)
> </str>
>
>
> Thanks!
Re: Scoring of DisMax in Solr
Posted by David Ryan <he...@gmail.com>.
The example does not include the evidence. But we do use eDisMax for
scoring in Solr.
The following is from solrconfig.xml:
<str name="defType">edismax</str>
Here is a short snippet of the explained result, where 0.1 is the Tie
breaker in DisMax/eDisMax.
6.446447 = (MATCH) max plus 0.1 times others of:
0.63826215 = (MATCH) weight(description:sony^0.25 in 802), product of:
.....
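For readers unfamiliar with the "max plus 0.1 times others" line: a DisjunctionMaxQuery takes the best-scoring clause and adds the tie breaker times the sum of the remaining clause scores. A minimal Python sketch follows; the function name and the per-field scores are hypothetical and are not taken from the explain output above.

```python
def dismax_score(clause_scores, tie):
    """Combine per-field scores as: max + tie * (sum of the other scores)."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


# Hypothetical per-field scores for one document.
field_scores = [3.0, 1.0, 0.5]

print(dismax_score(field_scores, tie=0.0))  # pure max: only the best field counts
print(dismax_score(field_scores, tie=0.1))  # best field plus 10% of the others
print(dismax_score(field_scores, tie=1.0))  # plain sum of all fields
```

Each clause score fed into this combination is itself an ordinary term score, which is where the idf factors come from.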
I noticed that DefaultSimilarity uses tf x idf^2 instead of tf x
idf, as stated in your link.
I am wondering if anyone has insight into why DisMax/eDisMax adopts the same
approach of using tf x idf^2.
I will try the java-user@lucene mailing list as well.
On Wed, Oct 5, 2011 at 11:30 AM, Chris Hostetter <ho...@fucit.org> wrote:
>
> : Thanks! What's the procedure to report this if it's a bug?
> : EDisMax has similar behavior.
>
> what you are seeing isn't specific to dismax & edismax (in fact: there's
> no evidence in your example that dismax is even being used)
>
> what you are seeing is the basic scoring of a TermQuery using the
> DefaultSimilarity in lucene...
>
>
> https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/Similarity.html
>
> ...if you have specific questions about how/why this scoring formula is
> used, i would suggest posting them to the java-user@lucene mailing list.
>
>
> -Hoss
>
Re: Scoring of DisMax in Solr
Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks! What's the procedure to report this if it's a bug?
: EDisMax has similar behavior.
what you are seeing isn't specific to dismax & edismax (in fact: there's
no evidence in your example that dismax is even being used)
what you are seeing is the basic scoring of a TermQuery using the
DefaultSimilarity in lucene...
https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/Similarity.html
...if you have specific questions about how/why this scoring formula is
used, i would suggest posting them to the java-user@lucene mailing list.
-Hoss
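For reference, the Similarity documentation linked above gives the practical scoring function as coord x queryNorm x the sum over query terms of tf x idf^2 x boost x norm, and notes that idf is squared because it appears in both the query weight and the document weight. The Python sketch below follows that formula under the DefaultSimilarity assumption tf = sqrt(termFreq). The function names and the dictionary layout are made up for illustration, and this is not Lucene's code.

```python
import math


def tf(term_freq):
    """DefaultSimilarity-style term frequency factor: sqrt(termFreq)."""
    return math.sqrt(term_freq)


def practical_score(terms, query_norm, coord=1.0):
    """coord * queryNorm * sum(tf * idf^2 * boost * norm) over matching terms.

    Each entry of `terms` is a dict with keys 'freq', 'idf', 'norm' and an
    optional 'boost' (hypothetical layout, chosen for this sketch only).
    """
    total = 0.0
    for t in terms:
        total += tf(t["freq"]) * t["idf"] ** 2 * t.get("boost", 1.0) * t["norm"]
    return coord * query_norm * total


# The single-term case from this thread: termFreq=2, idf=2.3121865,
# fieldNorm=0.15625, queryNorm=0.15502669.
gb = {"freq": 2, "idf": 2.3121865, "norm": 0.15625}
print(practical_score([gb], query_norm=0.15502669))
```

With one term and no boost this reduces to tf x idf x idf x queryNorm x fieldNorm, the expression discussed in this thread.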
Re: Scoring of DisMax in Solr
Posted by David Ryan <he...@gmail.com>.
Thanks! What's the procedure to report this if it's a bug?
EDisMax has similar behavior.
On Tue, Oct 4, 2011 at 11:24 PM, Bill Bell <bi...@gmail.com> wrote:
> This seems like a bug to me.
Re: Scoring of DisMax in Solr
Posted by Bill Bell <bi...@gmail.com>.
This seems like a bug to me.
On 10/4/11 6:52 PM, "David Ryan" <he...@gmail.com> wrote:
>Hi,
>
>
>When I examine the score calculation of DisMax in Solr, it looks to me
>that DisMax is using tf x idf^2 instead of tf x idf.
>Does anyone have insight why tf x idf is not used here?
>
>Here is the score contribution from one field:
>
>score(q,c) = queryWeight x fieldWeight
> = tf x idf x idf x queryNorm x fieldNorm
>
>Here is the example that I used to derive the formula above. Clearly, idf is
>multiplied twice in the score calculation.
>http://localhost:8983/solr/select/?q=GB&version=2.2&start=0&rows=10&indent=on&debugQuery=true&fl=id,score
>
> <str name="6H500F0">
>0.18314168 = (MATCH) sum of:
> 0.18314168 = (MATCH) weight(text:gb in 1), product of:
> 0.35845062 = queryWeight(text:gb), product of:
> 2.3121865 = idf(docFreq=6, numDocs=26)
> 0.15502669 = queryNorm
> 0.5109258 = (MATCH) fieldWeight(text:gb in 1), product of:
> 1.4142135 = tf(termFreq(text:gb)=2)
> 2.3121865 = idf(docFreq=6, numDocs=26)
> 0.15625 = fieldNorm(field=text, doc=1)
></str>
>
>
>Thanks!