Posted to solr-user@lucene.apache.org by "Markus Jelsma - Buyways B.V." <ma...@buyways.nl> on 2009/11/03 11:30:17 UTC
tf*idf scoring
Hello list,
I have a question about Lucene's calculation of tf*idf value. I first
noticed that Solr's tf does not compare to tf values based on
calculation elsewhere such as
http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml or http://en.wikipedia.org/wiki/Tf%E2%80%93idf
The tf values returned by Solr are always integers and are not normalized
against the length of the document, whilst the field in which it resides
does not have omitNorms="true".
Consider the following documents where the field subject is of the
standard text_ws type:
<result name="response" numFound="6" start="0">
<doc>
<str name="subject">a b c</str>
</doc>
<doc>
<str name="subject">d e f</str>
</doc>
<doc>
<str name="subject">x y z</str>
</doc>
<doc>
<str name="subject">a d x</str>
</doc>
<doc>
<str name="subject">a e z</str>
</doc>
<doc>
<str name="subject">c f z</str>
</doc>
</result>
Now, Solr's TermVector results for the first document:
<lst name="doc-0">
<str name="uniqueKey">0</str>
<lst name="subject">
<lst name="a">
<int name="tf">1</int>
<lst name="positions">
<int name="position">0</int>
</lst>
<int name="df">3</int>
<double name="tf-idf">0.3333333333333333</double>
</lst>
<lst name="b">
<int name="tf">1</int>
<lst name="positions">
<int name="position">1</int>
</lst>
<int name="df">1</int>
<double name="tf-idf">1.0</double>
</lst>
<lst name="c">
<int name="tf">1</int>
<lst name="positions">
<int name="position">2</int>
</lst>
<int name="df">2</int>
<double name="tf-idf">0.5</double>
</lst>
</lst>
</lst>
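As the rest of the thread bears out, the tf-idf reported here is simply the raw tf divided by df. A quick sketch (Python; the corpus is the six subject values above) reproduces the three values:

```python
# Reproduce the TermVector tf-idf values for doc-0 ("a b c"),
# assuming tf-idf is reported as raw tf divided by df.
docs = ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]

def df(term):
    """Document frequency: number of documents containing the term."""
    return sum(term in d.split() for d in docs)

doc0 = docs[0].split()
for term in doc0:
    tf = doc0.count(term)  # raw term frequency in doc-0
    print(term, tf, df(term), tf / df(term))
# a 1 3 0.3333333333333333
# b 1 1 1.0
# c 1 2 0.5
```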
According to different algorithms, the tf for term c would be 3 / 1 =
0.33 instead of the 1 returned by Solr. Also, the tf*idf value I get is
0.5 for term c and 0.333 for term a. It looks like tf*idf is the quotient
of term frequency and document frequency.
If I calculate tf*idf for term c in the first document, according to
other algorithms it would be:
tf = 3 / 1 = 0.333
idf = ln(6 / 2) = 1.0986
tf*idf = 0.333 * 1.0986 = 0.3658
Can someone either explain the difference demonstrated or tell me what I
am possibly doing wrong?
Cheers,
-
Markus Jelsma Buyways B.V.
Technisch Architect Friesestraatweg 215c
http://www.buyways.nl 9743 AD Groningen
Alg. 050-853 6600 KvK 01074105
Tel. 050-853 6620 Fax. 050-3118124
Mob. 06-5025 8350 In: http://www.linkedin.com/in/markus17
Re: tf*idf scoring
Posted by "Markus Jelsma - Buyways B.V." <ma...@buyways.nl>.
Thank you for your explanation
On Tue, 2009-11-03 at 07:32 -0800, Grant Ingersoll wrote:
> On Nov 3, 2009, at 5:54 AM, Markus Jelsma - Buyways B.V. wrote:
> >
> >
> > I see, but why not return the true values of Lucene?
>
> I'm not sure what you mean by this. The TVC returns the term
> frequency and the document frequency and TF/DF as reported by
> Lucene. The actual raw values. What you are asking for is for the
> TVC to return some other normalized values above and beyond the
> literal interpretation of TF/IDF. This can be done, it's not
> particularly hard, but it will require a patch or you can just do it
> in your application. I personally don't think the TVC should do it
> b/c there are other calculations/interpretations that one might do
> beyond/besides what you propose, so I'd rather just give back the
> raw data and let the user decide.
Re: tf*idf scoring
Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 3, 2009, at 5:54 AM, Markus Jelsma - Buyways B.V. wrote:
>
>
> I see, but why not return the true values of Lucene?
I'm not sure what you mean by this. The TVC returns the term
frequency and the document frequency and TF/DF as reported by
Lucene. The actual raw values. What you are asking for is for the
TVC to return some other normalized values above and beyond the
literal interpretation of TF/IDF. This can be done, it's not
particularly hard, but it will require a patch or you can just do it
in your application. I personally don't think the TVC should do it
b/c there are other calculations/interpretations that one might do
beyond/besides what you propose, so I'd rather just give back the raw
data and let the user decide.
Re: tf*idf scoring
Posted by "Markus Jelsma - Buyways B.V." <ma...@buyways.nl>.
> >
> >
> > According to different algorithms, the tf for term c would be 3 / 1 =
> > 0.33 instead of 1 returned by Solr.
>
> I don't follow. The TF (term frequency) is the number of times the
> term c occurs in that particular document, i.e. 1 time.
I see that above, and below, I made some typos. I wrote 3 / 1 = 0.33
instead of 1 / 3 = 0.33. Term c has an occurrence count of 1, which the
other algorithms normalize by dividing by the number of terms. So instead
of tf = #occurrences (1), other algorithms do tf = #occurrences / #terms
(0.33).
>
> > Also, the tf*idf value i get is 0.5
> > for term c and i get 0.333 for term a. It looks like tf*idf is
> > quotient
> > of document frequency and term frequency.
>
> Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.
Indeed, but most algorithms I have seen on this topic calculate idf as
ln(#docs / df); this is also true for Lucene, as I read at
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.html
idf(t) = 1 + log(numDocs / (df + 1))
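Numerically, that formula gives (a quick sketch; numDocs = 6 and df = 3 for term a, as in the corpus above):

```python
import math

# Lucene DefaultSimilarity idf, per the Similarity javadoc linked above:
# idf(t) = 1 + ln(numDocs / (df + 1))
def lucene_idf(num_docs, df):
    return 1 + math.log(num_docs / (df + 1))

print(lucene_idf(6, 3))  # 1 + ln(6/4), about 1.405
```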
>
> >
> > If i calculate tf*idf, for term c in the first document, according to
> > other algorithms it would be:
> >
> > tf = 3 / 1 = 0.333
>
> 3/1 = 3, no? I don't see where in your docs above you could even get
> a 3 for the letter c.
Here's the other typo: I again wrote 3 / 1 = 0.33 when it should've been
1 / 3 = 0.33, of course. The differences I see are:
tf (solr) = #occurrences_of_term_T in document_D
tf (other) = #occurrences_of_term_T in document_D / #terms_in_document_D
df (solr) = #documents containing term_T
df (other) = #documents containing term_T
idf (solr) = 1 / df
idf (other) = ln(#documents / df)
tf*idf (solr) = tf / df
tf*idf (other) = tf * idf
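Spelled out in code, the two conventions give the two numbers in question for term c in the first document (a sketch; natural log assumed for the "other" idf):

```python
import math

docs = [s.split() for s in
        ["a b c", "d e f", "x y z", "a d x", "a e z", "c f z"]]
term, doc = "c", docs[0]

occurrences = doc.count(term)          # 1
df = sum(term in d for d in docs)      # 2 documents contain c

# Solr's TermVectorComponent: raw counts, tf-idf = tf / df
tf_solr = occurrences
tfidf_solr = tf_solr / df              # 1 / 2 = 0.5

# The other convention: length-normalized tf, logarithmic idf
tf_other = occurrences / len(doc)      # 1 / 3
idf_other = math.log(len(docs) / df)   # ln(6 / 2), about 1.0986
tfidf_other = tf_other * idf_other     # about 0.366
```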
>
> > idf = ln(6 / 3) = 1.0986
> > tf*idf = 0.333 * 1.0986 = 0.3658
> >
>
> I think the formulas you are looking at are doing operations to
> normalize the values, whereas the Solr/Lucene stuff above is telling
> you their raw values. Note, Lucene/Solr does length normalization,
> etc. too, it just isn't encoded into the TF or DF. For more on
> Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html
>
I see, but why not return the true values of Lucene? I did not
reconfigure Solr's schema to use another similarity algorithm, and the
Lucene similarity docs above state that DefaultSimilarity uses
calculations similar to the ones I have.
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
>
Re: tf*idf scoring
Posted by Grant Ingersoll <gs...@apache.org>.
Inline below
On Nov 3, 2009, at 2:30 AM, Markus Jelsma - Buyways B.V. wrote:
> Hello list,
>
>
> I have a question about Lucene's calculation of tf*idf value. I first
> noticed that Solr's tf does not compare to tf values based on
> calculation elsewhere such as
> http://odin.himinbi.org/idf_to_item:item/comparing_tf%3Aidf_to_item%3Aitem_similarity.xhtml or http://en.wikipedia.org/wiki/Tf%E2%80%93idf
>
> The tf values returned by Solr are always integers and not normalized
> against the length of the corpus whilst the field in which it resides
> does not have omitNorms="true".
>
> Consider the following documents where the field subject is of the
> standard text_ws type:
>
> <result name="response" numFound="6" start="0">
> <doc>
> <str name="subject">a b c</str>
> </doc>
> <doc>
> <str name="subject">d e f</str>
> </doc>
> <doc>
> <str name="subject">x y z</str>
> </doc>
> <doc>
> <str name="subject">a d x</str>
> </doc>
> <doc>
> <str name="subject">a e z</str>
> </doc>
> <doc>
> <str name="subject">c f z</str>
> </doc>
> </result>
>
> Now, Solr's TermVector results for the first document:
>
> <lst name="doc-0">
> <str name="uniqueKey">0</str>
> <lst name="subject">
> <lst name="a">
> <int name="tf">1</int>
> <lst name="positions">
> <int name="position">0</int>
> </lst>
> <int name="df">3</int>
> <double name="tf-idf">0.3333333333333333</double>
> </lst>
> <lst name="b">
> <int name="tf">1</int>
> <lst name="positions">
> <int name="position">1</int>
> </lst>
> <int name="df">1</int>
> <double name="tf-idf">1.0</double>
> </lst>
> <lst name="c">
> <int name="tf">1</int>
> <lst name="positions">
> <int name="position">2</int>
> </lst>
> <int name="df">2</int>
> <double name="tf-idf">0.5</double>
> </lst>
> </lst>
> </lst>
>
>
> According to different algorithms, the tf for term c would be 3 / 1 =
> 0.33 instead of 1 returned by Solr.
I don't follow. The TF (term frequency) is the number of times the
term c occurs in that particular document, i.e. 1 time.
> Also, the tf*idf value i get is 0.5
> for term c and i get 0.333 for term a. It looks like tf*idf is
> quotient
> of document frequency and term frequency.
Yes, indeed. IDF == Inverse Document Frequency, in other words, 1/DF.
>
> If i calculate tf*idf, for term c in the first document, according to
> other algorithms it would be:
>
> tf = 3 / 1 = 0.333
3/1 = 3, no? I don't see where in your docs above you could even get
a 3 for the letter c.
> idf = ln(6 / 3) = 1.0986
> tf*idf = 0.333 * 1.0986 = 0.3658
>
I think the formulas you are looking at are doing operations to
normalize the values, whereas the Solr/Lucene stuff above is telling
you their raw values. Note, Lucene/Solr does length normalization,
etc. too, it just isn't encoded into the TF or DF. For more on
Lucene's scoring, see http://lucene.apache.org/java/2_9_0/scoring.html
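As a rough sketch of what that scoring page describes, DefaultSimilarity's building blocks look like this (the full formula also folds in coord(), queryNorm() and boosts; the function names here are illustrative, not Solr API):

```python
import math

# Rough sketch of Lucene 2.9 DefaultSimilarity building blocks,
# per the scoring documentation linked above.
def tf(freq):
    return math.sqrt(freq)            # tf(t in d) = sqrt(freq)

def idf(num_docs, doc_freq):
    return 1 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # encoded into the index-time norm; this is why omitNorms matters
    return 1 / math.sqrt(num_terms)

# A single term's contribution to a document's score is proportional to
# tf * idf^2 * norm (idf enters via both the query and field weight):
contribution = tf(1) * idf(6, 3) ** 2 * length_norm(3)
```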
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search