Posted to java-user@lucene.apache.org by Th...@gmx.net on 2006/12/09 14:24:10 UTC

Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Hello Lucene users,

In the past, I have asked a number of times about the scoring formula used in Lucene 1.2 (which might still be valid in current Lucene versions). Back then I was asking only out of curiosity, but now I need it in order to write proper documentation.

At that time, I found a high-level answer with the kind help of Joaquin Delgado, whose posting ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200609.mbox/%3C45043CC5.3080103@oracle.com%3E ) pointed me to this mailing list contribution ( http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E ).

According to these sources, the Lucene scoring formula in version 1.2 is:

score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * 
boost_t) * coord_q_d

where

    * score (q,d) : score for document d given query q
    * sum_t : sum for all terms t in q
    * tf_q : the square root of the frequency of t in q
    * tf_d : the square root of the frequency of t in d
    * idf_t : log(numDocs/(docFreq_t+1)) + 1.0
    * numDocs : number of documents in index
    * docFreq_t : number of documents containing t
    * norm_q : sqrt(sum_t((tf_q*idf_t)^2))
    * norm_d_t : square root of number of tokens in d in the same field
      as t
    * boost_t : the user-specified boost for term t
    * coord_q_d : number of terms in both query and document / number of
      terms in query. The coordination factor gives an AND-like boost to
      documents that contain, e.g., all three terms in a three-word
      query over those that contain just two of the words.
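
To make sure I read the definitions correctly, here is a small self-contained Java sketch of the formula (a plain transcription of the list above, not Lucene's actual code; all names are my own):

public class Lucene12ScoringSketch {

    // tf_q and tf_d: square root of the raw term frequency
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    static double normQ(int[] queryTf, double[] idf) {
        double sumOfSquares = 0.0;
        for (int t = 0; t < queryTf.length; t++) {
            double w = tf(queryTf[t]) * idf[t];
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    // score(q,d); the arrays hold one entry per query term t:
    //   queryTf[t]  - frequency of t in q
    //   docTf[t]    - frequency of t in d (0 if t is absent from d)
    //   idf[t]      - idf_t as above
    //   boost[t]    - user-specified boost for t
    //   fieldLen[t] - number of tokens in d in the same field as t
    static double score(int[] queryTf, int[] docTf, double[] idf,
                        double[] boost, int[] fieldLen) {
        double nq = normQ(queryTf, idf);
        double sum = 0.0;
        int overlap = 0; // number of query terms that also occur in d
        for (int t = 0; t < queryTf.length; t++) {
            if (docTf[t] == 0) continue;
            overlap++;
            double normDt = Math.sqrt(fieldLen[t]); // norm_d_t
            sum += (tf(queryTf[t]) * idf[t] / nq)
                 * (tf(docTf[t]) * idf[t] / normDt)
                 * boost[t];
        }
        double coord = (double) overlap / queryTf.length; // coord_q_d
        return sum * coord;
    }
}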


This will now allow me to include the scoring formula as part of the documentation, which will be of great help. For verification, I have attached the formula as a picture generated from LaTeX. Please let me know if you find any mistake or if you think the formula could be simplified (I am not a mathematician...).

For an even deeper understanding, I would like to ask a few further questions. I am not an expert in Information Retrieval, so I hope my questions are not too basic to be embarrassing. I read the paper by Erica Chisholm and Tamara G. Kolda (http://citeseer.ist.psu.edu/rd/12896645%2C198082%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/8004/http:zSzzSzwww.ca.sandia.govzSz%7EtgkoldazSzpaperszSzornl-tm-13756.pdf/new-term-weighting-formulas.pdf) to get a better idea of what kinds of vector space scoring strategies exist, in order to compare Lucene's scoring a bit with the rest of the world. My aim is basically to understand the strategic decisions that were made in Lucene (version 1.2). I have three questions:

1) tf_q and tf_d, basically all the term frequencies (TF) in the formula, are square-rooted in order to dampen the bias from large term frequencies. Looking through a number of IR papers, it seems that the "normal" way of normalising TF is the logarithm. What is the motivation for choosing the square root instead? Is there a simple mathematical reason, or is there any empirical evidence that this is the better strategy? Are there any papers that argue for this decision (perhaps with empirical data or otherwise)?
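
For illustration, here is how the two damping functions compare on a few raw frequencies (my own numbers, using 1 + ln(tf) as a representative log variant, since papers differ on the exact form):

  tf      sqrt(tf)   1 + ln(tf)
  1         1.00        1.00
  10        3.16        3.30
  100      10.00        5.61
  1000     31.62        7.91

So the square root dampens large term frequencies considerably less aggressively than the logarithm does.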

2) In Lucene's scoring algorithm, the query part is normalised with norm_q, which is sqrt(sum_t((tf_q*idf_t)^2)). In standard IR literature, this is referred to as cosine normalisation. The SMART system used this normalisation strategy, however only for the documents, not for the query; queries were not normalised at all. The document terms in Lucene, on the other hand, are only normalised with norm_d_t, which is the square root of the number of tokens in d (which are also terms in my case) in the same field as t. On this I have two sub-questions:

2a) Why does Lucene normalise the query with cosine normalisation? In a range of different IR system variations (as shown in Chisholm and Kolda's paper in the table on page 8), queries were not normalised at all. Is there a good reason or perhaps any empirical evidence that would support this decision?

2b) What is the motivation for the normalisation imposed on the documents (norm_d_t), which I have not seen before in any other system? Again, does anybody have pointers to literature on this?

3) What is the motivation for the additional normalisation with coord_q_d, beyond what is already described above? Again, is there any literature that argues for this normalisation?

Answers to these questions would greatly help me to relate this scoring formula to other IR strategies. This would help me to appreciate the value of this great IR library even more.

Any answer or partial answer to any of the questions would be greatly appreciated!

Best Regards and thanks in advance!
Karl



Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Soeren Pekrul <so...@gmx.de>.
Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
> For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with 
> 
> norm_q : sqrt(sum_t((tf_q*idf_t)^2))
> 
> which is also called cosine normalisation. This is a technique that is rather involved and usually used for documents only(!) in all systems I have seen so far.

I hope I have understood 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm 
and your problem correctly: "queryNorm(q) is a normalizing factor used 
to make scores between queries comparable."

For "normal" searches you don’t need to compare queries. You have just 
to compare the documents of a single query. Queries in a "normal" search 
have usually a different semantic, so you can’t really compare the 
results of different queries.

If you use Lucene, for instance, for the classification of documents, it 
is necessary to compare the results of different queries. You have the 
documents to classify indexed on one side and the classes on the other 
side (thread "Store a document-like map" 
http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you 
can generate queries from the classes and search against the documents. 
The score of a matching document is the similarity of the document to 
the query built from the class. Now the queries have to be comparable.

You can transform a document into a query and a query into a document. 
That could be the reason for normalizing a query like a document.

> For the documents Lucene employs its norm_d_t which is explained as:
> 
> norm_d_t : square root of number of tokens in d in the same field as t
> 
> basically just the square root of the number of unique terms in the document (since I always search over all fields). I would have expected cosine normalisation here... 
> 
> The paper you provided uses document normalisation in the following way:
> 
> norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
> 
> I am not sure how this relates to norm_d_t.

"norm(t,d)   =   doc.getBoost()  •  lengthNorm(field)  •  ∏ f.getBoost()"
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

That formula seems to be independent of the document’s length, but the 
factor lengthNorm(field) uses the document’s length, or rather the 
field’s length: 
"Computes the normalization value for a field given the total number of 
terms contained in a field." 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm).

"Implemented as 1/sqrt(numTerms)" 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))
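
Putting the cited pieces together, the index-time norm could be sketched like this (my own transcription of the javadocs, not Lucene’s actual code; the names are mine):

// norm(t,d) = doc.getBoost() * lengthNorm(field) * product of field boosts,
// with lengthNorm(field) = 1/sqrt(numTerms) as in DefaultSimilarity.
static float norm(float docBoost, int numTermsInField, float[] fieldBoosts) {
    float lengthNorm = (float) (1.0 / Math.sqrt(numTermsInField));
    float n = docBoost * lengthNorm;
    for (float fieldBoost : fieldBoosts) {
        n *= fieldBoost; // one factor per field instance with the same name
    }
    return n;
}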

Sören



Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Doron Cohen <DO...@il.ibm.com>.
"Karl Koch" <Th...@gmx.net> wrote:

> For the documents Lucene employs
> its norm_d_t which is explained as:
>
> norm_d_t : square root of number of tokens in d in the same field as t

Actually (by default) it is:
   1 / sqrt(#tokens in d with same field as t)

> basically just the square root of the number of unique terms in the
> document (since I always search over all fields). I would have
> expected cosine normalisation here...
>
> The paper you provided uses document normalisation in the following way:
>
> norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
>
> I am not sure how this relates to norm_d_t.

That system is less "field oriented" than Lucene, so you could say the
normalization there goes over all the fields.
The {0.8,0.2} args are parametric and control how aggressive this
normalization is.
If you used {0,1} there, you would get
  1 / sqrt(#unique terms in d)
and that would be similar to Lucene's
  1 / sqrt(#tokens in d with same field as t)
however (in that system) that would have punished long documents too much
and boosted trivial short documents too much, and that's why the {0.8,0.2}
values were introduced there.
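
To see the effect numerically, assume for illustration an average document length of 200 unique terms (made-up numbers, just to show the shape of the two curves):

  #unique terms   1/sqrt(#unique)   1/sqrt(0.8*200 + 0.2*#unique)
       10              0.316                  0.079
       200             0.071                  0.071
       2000            0.022                  0.042

The parametric version boosts very short documents far less, and punishes very long documents far less, than the plain 1/sqrt.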





Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Karl Koch <Th...@gmx.net>.
Hello Doron (and all the others who read here):),

thank you for your effort and your time. I really appreciate it. :)

I understand why normalisation is done in general: mainly, to remove the bias towards oversized documents. In the literature I have read so far, much effort usually goes into document normalisation (simply because documents may differ greatly in size), and usually none into query normalisation (simply because queries usually contain only a few words anyway). 

For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with 

norm_q : sqrt(sum_t((tf_q*idf_t)^2)) 

which is also called cosine normalisation. This is a technique that is rather involved and usually used for documents only(!) in all systems I have seen so far. For the documents Lucene employs its norm_d_t which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document (since I always search over all fields). I would have expected cosine normalisation here... 

The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstand the Lucene formula, or am I misinterpreting something? I have enclosed the formula as a graphic (from LaTeX) so you can have a look, should this be the case...

I will post the other questions separately, since they actually also relate to the new Lucene scoring algorithm (they have not changed). Thank you for your time again :)

Karl


-------- Original Message --------
Date: Mon, 11 Dec 2006 22:41:56 -0800
From: Doron Cohen <DO...@il.ibm.com>
To: java-user@lucene.apache.org
Subject: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

> > Well, it doesn't, since there is no justification of why it is the
> > way it is. It's like saying: here is a car with 5 wheels... enjoy
> > driving.
> 
> > >  - I think the explanations there would also answer at least some of
> > > your questions.
> 
> I hoped it would answer *some* of the questions... (not all)
> 
> Let me try some more :-)
> 
> > 2a) Why does Lucene normalise with Cosine Normalisation for the
> > query? In a range of different IR system variations (as shown in
> > Chisholm and Kolda's paper in the table on page 8) queries were not
> > normalised at all. Is there a good reason or perhaps any empirical
> > evidence that would support this decision?
> 
> I think "why normalizing at all" is answered in
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
> ... "queryNorm(q) is a normalizing factor used to make scores between
> queries comparable. This factor does not affect document ranking (since
> all
> ranked documents are multiplied by the same factor), but rather just
> attempts to make scores from different queries (or even different indexes)
> comparable. This is a search time factor computed by the Similarity in
> effect at search time. The default computation in DefaultSimilarity is:
> queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"
> 
> I think there were discussions on this normalization. Anyhow I am not the
> expert to justify this.
> 
> > 2b) What is the motivation for the normalisation imposed on the
> > documents (norm_d_t) which I have not seen before in any other
> > system. Again, does anybody have pointers to literature on this?
> 
> If I understand what you are asking, the logic is that very long
> documents could virtually contain the entire collection, and therefore
> should be "punished" for being too long, otherwise they would have an
> "unfair advantage" over short documents. Google "search engine document
> length normalization" for more on this, or see also
> http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
> 
> > Any answer or partial answer on any of the questions would be
> > greatly appreciated!
> 
> I hope this (although partial) helps,
> Doron


Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Doron Cohen <DO...@il.ibm.com>.
> Well, it doesn't, since there is no justification of why it is the
> way it is. It's like saying: here is a car with 5 wheels... enjoy
> driving.

> >  - I think the explanations there would also answer at least some of
> > your questions.

I hoped it would answer *some* of the questions... (not all)

Let me try some more :-)

> 2a) Why does Lucene normalise with Cosine Normalisation for the
> query? In a range of different IR system variations (as shown in
> Chisholm and Kolda's paper in the table on page 8) queries were not
> normalised at all. Is there a good reason or perhaps any empirical
> evidence that would support this decision?

I think "why normalizing at all" is answered in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
... "queryNorm(q) is a normalizing factor used to make scores between
queries comparable. This factor does not affect document ranking (since all
ranked documents are multiplied by the same factor), but rather just
attempts to make scores from different queries (or even different indexes)
comparable. This is a search time factor computed by the Similarity in
effect at search time. The default computation in DefaultSimilarity is:
queryNorm(q) = 1/sqrt(sumOfSquaredWeights)"
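
A plain transcription of that javadoc as a sketch (my own code and names, not Lucene's):

// queryNorm(q) = 1 / sqrt(sumOfSquaredWeights), per the DefaultSimilarity javadoc.
// termWeights holds one weight (tf_q * idf_t) per query term.
static float queryNorm(float[] termWeights) {
    float sumOfSquares = 0f;
    for (float w : termWeights) {
        sumOfSquares += w * w;
    }
    return (float) (1.0 / Math.sqrt(sumOfSquares));
}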

I think there were discussions on this normalization. Anyhow, I am not the
expert to justify it.

> 2b) What is the motivation for the normalisation imposed on the
> documents (norm_d_t) which I have not seen before in any other
> system. Again, does anybody have pointers to literature on this?

If I understand what you are asking, the logic is that very long
documents could virtually contain the entire collection, and therefore
should be "punished" for being too long, otherwise they would have an
"unfair advantage" over short documents. Google "search engine document
length normalization" for more on this, or see also
http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf

> Any answer or partial answer on any of the questions would be
> greatly appreciated!

I hope this (although partial) helps,
Doron




Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Karl Koch <Th...@gmx.net>.
Well, it doesn't, since there is no justification of why it is the way it is. It's like saying: here is a car with 5 wheels... enjoy driving. 

Karl 
-------- Original Message --------
Date: Sun, 10 Dec 2006 13:12:29 -0800
From: Doron Cohen <DO...@il.ibm.com>
To: java-user@lucene.apache.org
Subject: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

> TheRanger@gmx.net wrote:
> > According to these sources, the Lucene scoring formula in version 1.2
> is:
> >
> > score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> > boost_t) * coord_q_d
> 
> Hi Karl,
> 
> A slightly more readable version of (the same) scoring formula is now in
> Lucene's Similarity jdocs -
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
>  - I think the explanations there would also answer at least some of your
> questions.




Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

Posted by Doron Cohen <DO...@il.ibm.com>.
TheRanger@gmx.net wrote:
> According to these sources, the Lucene scoring formula in version 1.2 is:
>
> score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
> boost_t) * coord_q_d

Hi Karl,

A slightly more readable version of (the same) scoring formula is now in
Lucene's Similarity jdocs -
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 - I think the explanations there would also answer at least some of your
questions.

