Posted to java-user@lucene.apache.org by Chong-Ki Tsang <ah...@u.washington.edu> on 2003/07/18 07:25:06 UTC

Lucene's scoring algorithm

I am curious to know whether Lucene's scoring algorithm was updated in
the latest version, 1.3.

I found the following scoring formula in the Javadoc for the Similarity
class. It differs from the one shown in the official FAQ. Could you tell
me which one is used in 1.3? If the algorithm was updated, please send
me the current formula. I would appreciate it.

 

Thanks,

Chong-Ki

 

The score of query q for document d is defined in terms of these methods
as follows: 


  score(q,d) = sum_{t in q}( tf(t in d) * idf(t) * getBoost(t.field in d)
                             * lengthNorm(t.field in d) )
               * coord(q,d) * queryNorm(q)

Each factor links to its Javadoc entry:

  tf(t in d)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#tf(int)>
  idf(t)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#idf(org.apache.lucene.index.Term, org.apache.lucene.search.Searcher)>
  getBoost(t.field in d)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#getBoost()>
  lengthNorm(t.field in d)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)>
  coord(q,d)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int, int)>
  queryNorm(q)
    <http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#queryNorm(float)>

 

 

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

 

 

In the official FAQ, Lucene's scoring algorithm is given as:

 

31. How does Lucene assign scores to hits?

Here is a quote from Doug himself (posted in July 2001 to the Lucene
users mailing list):

 

For the record, Lucene's scoring algorithm is, roughly:

 

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

 

where:

  score_d   : score for document d

  sum_t     : sum for all terms t

  tf_q      : the square root of the frequency of t in the query

  tf_d      : the square root of the frequency of t in d

  idf_t     : log(numDocs/(docFreq_t+1)) + 1.0

  numDocs   : number of documents in index

  docFreq_t : number of documents containing t

  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))

  norm_d_t  : square root of number of tokens in d in the same field as t

 

(I hope that's right!)
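
To make the quoted formula concrete, here is a minimal Java sketch of it.
It is not Lucene code: the class FaqScoreSketch and all of its method
names are made up for illustration, terms are identified by array index,
and idf_t is read as log(numDocs / (docFreq_t + 1)) + 1.0, which is an
assumption about the intended precedence.

```java
// Hypothetical sketch of the FAQ formula; none of these names are Lucene APIs.
final class FaqScoreSketch {

    // tf_q and tf_d: square root of the raw term frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf_t, assuming the reading log(numDocs / (docFreq_t + 1)) + 1.0.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // norm_d_t: square root of the number of tokens in the term's field.
    static double fieldNorm(int numTokens) {
        return Math.sqrt(numTokens);
    }

    // norm_q: sqrt(sum_t((tf_q * idf_t)^2)).
    static double queryNorm(int[] queryFreqs, int[] docFreqs, int numDocs) {
        double sumOfSquares = 0.0;
        for (int t = 0; t < queryFreqs.length; t++) {
            double w = tf(queryFreqs[t]) * idf(numDocs, docFreqs[t]);
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    // score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t).
    // All arrays are indexed by query term; fieldLengths must be positive.
    static double score(int[] queryFreqs, int[] docTermFreqs, int[] docFreqs,
                        int[] fieldLengths, int numDocs) {
        double normQ = queryNorm(queryFreqs, docFreqs, numDocs);
        double sum = 0.0;
        for (int t = 0; t < queryFreqs.length; t++) {
            double idfT = idf(numDocs, docFreqs[t]);
            double queryWeight = tf(queryFreqs[t]) * idfT / normQ;
            double docWeight = tf(docTermFreqs[t]) * idfT / fieldNorm(fieldLengths[t]);
            sum += queryWeight * docWeight; // a term absent from d has tf_d = 0
        }
        return sum;
    }

    public static void main(String[] args) {
        // One query term occurring 4 times in a 16-token field,
        // present in 9 of the 10 documents in the index.
        double s = score(new int[] {1}, new int[] {4}, new int[] {9},
                         new int[] {16}, 10);
        System.out.println(s);
    }
}
```

A term that does not occur in d contributes nothing to the sum, because
its tf_d is zero.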

 

[Doug later added...]

 

Make that:

  

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

 

where

 

  boost_t    : the user-specified boost for term t

  coord_q_d  : number of terms in both query and document / number of terms in query

 

The coordination factor gives an AND-like boost to documents that
contain, e.g., all three terms in a three word query over those that
contain just two of the words.
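
The effect of boost_t and coord_q_d can be sketched in a few lines of
Java. Again, this is illustrative only: CoordSketch and its methods are
hypothetical names, and termScores[t] is assumed to already hold the
per-term product tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t, with 0
for a query term that does not occur in the document.

```java
// Hypothetical sketch of the revised FAQ formula; not Lucene code.
final class CoordSketch {

    // coord_q_d: terms in both query and document / terms in query.
    static double coord(int matchingTerms, int queryTerms) {
        return (double) matchingTerms / queryTerms;
    }

    // score_d = sum_t(termScore_t * boost_t) * coord_q_d.
    // termScores and boosts are indexed by query term.
    static double score(double[] termScores, double[] boosts) {
        double sum = 0.0;
        int matching = 0;
        for (int t = 0; t < termScores.length; t++) {
            if (termScores[t] > 0.0) {
                matching++;
                sum += termScores[t] * boosts[t];
            }
        }
        return sum * coord(matching, termScores.length);
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 1.0, 1.0};
        // Three-word query: one document contains all three words,
        // the other contains only two of them.
        System.out.println(score(new double[] {1.0, 1.0, 1.0}, boosts));
        System.out.println(score(new double[] {1.0, 1.0, 0.0}, boosts));
    }
}
```

With equal per-term scores and boosts of 1.0, the document matching all
three words scores 3.0 and the one matching two scores 2 * 2/3, so the
gap widens from 1.5x without the coordination factor to 2.25x with it.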

 

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31