You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Chong-Ki Tsang <ah...@u.washington.edu> on 2003/07/18 07:25:06 UTC
Lucene's scoring algorithm
I am curious to know if the Lucene's scoring algorithm was updated in
the latest 1.3 version.
I find the following scoring algorithm in the Similarity class of JAVA
API documents. This method is different from the one shown in official
FAQ. Could you tell me which one is being used in 1.3? If the algorithm
was updated, please send me the formula. I will appreciate that.
Thanks,
Chong-Ki
The score of query q for document d is defined in terms of these methods
as follows:
score(q,d) =
Σ
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi
larity.html#tf(int)> tf(t in d) *
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi
larity.html#idf(org.apache.lucene.index.Term,
org.apache.lucene.search.Searcher)> idf(t) *
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Fi
eld.html#getBoost()> getBoost(t.field in d) *
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi
larity.html#lengthNorm(java.lang.String, int)> lengthNorm(t.field in d)
*
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi
larity.html#coord(int, int)> coord(q,d) *
<http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simi
larity.html#queryNorm(float)> queryNorm(q)
t in q
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Simil
arity.html
For the official FAQ, Lucene's scoring algorithm is shown as,
31. How does Lucene assigns scores to hits ?
Here is a quote from Doug himself (posted on July 2001 to the Lucene
users mailing list):
For the record, Lucene's scoring algorithm is, roughly:
score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
where:
score_d : score for document d
sum_t : sum for all terms t
tf_q : the square root of the frequency of t in the query
tf_d : the square root of the frequency of t in d
idf_t : log(numDocs/docFreq_t+1) + 1.0
numDocs : number of documents in index
docFreq_t : number of documents containing t
norm_q : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t : square root of number of tokens in d in the same field as
t
(I hope that's right!)
[Doug later added...]
Make that:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where
boost_t : the user-specified boost for term t
coord_q_d : number of terms in both query and document / number of
terms in query
The coordination factor gives an AND-like boost to documents that
contain,
e.g., all three terms in a three word query over those that contain just
two
of the words.
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.se
arch
<http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.s
earch&toc=faq#q31> &toc=faq#q31