Posted to dev@lucene.apache.org by Karl Koch <Th...@gmx.net> on 2006/09/10 00:35:36 UTC
Lucene 1.2 - scoring formula needed
Hi,
I am looking for a mathematically correct IR scoring formula for Lucene 1.2. The description in the book (Lucene in Action, 2005 edition) is rather non-mathematical, and I am not sure whether it applies to Lucene 1.2 or only to later versions.
Perhaps Eric or Otis can comment on this directly? Is there a published paper that describes the Lucene scoring algorithm and its formula in depth?
Best Regards,
Karl
Re: Lucene 1.2 - scoring formula needed
Posted by Joaquin Delgado <jo...@oracle.com>.
What do you mean by mathematically correct? Is there something incorrect
in the book?
A message posted some time ago at
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/%3C000501c34ced$1f3b5c90$0500a8c0@ki%3E
is where people first noticed a change in the scoring algorithm. According
to it, the official FAQ (for 1.2) gave the following formula, posted by
Doug himself:
score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
where
* score (q,d) : score for document d given query q
* sum_t : sum for all terms t in q
* tf_q : the square root of the frequency of t in q
* tf_d : the square root of the frequency of t in d
* idf_t : log(numDocs/(docFreq_t+1)) + 1.0
* numDocs : number of documents in index
* docFreq_t : number of documents containing t
* norm_q : sqrt(sum_t((tf_q*idf_t)^2))
* norm_d_t : square root of number of tokens in d in the same field
as t
* boost_t : the user-specified boost for term t
* coord_q_d : number of terms in both query and document / number of
terms in query. The coordination factor gives an AND-like boost to
documents that contain, e.g., all three terms in a three-word
query over those that contain just two of the words.
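As a rough illustration only, the formula above can be sketched in a few
lines of Python. This is not Lucene code: the function and parameter names
are made up for the example, it assumes a single field (so norm_d_t is just
the square root of the document's token count), and it reads idf_t as
log(numDocs/(docFreq_t+1)) + 1.0, which is my assumption about the intended
parenthesization.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    # idf_t, read as log(numDocs / (docFreq_t + 1)) + 1.0 (assumed grouping)
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Hypothetical sketch of score(q,d) for one field; not Lucene's API."""
    boosts = boosts or {}
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    if not q_counts or not doc_terms:
        return 0.0

    # Query-side weights tf_q * idf_t, with tf_q = sqrt(frequency of t in q)
    q_weights = {
        t: math.sqrt(c) * idf(num_docs, doc_freqs.get(t, 0))
        for t, c in q_counts.items()
    }
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    # norm_d_t = sqrt(number of tokens in d), single field assumed
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    matched = 0
    for t, w_q in q_weights.items():
        tf_d = math.sqrt(d_counts[t])  # sqrt of the frequency of t in d
        if tf_d == 0.0:
            continue  # term absent from the document contributes nothing
        matched += 1
        idf_t = idf(num_docs, doc_freqs.get(t, 0))
        total += (w_q / norm_q) * (tf_d * idf_t / norm_d) * boosts.get(t, 1.0)

    # coord_q_d = terms in both query and document / terms in query
    coord = matched / len(q_counts)
    return total * coord
```

With a two-term query, a document containing both terms gets coord = 1 while
a document containing only one of them gets coord = 0.5, which is the
AND-like boost described above.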
This is different from the current scoring algorithm described at
http://lucene.apache.org/java/docs/scoring.html#Scoring which includes
field boosting, document length normalization, etc.
In any case, these are variations of the TF-IDF weighted vector space
"cosine of the angle" between the document and the query vectors (also
known as cosine similarity or normalized dot product - see
http://en.wikipedia.org/wiki/Dot_product). This computation treats
documents and queries as vectors in an N-dimensional space (N is the
number of unique terms excluding stopwords).
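To make the "cosine of the angle" concrete, here is a minimal sketch that
treats a query and a document as sparse vectors of term weights. The
function name and the dict representation are my own choices for the
example, and the weights are assumed to be already computed (for instance
as TF-IDF values).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    # Dot product over the terms the two vectors share
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty/zero vectors
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction give a value close to 1 regardless
of their length, and vectors with no terms in common give 0.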
In statistical/probabilistic terms this can also be read as a
geometric interpretation of the correlation between samples drawn from
two random variables Q and D (representing a query and a document; see
http://en.wikipedia.org/wiki/Correlation), where each data point
(TF-IDF weight) is an estimate of how much "information" each term
conveys. There are more complex probabilistic ranking algorithms that
take advantage of prior knowledge of relevance (pre-ranked documents,
for example) in their computation, primarily by exploiting Bayes' theorem.
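The link between correlation and the cosine can be shown with a small
sketch: the Pearson correlation of two samples equals the cosine of the
angle between the samples after subtracting each sample's mean. The helper
names below are illustrative only, and the inputs are assumed to be
equal-length lists of weights with at least some variation in each.

```python
import math


def cosine(x, y):
    # Normalized dot product of two equal-length dense vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def pearson(x, y):
    # Pearson correlation as the cosine of the mean-centered vectors
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])
```

The uncentered cosine used for ranking and the centered version differ only
in that mean subtraction, which is why the two views are closely related
but not identical.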
Both the Vector Space Model and the Probabilistic Model are well studied in
the Information Retrieval literature. See
http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt
for an overview of Ranking and Feedback.
-- Joaquin Delgado