Posted to java-user@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/09/11 12:52:29 UTC
Lucene's Ranking Function
The FAQ gives the ranking function as
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
* coord_q_d
1. I think the new document boost is missing, isn't it?
With that it should be something like
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
* coord_q_d * boost_d
Is that correct?
2. If I would like the score to be independent of the number of terms in the
document (regarding them as essentially constant), is it enough to leave out
the norm_d_t factor?
I have seen that a norm factor between 0 and 255 is read with
IndexReader.norms() in TermScorer.score(). Is that the one?
From what I further understand (and from digging in Witten/Moffat/Bell) the
norm_q factor is not calculated, since it stays the same for one query.
Just make some checkmarks, please :-)
Clemens
--------------------------------------
http://www.cmarschner.net
Re: Lucene's Ranking Function
Posted by Doug Cutting <cu...@lucene.com>.
Clemens Marschner wrote:
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
> * coord_q_d
>
> One last thing I wondered about: Is idf_t really going into that equation
> twice?
Yes. I think that's normal with tf/idf vector-space ranking methods.
> From what I see, idf_t/norm_q is completely left out, isn't it?
No. It is computed once at the beginning of query processing. See, for
example, TermQuery.sumOfSquaredWeights() and TermQuery.normalize(). The
former is called by the search code to compute norm_q and the latter is
passed norm_q once it has been computed so that the clause's scores may
be normalized.
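
As a rough illustration of the two-step process described above, the following Python sketch first sums the squared weights of all query terms and then scales each term weight by the resulting norm. The function names, the toy idf and boost values, and the choice of a Euclidean norm over idf_t * boost_t are assumptions made for illustration; this is not Lucene's actual code.

```python
import math

def sum_of_squared_weights(terms):
    # Step 1: each clause contributes the square of its raw weight (idf * boost).
    return sum((t["idf"] * t["boost"]) ** 2 for t in terms)

def normalize(terms, norm_q):
    # Step 2: once norm_q is known, every clause weight is divided by it.
    return {t["text"]: t["idf"] * t["boost"] / norm_q for t in terms}

# Hypothetical two-term query with made-up idf values.
query = [
    {"text": "lucene", "idf": 3.0, "boost": 1.0},
    {"text": "ranking", "idf": 4.0, "boost": 1.0},
]

norm_q = math.sqrt(sum_of_squared_weights(query))
weights = normalize(query, norm_q)
print(norm_q)   # 5.0
print(weights)
```

Both steps run once per query, before any document is scored, so the per-document work only has to multiply by the already normalized weights.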
> tf_q is applied although it is never calculated - if a term occurs
> twice in the query (very unlikely, though) the whole sum is calculated
> twice. And for each term, the equation tf_d * idf_t / norm_d_t * boost_d *
> boost_f * boost_t is calculated.
You're right, tf_q is not in fact calculated.
Hope this helps.
Doug
--
To unsubscribe, e-mail: <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>
Re: Lucene's Ranking Function
Posted by Clemens Marschner <cm...@lanlab.de>.
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
* coord_q_d
One last thing I wondered about: Is idf_t really going into that equation
twice?
From what I see, idf_t/norm_q is completely left out, isn't it?
tf_q is applied although it is never calculated - if a term occurs
twice in the query (very unlikely, though) the whole sum is calculated
twice. And for each term, the equation tf_d * idf_t / norm_d_t * boost_d *
boost_f * boost_t is calculated.
Doug?
--Clemens
Re: Lucene's Ranking Function
Posted by Clemens Marschner <cm...@lanlab.de>.
>I have seen that a norm factor between 0 and 255 is read with
>IndexReader.norms() in TermScorer.score().
I've seen now that this is an 8-bit float.
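
To make the idea of an 8-bit float concrete, here is a small Python sketch that packs a positive 32-bit float into one byte and unpacks it again. The thread does not spell out the bit layout, so the split used here (3 mantissa bits, 5 exponent bits, zero-exponent point 15) and the names encode_norm and decode_norm are assumptions chosen for illustration, not a statement about what Lucene does.

```python
import struct

EXP_OFFSET = 63 - 15  # shifts the float32 exponent range into 5 bits (assumed layout)

def encode_norm(f):
    """Pack a float into one byte, keeping 3 mantissa bits and 5 exponent bits."""
    if f <= 0.0:
        return 0
    bits = struct.unpack("<I", struct.pack("<f", f))[0]
    small = bits >> 21                 # keep sign, exponent and top 3 mantissa bits
    low = EXP_OFFSET << 3
    if small <= low:
        return 1                       # underflow: smallest positive code
    if small >= low + 0x100:
        return 255                     # overflow: largest code
    return small - low

def decode_norm(b):
    """Expand a byte produced by encode_norm back into a float."""
    if b == 0:
        return 0.0
    bits = (b << 21) + (EXP_OFFSET << 24)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for value in (1.0, 0.5, 0.3):
    code = encode_norm(value)
    print(value, code, decode_norm(code))
```

With this layout, 1.0 and 0.5 survive the round trip exactly, while 0.3 is truncated to 0.25, which shows how coarse a one-byte norm is.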
Re: Lucene's Ranking Function
Posted by Doug Cutting <cu...@lucene.com>.
Clemens Marschner wrote:
> 1. I think the new document boost is missing, isn't it?
> With that it should be something like
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t)
> * coord_q_d * boost_d
> Is that correct?
Almost. This should actually be boost_d * boost_d_t, the boost factor
for the document multiplied by the boost for t's field in d.
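
The corrected formula can be written out as a short Python sketch. It assumes that boost_d_t, being specific to a term's field, belongs inside the sum while boost_d multiplies the whole sum; that placement, the dictionary layout, and all numbers are illustrative assumptions rather than Lucene's implementation.

```python
def score(doc, query_terms, norm_q, coord_q_d):
    """score_d = sum_t(query part * document part * boosts) * coord_q_d * boost_d."""
    total = 0.0
    for t in query_terms:
        stats = doc["terms"].get(t["text"])
        if stats is None:
            continue  # term does not occur in the document
        query_part = t["tf_q"] * t["idf"] / norm_q
        doc_part = stats["tf_d"] * t["idf"] / stats["norm_d_t"]
        total += query_part * doc_part * t["boost"] * stats["boost_d_t"]
    return total * coord_q_d * doc["boost_d"]

# Hypothetical query: two terms, only one of which occurs in the document.
query_terms = [
    {"text": "lucene", "tf_q": 1.0, "idf": 2.0, "boost": 1.0},
    {"text": "ranking", "tf_q": 1.0, "idf": 2.0, "boost": 1.0},
]
doc = {
    "boost_d": 1.5,
    "terms": {"lucene": {"tf_d": 3.0, "norm_d_t": 4.0, "boost_d_t": 2.0}},
}

# coord_q_d = 0.5 because one of the two query terms matched.
print(score(doc, query_terms, norm_q=2.0, coord_q_d=0.5))  # 2.25
```

Note that idf enters twice, once in the query part and once in the document part, which matches the formula quoted above.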
> 2. If I would like the score to be independent of the number of terms in the
> document (regarding them as essentially constant), is it enough to leave out
> the norm_d_t factor?
Yes. Note however that the quantity called 'norm' in the code is now
frequently actually norm_d_t * boost_t * boost_d_t. This quantity is
now computed at index time and stored in the norms file.
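
The effect of dropping the length normalization can be seen in a small Python sketch. It assumes that norm_d_t is the square root of the number of terms in the field and that the stored per-field factor combines this with the boosts as boost_d * boost_d_t / norm_d_t; both are illustrative assumptions, as are the function names.

```python
import math

def stored_factor(num_terms, boost_d=1.0, boost_d_t=1.0, use_length_norm=True):
    # Computed once at index time and stored per document field (assumed form).
    norm_d_t = math.sqrt(num_terms) if use_length_norm else 1.0
    return boost_d * boost_d_t / norm_d_t

def term_score(tf_d, idf, factor):
    # Search-time contribution of one term, using the precomputed factor.
    return tf_d * idf * factor

# Same term frequency in a 4-term field and a 16-term field.
short_doc = stored_factor(num_terms=4)
long_doc = stored_factor(num_terms=16)
print(term_score(2.0, 3.0, short_doc), term_score(2.0, 3.0, long_doc))  # 3.0 1.5

# Without length normalization both documents score the same.
flat_short = stored_factor(num_terms=4, use_length_norm=False)
flat_long = stored_factor(num_terms=16, use_length_norm=False)
print(term_score(2.0, 3.0, flat_short) == term_score(2.0, 3.0, flat_long))  # True
```

Because the boosts live in the same stored factor, removing the length norm in this model means changing what is written at index time, not just skipping a division at search time.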
> I have seen that a norm factor between 0 and 255 is read with
> IndexReader.norms() in TermScorer.score(). Is that the one?
Yes, although see my note above.
> From what I further understand (and from digging in Witten/Moffat/Bell) the
> norm_q factor is not calculated, since it stays the same for one query.
Lucene calculates it anyway. It's cheap to compute: it is multiplied
together with the term boost and idf once per query term, then this
weight is used in subsequent computations.
Doug