Posted to java-user@lucene.apache.org by Clemens Marschner <cm...@lanlab.de> on 2002/09/11 12:52:29 UTC

Lucene's Ranking Function

In the FAQ it reads

 score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d


1. I think the new document boost is missing, isn't it?
With that it should be something like

 score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d * boost_d
Is that correct?


2. If I would like the score to be independent of the number of terms in the
document (regarding them as essentially constant), is it enough to leave out
the norm_d_t factor?
I have seen that a norm factor between 0 and 255 is read with
IndexReader.norms() in TermScorer.score(). Is that the one?

From what I further understand (and from digging in Witten/Moffat/Bell) the
norm_q factor is not calculated, since it stays the same for one query.

Just make some checkmarks, please :-)


Clemens






--------------------------------------
http://www.cmarschner.net

Re: Lucene's Ranking Function

Posted by Doug Cutting <cu...@lucene.com>.

Clemens Marschner wrote:
>  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
> 
> One last thing I wondered about: Is idf_t really going into that equation
> twice?

Yes.  I think that's normal with tf/idf vector-space ranking methods.

> From what I see, idf_t/norm_q is completely left out, isn't it?

No.  It is computed once at the beginning of query processing.  See, for 
example, TermQuery.sumOfSquaredWeights() and TermQuery.normalize().  The 
former is called by the search code to compute norm_q and the latter is 
passed norm_q once it has been computed so that the clause's scores may 
be normalized.
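
For readers following along, the two-step flow described above can be sketched roughly as follows. This is not the Lucene source: the class QueryNormSketch, the Clause type and the normalizeAll helper are hypothetical names, and the sketch assumes that norm_q is the Euclidean length of the query's weight vector and that normalize() divides by it (the real code may pass the reciprocal instead).

```java
// Hypothetical sketch of per-query normalization; not the actual Lucene classes.
final class QueryNormSketch {

    /** One query clause (term) with its idf and query-time boost. */
    static final class Clause {
        final double idf;
        final double boost;
        double weight; // filled in by sumOfSquaredWeights(), rescaled by normalize()

        Clause(double idf, double boost) {
            this.idf = idf;
            this.boost = boost;
        }

        /** Computes the raw clause weight and returns its square. */
        double sumOfSquaredWeights() {
            weight = idf * boost;
            return weight * weight;
        }

        /** Rescales the clause weight once norm_q is known (assumed: division). */
        void normalize(double normQ) {
            weight /= normQ;
        }
    }

    /** Runs both passes once per query and returns the computed norm_q. */
    static double normalizeAll(Clause[] clauses) {
        double sum = 0.0;
        for (Clause c : clauses) {
            sum += c.sumOfSquaredWeights();
        }
        double normQ = Math.sqrt(sum);
        if (normQ == 0.0) {
            return 0.0; // nothing to normalize; avoids dividing by zero
        }
        for (Clause c : clauses) {
            c.normalize(normQ);
        }
        return normQ;
    }

    public static void main(String[] args) {
        Clause[] clauses = { new Clause(3.0, 1.0), new Clause(4.0, 1.0) };
        double normQ = normalizeAll(clauses);
        System.out.println("norm_q = " + normQ);
        for (Clause c : clauses) {
            System.out.println("normalized weight = " + c.weight);
        }
    }
}
```

The point of the sketch is only that both passes run once per query, not once per document, which is why the cost is negligible.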

> tf_q is applied although it is never calculated - if a term occurs
> twice in the query (very unlikely, though) the whole sum is calculated
> twice. And for each term, the equation tf_d * idf_t / norm_d_t * boost_d *
> boost_f * boost_t is calculated.

You're right, tf_q is not in fact calculated.

Hope this helps.

Doug

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Lucene's Ranking Function

Posted by Clemens Marschner <cm...@lanlab.de>.

 score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

One last thing I wondered about: Is idf_t really going into that equation
twice?
From what I see, idf_t/norm_q is completely left out, isn't it?

tf_q is applied although it is never calculated - if a term occurs
twice in the query (very unlikely, though) the whole sum is calculated
twice. And for each term, the equation tf_d * idf_t / norm_d_t * boost_d *
boost_f * boost_t is calculated.

Doug?


--Clemens




Re: Lucene's Ranking Function

Posted by Clemens Marschner <cm...@lanlab.de>.


>I have seen that a norm factor between 0 and 255 is read with
>IndexReader.norms() in TermScorer.score(). 

I've seen now that this is an 8-bit float. 
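
To illustrate what storing a norm as an 8-bit float can look like, here is a minimal sketch that packs a non-negative Java float into one byte using 5 exponent bits and 3 mantissa bits. The thread does not state the exact bit layout Lucene uses, so the class name ByteFloatSketch, the method names and all the constants below are assumptions chosen for illustration.

```java
// Hypothetical 8-bit float codec (5-bit exponent, 3-bit mantissa); layout is assumed.
final class ByteFloatSketch {

    /** Packs a non-negative float into one byte, truncating precision. */
    static byte floatToByte(float f) {
        if (f <= 0.0f) {
            return 0; // zero and negatives map to the zero byte
        }
        int bits = Float.floatToIntBits(f);
        int mantissa = (bits & 0xffffff) >> 21;            // top 3 of the low 24 bits
        int exponent = (((bits >> 24) & 0x7f) - 63) + 15;  // re-biased to 5 bits
        if (exponent > 31) {   // too large: clamp to the biggest representable value
            exponent = 31;
            mantissa = 7;
        }
        if (exponent < 0) {    // too small: clamp to the smallest positive value
            exponent = 0;
            mantissa = 1;
        }
        return (byte) ((exponent << 3) | mantissa);
    }

    /** Expands a byte produced by floatToByte back into a float. */
    static float byteToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int mantissa = b & 7;
        int exponent = (b >> 3) & 31;
        int bits = ((exponent + (63 - 15)) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] samples = { 0.0f, 0.5f, 0.8f, 1.0f };
        for (float f : samples) {
            byte b = floatToByte(f);
            System.out.println(f + " -> " + (b & 0xff) + " -> " + byteToFloat(b));
        }
    }
}
```

With only three mantissa bits the encoding is lossy, so two documents with slightly different lengths can end up with the same stored norm.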




Re: Lucene's Ranking Function

Posted by Doug Cutting <cu...@lucene.com>.

Clemens Marschner wrote:
> 1. I think the new document boost is missing, isn't it?
> With that it should be something like
> 
>  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d * boost_d
> Is that correct?

Almost.  This should actually be boost_d * boost_d_t, the boost factor 
for the document multiplied by the boost for t's field in d.
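
As a rough illustration, the formula with this correction can be written out as below. This is a sketch of the formula as quoted in the thread, not of Lucene's scorer code: ScoreSketch and TermStats are hypothetical names, and since boost_d_t depends on the term's field, the sketch assumes both boosts are applied per term inside the sum.

```java
// Hypothetical transcription of the ranking formula discussed in this thread.
final class ScoreSketch {

    /** Per-term inputs for one document; field names mirror the formula's symbols. */
    static final class TermStats {
        final double tfQ;     // term frequency in the query
        final double tfD;     // term frequency in the document
        final double idf;     // inverse document frequency of the term
        final double normDT;  // length norm of the term's field in the document
        final double boostT;  // query-time boost of the term
        final double boostDT; // boost of the term's field in the document

        TermStats(double tfQ, double tfD, double idf,
                  double normDT, double boostT, double boostDT) {
            this.tfQ = tfQ;
            this.tfD = tfD;
            this.idf = idf;
            this.normDT = normDT;
            this.boostT = boostT;
            this.boostDT = boostDT;
        }
    }

    /** Sums the per-term products, then applies the coordination factor. */
    static double score(TermStats[] terms, double normQ, double coordQD, double boostD) {
        double sum = 0.0;
        for (TermStats t : terms) {
            double queryPart = t.tfQ * t.idf / normQ;   // tf_q * idf_t / norm_q
            double docPart = t.tfD * t.idf / t.normDT;  // tf_d * idf_t / norm_d_t
            // Assumed placement: document and field boosts inside the sum.
            sum += queryPart * docPart * t.boostT * boostD * t.boostDT;
        }
        return sum * coordQD;
    }

    public static void main(String[] args) {
        TermStats[] terms = { new TermStats(1.0, 4.0, 2.0, 2.0, 1.0, 1.0) };
        System.out.println(score(terms, 2.0, 0.5, 1.0)); // prints 2.0
    }
}
```

Note that idf appears in both queryPart and docPart, which is the "idf_t twice" point raised elsewhere in this thread.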

> 2. If I would like the score to be independent of the number of terms in the
> document (regarding them as essentially constant), is it enough to leave out
> the norm_d_t factor?

Yes.  Note however that the quantity called 'norm' in the code is now 
frequently actually norm_d_t * boost_t * boost_d_t.  This quantity is 
now computed at index time and stored in the norms file.

> I have seen that a norm factor between 0 and 255 is read with
> IndexReader.norms() in TermScorer.score(). Is that the one?

Yes, although see my note above.

> From what I further understand (and from digging in Witten/Moffat/Bell) the
> norm_q factor is not calculated, since it stays the same for one query.

Lucene calculates it anyway.  It's cheap to compute: it is multiplied 
together with the term boost and idf once per query term, then this 
weight is used in subsequent computations.

Doug
