Posted to java-user@lucene.apache.org by Karl Koch <Th...@gmx.net> on 2006/12/12 11:23:41 UTC
Lucene scoring: Term frequency normalisation
Hi,
I have a question about the current Lucene scoring algorithm. In this scoring algorithm, the term frequency is calculated as the square root of a term's number of occurrences, as described in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_tf
Having read a number of IR papers and books, I am quite familiar with the use of log to normalise term frequency, in order to prevent very high term frequencies from having too great an effect on the score.
However, what exactly is the advantage of using square root instead of log? Is there any scientific reason behind this? Does anybody know a paper about this issue? Any source of empirical evidence that this works better than the log? Is there perhaps another discussion thread in here which I have not seen?
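For concreteness, the difference between the two damping functions is easy to see numerically. The sketch below is illustrative only, not Lucene source; the 1 + ln(tf) form is one common log-based variant, not necessarily the exact formulation any particular paper uses:

```java
// Illustrative only: compare how square-root and log-based damping
// grow with the raw term frequency tf.
public class TfDamping {
    static double sqrtTf(int tf) {
        return Math.sqrt(tf);            // Lucene's default tf damping
    }
    static double logTf(int tf) {
        return 1.0 + Math.log(tf);       // one common log-based variant
    }
    public static void main(String[] args) {
        for (int tf : new int[] {1, 4, 16, 64, 256}) {
            System.out.printf("tf=%3d  sqrt=%6.2f  1+ln=%5.2f%n",
                    tf, sqrtTf(tf), logTf(tf));
        }
    }
}
```

Going from tf = 1 to tf = 256, the square-root score grows by a factor of 16, while the log-based score grows by a factor of about 6.5, so log damps repeated terms considerably harder.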
Thank you in advance,
Karl
--
The GMX SmartSurfer helps you save up to 70% of your online costs!
Ideal for modem and ISDN: http://www.gmx.net/de/go/smartsurfer
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene scoring: Term frequency normalisation
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 12, 2006, at 2:23 AM, Karl Koch wrote:
> However, what exactly is the advantage of using square root instead
> of log?
Speaking anecdotally, I wouldn't say there's an advantage. There's a
predictable effect: very long documents are rewarded, since the
damping factor is not as strong. For most of the engines I've built,
that hasn't been desirable.
In order to get optimal results for large collections, it's often
necessary to customize this by overriding lengthNorm. IME, for
searching general content such as random html documents, the body
field needs a higher damping factor, but more importantly a plateau
at the top end to prevent very short documents from dominating the
results.
public float lengthNorm(String fieldName, int numTerms) {
    // Impose a plateau: treat any field shorter than 100 terms as if
    // it had 100, so very short documents can't dominate the results.
    numTerms = numTerms < 100 ? 100 : numTerms;
    return (float)(1.0 / Math.sqrt(numTerms));
}
In contrast, you don't want the plateau for title fields, assuming
that malignant keyword stuffing isn't an issue.
This stuff is corpus specific, though.
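To see what the plateau buys you, here's a self-contained sketch (illustrative names, not from Lucene) comparing the plain 1/sqrt(numTerms) norm against the floored version:

```java
// Sketch: without the floor, a 5-term field gets a far larger length
// norm than a 100-term one; with the floor, both score identically.
public class PlateauDemo {
    static float plainNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
    static float plateauNorm(int numTerms) {
        // Same floor as the lengthNorm override above.
        return plainNorm(Math.max(numTerms, 100));
    }
    public static void main(String[] args) {
        for (int n : new int[] {5, 50, 100, 1000}) {
            System.out.printf("terms=%4d  plain=%.3f  plateau=%.3f%n",
                    n, plainNorm(n), plateauNorm(n));
        }
    }
}
```

In a real index you'd put the override in a Similarity subclass and register it at both index time and search time, so the stored norms and the query-time scoring agree.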
http://www.mail-archive.com/java-user@lucene.apache.org/msg08496.html
> Is there any scientific reason behind this? Does anybody know a
> paper about this issue?
Here's one from 1997:
Lee, Chuang, and Seamons: "Document Ranking and the Vector Space Model"
http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf
> Is there perhaps another discussion thread in here which I have not
> seen.
http://www.mail-archive.com/java-dev@lucene.apache.org/msg04509.html
http://www.mail-archive.com/java-dev@lucene.apache.org/msg01704.html
Searching the mail archives for "lengthNorm" will turn up some more.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/