Posted to java-user@lucene.apache.org by Bob Carpenter <ca...@alias-i.com> on 2006/11/21 23:16:25 UTC

Re: a "fair" similarity

Michael D. Curtin wrote:
> Daniel Naber wrote:
> 
>> Hi,
>>
>> as some of you may have noticed, Lucene prefers shorter documents over 
>> longer ones, i.e. shorter documents get a higher ranking, even if the 
>> ratio "matched terms / total terms in document" is the same.

There are even more interesting kinds of "unfairness".

Suppose we have a document.  We can turn the
document into a query in the obvious way (a set
of boolean SHOULD clauses with term frequencies
given by counts in the doc).

Lucene's IDF scaling is only applied to the query.
This is great for performance, because the doc vectors
remain stable as new docs are added.

Then, in general:

score(doc,doc) < score(doc,doc')

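if doc' is doc with each term's weight divided by that
term's IDF.  That is, the inversely IDF-scaled query
matches a document better than the document itself,
because the query-side IDF scaling cancels the division
and leaves a query vector that lines up exactly with
the unscaled doc vector.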
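
As a rough illustration of the inequality, here is a minimal
sketch under a simplified model: plain cosine similarity, with
IDF multiplied into the query weights only.  This is not
Lucene's actual scoring formula, and the term counts, the IDF
values, and the helper names (cosine, score) are all made up
for the example.  In the sketch the query is the first argument
to score.

```python
import math

def cosine(query, doc):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (query_norm * doc_norm)

def score(query_tf, doc_tf, idf):
    """Assumed model: IDF scales the query side only; the doc stays raw tf."""
    weighted_query = {term: tf * idf[term] for term, tf in query_tf.items()}
    return cosine(weighted_query, doc_tf)

# Hypothetical term counts for one document and hypothetical IDF values.
doc = {"lucene": 3.0, "the": 10.0, "similarity": 2.0}
idf = {"lucene": 4.0, "the": 1.0, "similarity": 5.0}

# doc' is the document with each term weight divided by its IDF.
doc_prime = {term: tf / idf[term] for term, tf in doc.items()}

self_score = score(doc, doc, idf)         # the doc used as its own query
prime_score = score(doc_prime, doc, idf)  # inversely IDF-scaled query

print(self_score, prime_score)
```

Under these assumptions the IDF-scaled doc' query points in the
same direction as the doc vector, so its cosine is as high as it
can be, while the doc used as its own query is skewed toward its
rare terms and scores lower.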

- Bob Carpenter
   Alias-i
