Posted to dev@lucene.apache.org by Christoph Goller <go...@apache.org> on 2004/10/31 17:00:24 UTC

About Hit Scoring

I looked at the scoring mechanism more closely again. Some of you may
remember that there was a discussion about this recently. In
particular, there was some debate about the theoretical justification
of the current scoring algorithm. Chuck proposed that, at least from
a theoretical perspective, it would be good to normalize the document
vector and thus implement cosine similarity.

Well, we found out that this cannot be implemented efficiently.
However, I have now found that the current algorithm has a very
intuitive theoretical justification. Some of you may already know
this, but I had never looked into it that deeply.

Both the query and all documents are represented as vectors in term
vector space. The current scoring is simply the dot product of the
query with a document, normalized by the length of the query vector
(if we skip the additional coord factor). Geometrically speaking, this
is the distance of the document vector from the hyperplane through
the origin that is orthogonal to the query vector. See attached
figure.
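
To make the geometry concrete, here is a minimal sketch in Java. It
is not Lucene code: the class name, the method names and the example
weights are all made up for illustration, and the coord factor is
left out. It computes the score as q.d / |q| and, separately, the
distance of d from the hyperplane { x : q.x = 0 }, so the two values
can be compared. The two agree as long as q.d is not negative, which
holds when all term weights are non-negative.

```java
public class HitScoreSketch {

    // Dot product of two term-weight vectors of equal length.
    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean length of a vector.
    public static double norm(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    // Score as described above: dot product of query and document,
    // divided by the length of the query vector (no coord factor).
    public static double score(double[] query, double[] doc) {
        return dot(query, doc) / norm(query);
    }

    // Distance of doc from the hyperplane through the origin that is
    // orthogonal to query. The component of doc along query is exactly
    // the part of doc that sticks out of the hyperplane, so its length
    // is the distance.
    public static double distanceToHyperplane(double[] query, double[] doc) {
        double t = dot(query, doc) / dot(query, query);
        double[] alongQuery = new double[doc.length];
        for (int i = 0; i < doc.length; i++) {
            alongQuery[i] = t * query[i];
        }
        return norm(alongQuery);
    }

    public static void main(String[] args) {
        // Hypothetical term weights over a three-term vocabulary.
        double[] q = {3.0, 4.0, 0.0};
        double[] d = {1.0, 2.0, 5.0};
        System.out.println("score    = " + score(q, d));
        System.out.println("distance = " + distanceToHyperplane(q, d));
    }
}
```

Note that the third component of d, a term that does not occur in the
query, changes neither value. That is the difference from cosine
similarity, which would also divide by the length of d.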

Christoph





About Hit Scoring

Posted by Christoph Goller <go...@apache.org>.
It seems that the attached JPEG got deleted somehow.