Posted to java-user@lucene.apache.org by jason <gi...@gmail.com> on 2006/04/28 07:54:51 UTC

for the similarity measure

Hi,

After reading the code, I found that the similarity measure in Lucene is not
the same as the commonly used cosine coefficient. I don't know whether this is
correct. I also wonder whether I can use the cosine coefficient in Lucene, or
perhaps Dice's coefficient, Jaccard's coefficient, or the overlap
coefficient.

Re: for the similarity measure

Posted by Sebastian Marius Kirsch <sk...@sebastian-kirsch.org>.

On Fri, Apr 28, 2006 at 01:54:51PM +0800, jason wrote:
> After reading the code, I found that the similarity measure in Lucene is not
> the same as the commonly used cosine coefficient. I don't know whether this is
> correct. I also wonder whether I can use the cosine coefficient in Lucene, or
> perhaps Dice's coefficient, Jaccard's coefficient, or the overlap
> coefficient.

No one seems to have answered this yet, so I guess I'll have a go.

I wrote down the following a while ago; I'm omitting boosts and coords
here, since you don't have to use them. It assumes that you are using
DefaultSimilarity and not a custom similarity implementation. You will
have to pick through the LaTeX code; it's rather difficult to render
formulas in ASCII.


Lucene uses a modified vector-space model; the main scoring formula is
\begin{equation}
\label{eq:lucenescore}
\score(\qu, \doc) = \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)} \cdot
  \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2} 
  \sqrt{\vphantom{\sum_{\term\in\qu} \idf(\term)^2}\sum_{\term\in\doc} \tf(\term, \doc)}} 
\end{equation}
where 
\[ \idf(\term) = \log\frac{|\Doc|}{\docfreq(\term) + 1} + 1 \]
Scores are normalized to fall in a range of 0.0 to 1.0.
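
For readers who would rather not pick through the LaTeX, here is a minimal
Java sketch of the simplified formula above (boosts and coords omitted). It
does not call any Lucene API; the class name, the method names and the toy
numbers are my own assumptions for illustration, and the query is assumed to
contain each term only once. It returns the raw value of the formula and does
not apply any final normalization.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Lucene.
class SimplifiedLuceneScore {

    // idf(t) = log(|D| / (docfreq(t) + 1)) + 1, as in the formula above.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Simplified score of one document for a query of distinct terms.
    // docTf maps each term of the document to its term frequency;
    // docFreq maps each term to the number of documents containing it.
    static double score(List<String> query, Map<String, Integer> docTf,
                        Map<String, Integer> docFreq, int numDocs) {
        double numerator = 0.0;
        double queryNormSq = 0.0;
        for (String term : query) {
            double w = idf(numDocs, docFreq.getOrDefault(term, 0));
            queryNormSq += w * w;                                        // sum of idf^2 over the query
            numerator += Math.sqrt(docTf.getOrDefault(term, 0)) * w * w; // sqrt(tf) * idf^2
        }
        int docLength = 0;                                               // sum of tf over the document
        for (int tf : docTf.values()) {
            docLength += tf;
        }
        if (queryNormSq == 0.0 || docLength == 0) {
            return 0.0;                                                  // empty query or empty document
        }
        return numerator / (Math.sqrt(queryNormSq) * Math.sqrt(docLength));
    }

    public static void main(String[] args) {
        // Toy numbers, invented for this example.
        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 4);
        docTf.put("search", 1);
        docTf.put("index", 4);
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 1);
        docFreq.put("search", 4);
        System.out.println(score(Arrays.asList("lucene", "search"), docTf, docFreq, 10));
    }
}
```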

This weighting scheme is easily related to the standard vector-space
model by using \(\sqrt{\tf(\term, \doc)}\) instead of \(\tf(\term, \doc)\)
as the term weight and defining \(\tf(\term,\qu)\equiv 1\) for every term
in the query (and 0 for all other terms). Then
\begin{align*}
  \score(\qu, \doc) &= \cos\angle(\vec{\qu}, \vec{\doc}) =
  \frac{\vec{\qu}\cdot\vec{\doc}}{\|\vec{\qu}\|\cdot \|\vec{\doc}\|}\\
  &= \frac{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \qu)}
      \idf(\term)\right)\left(\sqrt{\tf(\term, \doc)}
      \idf(\term)\right)}{ \sqrt{\sum_{\term\in\Term}
      \left(\sqrt{\tf(\term, \qu)} \idf(\term)\right)^2}
    \sqrt{\sum_{\term\in\Term} \left(\sqrt{\tf(\term, \doc)}
        \idf(\term)\right)^2}}\\
  &= \frac{\sum_{\term\in\qu} \sqrt{\tf(\term, \doc)}
    \idf(\term)^2}{\sqrt{\sum_{\term\in\qu} \idf(\term)^2}
    \sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}}
\end{align*} 
By dropping \(\idf(\term)^2\) from the document norm
\(\sqrt{\sum_{\term\in\doc} \tf(\term, \doc) \idf(\term)^2}\) in the
denominator, one arrives at the main scoring formula in
equation~(\ref{eq:lucenescore}).  Omitting the inverse document
frequency from the document normalization factor allows one to
precompute this factor and store it in the index; otherwise it would
be necessary to recompute the normalization factors every time a
document is added to or deleted from the index.
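
To make the precomputation argument concrete, here is a small Java sketch
that compares the two document normalization factors. Again, this is only an
illustration under my own assumptions: the class and method names are made
up, nothing here is Lucene API, and the numbers are invented. The
length-based factor depends on the document alone, whereas the cosine factor
changes as soon as the collection statistics change.

```java
import java.util.Map;

// Hypothetical helper class for illustration; not part of Lucene.
class NormFactorSketch {

    // idf(t) = log(|D| / (docfreq(t) + 1)) + 1
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Simplified norm: 1 / sqrt(sum of tf). Uses only the document itself,
    // so it can be computed once at indexing time and stored.
    static double lengthNorm(Map<String, Integer> docTf) {
        int length = 0;
        for (int tf : docTf.values()) {
            length += tf;
        }
        return length == 0 ? 0.0 : 1.0 / Math.sqrt(length);
    }

    // Full cosine norm: 1 / sqrt(sum of tf * idf^2). Needs numDocs and
    // docFreq, which change whenever documents are added or deleted.
    static double cosineNorm(Map<String, Integer> docTf,
                             Map<String, Integer> docFreq, int numDocs) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : docTf.entrySet()) {
            double w = idf(numDocs, docFreq.getOrDefault(e.getKey(), 0));
            sumSq += e.getValue() * w * w;
        }
        return sumSq == 0.0 ? 0.0 : 1.0 / Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        // Toy numbers, invented for this example.
        Map<String, Integer> docTf = Map.of("lucene", 3, "index", 1);
        Map<String, Integer> docFreq = Map.of("lucene", 2, "index", 5);
        System.out.println("length norm:          " + lengthNorm(docTf));
        System.out.println("cosine norm, 10 docs: " + cosineNorm(docTf, docFreq, 10));
        // One more document in the collection: only the cosine norm moves.
        System.out.println("cosine norm, 11 docs: " + cosineNorm(docTf, docFreq, 11));
    }
}
```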

-- 
Sebastian Kirsch <sk...@sebastian-kirsch.org> [http://www.sebastian-kirsch.org/]
