Posted to java-user@lucene.apache.org by Barbara Krausz <bk...@web.de> on 2005/04/20 14:04:32 UTC
Scoring, cosine measure
Hi,
currently I'm writing my Bachelor's thesis about Lucene. I searched for
theoretical information, for example about the IR model Lucene uses, but
I couldn't find anything, so I had to figure it out on my own.
I think Lucene uses the vector space model with a variation of the
cosine measure (the cosine measure is described in "Modern Information
Retrieval"): instead of dividing by the length of the document vector,
it divides the score by the fieldNorm. So the scoring formula can be
written as:
score(d,q) = sum (i=1 to t) ( w_id * w_iq / (sqrt(m) * |q|) )
t: number of distinct terms in the collection
w_id: weight of term i in document d
w_iq: weight of term i in the query
m: total number of terms in the field (sqrt(m) = fieldNorm)
|q|: length of the query vector q
The weight w_id is the product of idf and tf. The weight w_iq is just
idf. So the only difference from the cosine measure is the use of sqrt(m)
instead of the length of the document vector, isn't it? Why? Is it too
difficult to compute this length?
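[The two measures contrasted above can be put side by side in a small sketch. The class and method names, and the toy weights, are made up for illustration; this is not Lucene API or Lucene's actual computation.]

```java
// Toy comparison of the plain cosine measure with the variant described
// above, where the document-vector length |d| is replaced by sqrt(m).
// The weights are illustrative tf-idf values, not Lucene's actual output.
public class ScoreSketch {

    // Classic cosine: sum(w_id * w_iq) / (|d| * |q|)
    static double cosine(double[] wd, double[] wq) {
        double dot = 0, nd = 0, nq = 0;
        for (int i = 0; i < wd.length; i++) {
            dot += wd[i] * wq[i];
            nd  += wd[i] * wd[i];
            nq  += wq[i] * wq[i];
        }
        return dot / (Math.sqrt(nd) * Math.sqrt(nq));
    }

    // Variant from the mail: divide by sqrt(m) (m = terms in the field)
    // instead of the document-vector length |d|.
    static double fieldNormVariant(double[] wd, double[] wq, int m) {
        double dot = 0, nq = 0;
        for (int i = 0; i < wd.length; i++) {
            dot += wd[i] * wq[i];
            nq  += wq[i] * wq[i];
        }
        return dot / (Math.sqrt(m) * Math.sqrt(nq));
    }

    public static void main(String[] args) {
        double[] wd = {0.5, 1.2, 0.0};   // toy document weights
        double[] wq = {0.5, 0.0, 0.8};   // toy query weights
        System.out.println(cosine(wd, wq));              // ~0.204
        System.out.println(fieldNormVariant(wd, wq, 4)); // ~0.132
    }
}
```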
Has anyone tried an index based on n-grams?
Thanks
Barbara
PS: I will probably have more questions about Lucene in the next few weeks :-)
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Scoring, cosine measure
Posted by David Spencer <da...@tropo.com>.
Daniel Naber wrote:
> On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
>
>
>>>Has anyone tried an index based on n-grams?
>>
>>Nutch has bigrams for phrases with frequently occurring words.
>
>
> Also the spell checker in SVN uses n-grams I think.
SVN here:
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker/
Weblog entry about it, which has more links:
http://www.searchmorph.com/weblog/index.php?id=23
>
> Regards
> Daniel
>
Re: Scoring, cosine measure
Posted by Andrzej Bialecki <ab...@getopt.org>.
Daniel Naber wrote:
> On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
>
>
>>>Has anyone tried an index based on n-grams?
>>
>>Nutch has bigrams for phrases with frequently occurring words.
>
>
> Also the spell checker in SVN uses n-grams I think.
Yes, but Nutch uses word n-grams, whereas the spell checker uses
character n-grams.
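[The distinction can be made concrete with a tiny sketch; the helper names are made up for illustration, and neither method is Lucene or Nutch API.]

```java
import java.util.ArrayList;
import java.util.List;

// Word n-grams vs. character n-grams, as contrasted above.
// Helper names are illustrative only.
public class NGrams {

    // Word bigrams (the Nutch-style unit): pairs of adjacent words.
    static List<String> wordBigrams(String text) {
        String[] words = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    // Character trigrams (the spell-checker-style unit): 3-letter windows.
    static List<String> charTrigrams(String word) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            out.add(word.substring(i, i + 3));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordBigrams("to be or not"));
        // [to be, be or, or not]
        System.out.println(charTrigrams("lucene"));
        // [luc, uce, cen, ene]
    }
}
```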
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Scoring, cosine measure
Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
> > Has anyone tried an index based on n-grams?
>
> Nutch has bigrams for phrases with frequently occurring words.
Also the spell checker in SVN uses n-grams I think.
Regards
Daniel
--
http://www.danielnaber.de
Re: Scoring, cosine measure
Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 20 April 2005 14:04, Barbara Krausz wrote:
> Hi,
> currently I'm writing my Bachelor's thesis about Lucene. I searched for
> theoretical information, for example about the IR model Lucene uses, but
> I couldn't find anything, so I had to figure it out on my own.
> I think Lucene uses the vector space model with a variation of the
> cosine measure (the cosine measure is described in "Modern Information
> Retrieval"): instead of dividing by the length of the document vector,
> it divides the score by the fieldNorm. So the scoring formula can be
> written as:
>
> score(d,q) = sum (i=1 to t) ( w_id * w_iq / (sqrt(m) * |q|) )
>
> t: number of distinct terms in the collection
> w_id: weight of term i in document d
> w_iq: weight of term i in the query
> m: total number of terms in the field (sqrt(m) = fieldNorm)
> |q|: length of the query vector q
There is also a coordination factor:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
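[The coordination factor in the linked Similarity docs defaults to overlap/maxOverlap, i.e. the fraction of query terms a document matches. A minimal sketch of that default, not a drop-in replacement for the class:]

```java
// Sketch of the coordination factor from the Similarity docs linked above:
// coord(q, d) rewards documents that match more of the query's terms.
// The default formula is simply overlap / maxOverlap.
public class CoordSketch {

    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        // A document matching 3 of 4 query terms is scaled by 0.75.
        System.out.println(coord(3, 4));  // 0.75
        System.out.println(coord(4, 4));  // 1.0
    }
}
```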
> The weight w_id is the product of idf and tf. The weight w_iq is just
> idf. So the only difference from the cosine measure is the use of sqrt(m)
> instead of the length of the document vector, isn't it? Why? Is it too
> difficult to compute this length?
The square root is a variation on the power norms of the extended
Boolean model. It helps to make each extra occurrence in the same document
less important.
The field length is computed under the hood by counting the indexed terms.
Its square root is stored as a single-byte value in a special representation
with a 3-bit mantissa and a 5-bit exponent.
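[The one-byte representation described here resembles Lucene's SmallFloat helper. A sketch of such an encode/decode pair follows; treat the details as illustrative of the 3-bit-mantissa/5-bit-exponent idea rather than as the exact shipped code. Note the lossiness: only values expressible with 3 mantissa bits round-trip exactly.]

```java
// Sketch of a 3-bit-mantissa / 5-bit-exponent one-byte float,
// along the lines of Lucene's SmallFloat (illustrative).
public class SmallFloatSketch {

    // Pack a float into one byte: keep the top 3 mantissa bits and
    // re-bias the exponent so 5 bits suffice.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);            // drop 21 low mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow / zero
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                 // overflow: saturate
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Unpack: shift the mantissa back and restore the exponent bias.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Powers of two round-trip exactly; other values lose precision.
        System.out.println(byte315ToFloat(floatToByte315(1.0f)));  // 1.0
        System.out.println(byte315ToFloat(floatToByte315(0.7f)));  // 0.625
    }
}
```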
> Has anyone tried an index based on n-grams?
Nutch has bigrams for phrases with frequently occurring words.
Regards,
Paul Elschot