Posted to java-user@lucene.apache.org by Barbara Krausz <bk...@web.de> on 2005/04/20 14:04:32 UTC

Scoring, cosine measure

Hi,
I'm currently writing my Bachelor's thesis about Lucene. I searched for
theoretical information, for example about the IR model Lucene uses,
but I couldn't find anything, so I had to figure it out on my own.
I think Lucene uses the vector space model with a variation of the
cosine measure (the cosine measure is described in "Modern Information
Retrieval"): instead of dividing by the length of the document vector,
it divides the score by the fieldNorm. So the scoring formula can be
written as:

score(d,q) = sum (i=1 to t) ( wid * wiq / (sqrt(m) * |q|) )

t: number of distinct terms in the collection
wid: weight of term i in document d
wiq: weight of term i in the query
m: total number of terms in the field (sqrt(m) = fieldNorm)
|q|: length of the query vector q

The weight wid is the product of idf and tf. The weight wiq is just
idf. So the only difference from the cosine measure is the use of
sqrt(m) instead of the length of the document vector, isn't it? Why?
Is it too difficult to compute this length?
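
For illustration, the formula above can be written out directly; the
class and method names below are made up for this sketch and are not
Lucene's actual Similarity code:

```java
// Toy sketch of the scoring formula from the mail above.
// Illustration only; not Lucene's Similarity implementation.
public class ScoreSketch {

    // wd[i]: weight of term i in the document (tf * idf per the mail)
    // wq[i]: weight of term i in the query (idf per the mail)
    // m:     total number of terms in the field, so sqrt(m) = fieldNorm
    public static double score(double[] wd, double[] wq, int m) {
        double dot = 0.0;      // sum over i of wd[i] * wq[i]
        double qNormSq = 0.0;  // |q|^2
        for (int i = 0; i < wd.length; i++) {
            dot += wd[i] * wq[i];
            qNormSq += wq[i] * wq[i];
        }
        return dot / (Math.sqrt(m) * Math.sqrt(qNormSq));
    }
}
```

With wd = wq = (1, 1) and m = 4 this gives 2 / (2 * sqrt(2)), about 0.71.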
Has anyone tried an index based on n-grams?

Thanks
Barbara

PS: I will probably have more questions about Lucene in the coming weeks :-)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Scoring, cosine measure

Posted by David Spencer <da...@tropo.com>.
Daniel Naber wrote:

> On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
> 
> 
>>>Has anyone tried an index based on n-grams?
>>
>>Nutch has bigrams for phrases with frequently occurring words.
> 
> 
> Also the spell checker in SVN uses n-grams, I think.


SVN here:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker/

A weblog entry about it, with more links:

http://www.searchmorph.com/weblog/index.php?id=23

> 
> Regards
>  Daniel
> 




Re: Scoring, cosine measure

Posted by Andrzej Bialecki <ab...@getopt.org>.
Daniel Naber wrote:
> On Wednesday 20 April 2005 18:22, Paul Elschot wrote:
> 
> 
>>>Has anyone tried an index based on n-grams?
>>
>>Nutch has bigrams for phrases with frequently occurring words.
> 
> 
> Also the spell checker in SVN uses n-grams, I think.

Yes, but Nutch uses word n-grams, whereas the spell checker uses 
character n-grams.
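
The distinction can be sketched as follows; this is an illustration of
the two kinds of n-grams, with made-up names, not the actual Nutch or
spell checker code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of word n-grams vs. character n-grams.
public class NGramSketch {

    // Word bigrams, the kind Nutch forms for phrases with common words.
    public static List<String> wordBigrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    // Character n-grams, the kind the contrib spell checker indexes.
    public static List<String> charNGrams(String word, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }
}
```

Here wordBigrams("the quick brown") yields ["the quick", "quick brown"],
while charNGrams("lucene", 3) yields ["luc", "uce", "cen", "ene"].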



-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Scoring, cosine measure

Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 20 April 2005 18:22, Paul Elschot wrote:

> > Has anyone tried an index based on n-grams?
>
> Nutch has bigrams for phrases with frequently occurring words.

Also the spell checker in SVN uses n-grams, I think.

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Scoring, cosine measure

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 20 April 2005 14:04, Barbara Krausz wrote:
> Hi,
> I'm currently writing my Bachelor's thesis about Lucene. I searched
> for theoretical information, for example about the IR model Lucene
> uses, but I couldn't find anything, so I had to figure it out on my
> own.
> I think Lucene uses the vector space model with a variation of the
> cosine measure (the cosine measure is described in "Modern
> Information Retrieval"): instead of dividing by the length of the
> document vector, it divides the score by the fieldNorm. So the
> scoring formula can be written as:
> 
> score(d,q) = sum (i=1 to t) ( wid * wiq / (sqrt(m) * |q|) )
> 
> t: number of distinct terms in the collection
> wid: weight of term i in document d
> wiq: weight of term i in the query
> m: total number of terms in the field (sqrt(m) = fieldNorm)
> |q|: length of the query vector q

There is also a coordination factor:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 
> The weight wid is the product of idf and tf. The weight wiq is just
> idf. So the only difference from the cosine measure is the use of
> sqrt(m) instead of the length of the document vector, isn't it? Why?
> Is it too difficult to compute this length?

The square root is a variation on the power norms of the extended
boolean model. It helps to make each extra occurrence in the same document
less important.

The field length is computed under the hood by counting the indexed
terms. Its square root is stored as a single byte in a special
representation with a 3-bit mantissa and a 5-bit exponent.
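
A one-byte encoding of that shape can be sketched like this; it is in
the spirit of what Paul describes, but the code below is an
illustration with made-up names, not the exact code shipped in Lucene:

```java
// Sketch of a one-byte float encoding with a 3-bit mantissa and a
// 5-bit exponent. Illustration only, not Lucene's shipped code.
public class NormEncoding {

    // Truncate an IEEE 754 float to 3 mantissa bits and 5 exponent bits.
    public static byte encode(float f) {
        if (f <= 0.0f) return 0;
        int bits = Float.floatToIntBits(f);
        // Drop all but the top 3 mantissa bits, keeping the exponent.
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;      // smallest representable exponent
        if (small <= zero) return (byte) 1;          // underflow -> smallest
        if (small >= zero + 0x100) return (byte) -1; // overflow  -> largest
        return (byte) (small - zero);
    }

    // Invert encode(); exact for values whose mantissa fits in 3 bits.
    public static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;        // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }
}
```

Values like 1.0f and 5.0f round-trip exactly, while values whose
mantissa needs more than 3 bits (e.g. 1.1f) come back truncated. That
precision loss is the trade-off for fitting a norm into one byte.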

> Has anyone tried an index based on n-grams?

Nutch has bigrams for phrases with frequently occurring words.
 
Regards,
Paul Elschot

