You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Claudio Gennaro <cl...@isti.cnr.it> on 2009/08/13 18:28:51 UTC

Simple tf cosine similarity

I would like to know if there is a simple way to force Lucene to adopt the
simple cosine similarity of the term frequency vectors of the documents and
the query for ranking the result. In practice the score sc_i of the document
i should be given by:

sc_i = (D_i*Q)/(|D_i|*|Q|)

where D_i = vector of the term frequencies of document i;
	Q = vector of the term frequencies of the Query;
	* = scalar product;
	|| = norm of the vector (the square root of the sum of the squares
of the entries of the vector).

I wasn't able to find a way to evaluate |D_i|.

Thank you

Claudio


======================================
Ing. Claudio Gennaro, PhD
ISTI (Information Science and Technology Institute) 
Consiglio Nazionale delle Ricerche Area della Ricerca di Pisa (room I14) 
Via G. Moruzzi 1
56124 Pisa - ITALY
phone: +39 050 315 3077
mobile: +39 328 92 16 734
fax: +39 050 315 3464 or +39 050 315 2810
e-mail: claudio.gennaro@isti.cnr.it



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org