Posted to java-user@lucene.apache.org by Fotis P <fo...@gmail.com> on 2015/05/21 16:49:41 UTC

Computing the similarity of documents

Hello everyone,

My task at hand is to compute the pairwise cosine similarity between a list
of documents.
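
To make the target quantity concrete, here is a minimal sketch of plain
cosine similarity in Java, with no Lucene involved. The names CosineSketch,
cosine, norm and termFreqs are hypothetical, the weights are assumed to be
raw term frequencies, and the whitespace tokenizer is only a stand-in for a
real analyzer:

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {

    // Cosine similarity between two term-frequency vectors stored as maps.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        // Only terms present in both documents contribute to the dot product.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for empty documents
        }
        return dot / (normA * normB);
    }

    // Euclidean length computed over ALL terms of a single document.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Counts lowercase, whitespace-separated tokens (assumed tokenization).
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase().trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreqs("lucene scores documents");
        Map<String, Integer> d2 = termFreqs("lucene indexes documents quickly");
        System.out.println(cosine(d1, d2));
    }
}
```

Values computed this way are symmetric and lie between 0 and 1 for
non-negative weights, which is the behaviour I would expect from the scores.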

I first index all the documents with the DOCS_AND_FREQS option, then
construct a query from every term of a document:

Query query = parser.parse(document);

making sure to use the same analyzer at indexing and search time.

I have also implemented my own similarity class so that I exclude coord(),
sloppyFreq() etc. My implementation is here: http://pastebin.com/MArCs3ff

I still don't get the correct results, however. The scores do make sense
from a search perspective, but they are not the values that I am
looking for.

I am a bit lost as to what I should change to fine-tune the behaviour exactly
as I want it. The Lucene scoring formula, for example, confuses me with this
part: Σ tf(t in d)
<http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html#formula_tf>
As I read it, the sum only takes into account terms that exist in the query
(in my case a document). Terms that exist in the other document but not in
the query do not alter the results, correct?
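
For comparison, this is the textbook cosine similarity between a query
vector q and a document vector d. The notation w_{t,x} for the weight of
term t in x is an assumption made here, not Lucene's own:

```latex
\cos(q, d) =
  \frac{\sum_{t} w_{t,q}\, w_{t,d}}
       {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}
```

The numerator effectively runs over shared terms only, since any other term
has zero weight on one side. The norm of d in the denominator, however,
covers every term of d, including those that do not appear in q.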

I hope what I am asking for is clear enough. If you need some more
information from me please ask.

Thank you in advance,

Fotios