Posted to java-user@lucene.apache.org by Fotis P <fo...@gmail.com> on 2015/05/21 16:49:41 UTC
Computing the similarity of documents
Hello everyone,
My task at hand is to compute the pairwise cosine similarity between a list
of documents.
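To be explicit about the value I am after, here is a minimal sketch in plain Java with no Lucene involved. It assumes raw term counts as vector weights and whitespace tokenization in place of a real analyzer; the class and method names (CosineSketch, termFrequencies, cosine, pairwise) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CosineSketch {

    // Hypothetical stand-in for an analyzer: lowercase, split on whitespace, count terms.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector; covers every term of the document.
    static double norm(Map<String, Integer> tf) {
        double sumOfSquares = 0.0;
        for (int f : tf.values()) {
            sumOfSquares += (double) f * f;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product over shared terms divided by both vector lengths.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom; // empty documents get similarity 0
    }

    // Symmetric matrix of similarities between every pair of documents.
    static double[][] pairwise(List<String> docs) {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String d : docs) {
            vectors.add(termFrequencies(d));
        }
        int n = vectors.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(vectors.get(i), vectors.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] sim = pairwise(Arrays.asList("apache lucene search", "lucene search index", "cosine"));
        for (double[] row : sim) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

These are the numbers I would like the Lucene scores to reproduce, up to whatever term weighting ends up being used.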
I first index all the documents with the DOCS_AND_FREQS option, then I
construct a query from all the terms of a document:
Query query = parser.parse(document);
making sure to use the same analyzer at index time and at search time.
I have also implemented my own similarity class so that I exclude coord(),
sloppyFreq() etc. My implementation is here: http://pastebin.com/MArCs3ff
I still don't get the correct results, however. The scores do make sense
from a search perspective, but they are not the values that I am
looking for.
I am a bit lost as to what I should change to fine-tune the behaviour exactly
as I want it. The Lucene scoring formula for example confuses me with this
part: Σ tf(t in d)
<http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html#formula_tf>
This means that it only takes into account terms that exist in the query
(in my case a document). Terms that exist in the other document but not in
the query do not alter the results, correct?
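For comparison, this is the quantity I am trying to get. Assuming raw term frequencies as the vector weights (an assumption on my part; tf-idf weights would follow the same pattern), the cosine similarity between a query document q and an indexed document d would be:

```latex
\cos(q, d) =
  \frac{\sum_{t \in q \cap d} \mathrm{tf}(t, q)\,\mathrm{tf}(t, d)}
       {\sqrt{\sum_{t \in q} \mathrm{tf}(t, q)^{2}}\;\sqrt{\sum_{t \in d} \mathrm{tf}(t, d)^{2}}}
```

The numerator only involves terms shared by both documents, but the second factor in the denominator runs over every term of d, so terms that appear in d and not in q should still lower the value.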
I hope what I am asking for is clear enough. If you need some more
information from me please ask.
Thank you in advance,
Fotios