
Posted to java-user@lucene.apache.org by Kamal Najib <ka...@mytum.de> on 2009/07/02 10:49:34 UTC

Re: A simple Vector Space Model and TFIDF usage

Hello Amir,
As far as I understand, you have two sets of documents, say set1 and set2. To get the similarity between the documents of the two sets, index the docs of one set and search with each doc of the other set as a query; the score of each hit is then the similarity between the query doc and the hit. So:
1. Index the docs of set1.
2. For each doc-element from set2 do:
   Create a query that contains the content text of the doc-element.
   Search with it in your indexed docs from set1.
   From the hits you get back, read the score of each hit: it is the similarity between the doc-element and that hit.
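
The two steps above can be sketched without Lucene to make the data flow visible. This is only a minimal illustration, assuming a tiny in-memory inverted index: the class SetSimilaritySketch and its methods index() and search() are hypothetical names, not part of the Lucene API, and the score is a plain dot product of raw term frequencies rather than Lucene's actual scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for an index over set1 (not the Lucene API).
final class SetSimilaritySketch {
    // term -> (doc id in set1 -> term frequency in that doc)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 1: index one document of set1 and return its id.
    int index(String text) {
        int id = docCount++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashMap<>())
                    .merge(id, 1, Integer::sum);
        }
        return id;
    }

    // Step 2: use the text of a set2 document as the query.
    // Returns doc id -> score, where the score is the dot product of the
    // raw term-frequency vectors of the query and the indexed document.
    // Documents sharing no term with the query do not appear in the result.
    Map<Integer, Integer> search(String queryText) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : tokenize(queryText)) {
            Map<Integer, Integer> hits = postings.get(term);
            if (hits == null) {
                continue;
            }
            for (Map.Entry<Integer, Integer> e : hits.entrySet()) {
                scores.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return scores;
    }
}
```

Calling index() once per set1 document and then search() once per set2 document gives one row of the set2-by-set1 similarity table per query. Only documents that share at least one term with the query are touched, which is the reason to go through an index instead of comparing every pair directly.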

The directory where your indexed docs are saved represents the vector space model you want to build. If you want to see how Lucene computes the score, you can use the classes Explanation and Similarity in the Lucene API, and you will see that Lucene treats documents and queries in the same way as a vector space model does. In the Explanation output you can see that Lucene uses the TF, IDF and DF to compute the resulting score.
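
For reference, here is a small sketch of what "TF-IDF vectors compared by cosine" means in the textbook vector space model. It assumes the classic weight tf * ln(N / df); Lucene's own scoring formula differs in its details, so this is not a reproduction of it. The class TfIdfCosineSketch and its methods are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Textbook TF-IDF and cosine similarity (assumed formula, not Lucene's).
final class TfIdfCosineSketch {
    // Build a sparse TF-IDF vector for one tokenized document.
    // tf  = number of occurrences of the term in the document
    // df  = number of corpus documents containing the term
    // idf = ln(N / df), with N the number of documents in the corpus
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) {
                    df++;
                }
            }
            if (df > 0) {
                double idf = Math.log((double) corpus.size() / df);
                vector.put(e.getKey(), e.getValue() * idf);
            }
        }
        return vector;
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

A term that occurs in every document gets idf = ln(1) = 0 and so contributes nothing to the similarity, while rare shared terms dominate it. Calling cosine() for every pair of documents fills a doc-doc matrix, but the number of pairs grows quadratically with the size of the collection.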
Best regards.
Kamal.
Original Message:

Hi,
It's my first experiment with Lucene. Please help me.
I'm going to index a set of documents and create a feature vector for each of them. This vector contains all terms that belong to the document, weighted using TFIDF.
After that I want to compute the cosine similarity between all documents and produce a doc-doc similarity matrix. My document set is large and it's important to have a scalable implementation.
Would you please provide me a guideline or to-do list?
Thank you and kind regards.
