Posted to dev@lucene.apache.org by Sebastian Menge <se...@uni-dortmund.de> on 2005/10/18 17:08:30 UTC

Similarity between Terms

Hi all,

Given an index, how can I (if I can) get the similarity between _terms_?

I read somewhere (in an intro to IR) that a term can be seen as a
document. Can I do that with Lucene, and how would one proceed? (A code
snippet would be great.)

Thanks a lot, Sebastian.

BTW: I found Lucene when looking for an LSA component. I already asked
for that on the general list. Other people are also looking for this
(e.g. fidde andersson). I have already been asked whether I got any
further, so it seems that there is demand for such a component. If I were
still a student I would try to extend Lucene to do something like that,
but today I don't have the resources; perhaps another person has.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Similarity between Terms

Posted by nh <ni...@yahoo.com>.

Hi,
Could you please tell me how I can calculate the relevance of a doc to the
query using a combination of cosine similarity AND the positions of the words?



--
View this message in context: http://lucene.472066.n3.nabble.com/SImilarity-between-Terms-tp566603p4323903.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


Re: Similarity between Terms

Posted by Joaquin Delgado <jo...@oracle.com>.

Sebastian,

There is no simple way of calculating similarity between terms in Lucene.

Normally documents are represented in the Vector Space Model (VSM), where
a weight is associated with each unique term in the document (e.g. term
frequency, the number of times a term occurs within the document). This
representation is used internally to calculate the similarity between
documents, treating a query as a special case of a short document. You can
get these term vectors per document with the Lucene API if the index was
built with the term vectors option. You can try building a term vs.
documents matrix by accumulating document term vectors and then applying
some LSA or co-occurrence based calculation as a similarity measure, but
this may be computationally very expensive with a huge matrix. Some
sampling-based techniques have been developed (please contact me directly
if you wish to learn more about them).
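
As a rough illustration of the accumulation step, here is a minimal sketch in
plain Java. It assumes the per-document term frequencies have already been
read from the index (how to do that depends on the Lucene version, so no
Lucene calls are shown). The class TermDocumentMatrix and its methods are
hypothetical names, not part of Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: accumulates per-document term
// frequencies into a sparse term-by-document matrix.
class TermDocumentMatrix {
    // term -> (document id -> term frequency); only non-zero entries are stored
    private final Map<String, Map<Integer, Integer>> rows = new LinkedHashMap<>();
    private int numDocs = 0;

    // Adds one document's term vector, given as term -> frequency,
    // and returns the document id assigned to it.
    int addDocument(Map<String, Integer> termFrequencies) {
        int docId = numDocs++;
        for (Map.Entry<String, Integer> e : termFrequencies.entrySet()) {
            rows.computeIfAbsent(e.getKey(), t -> new LinkedHashMap<>())
                .put(docId, e.getValue());
        }
        return docId;
    }

    // Returns the row for a term as a dense vector with one entry per document.
    // Unknown terms yield an all-zero vector.
    double[] termVector(String term) {
        double[] v = new double[numDocs];
        Map<Integer, Integer> row = rows.get(term);
        if (row != null) {
            for (Map.Entry<Integer, Integer> e : row.entrySet()) {
                v[e.getKey()] = e.getValue();
            }
        }
        return v;
    }

    int numDocs() { return numDocs; }

    int numTerms() { return rows.size(); }

    public static void main(String[] args) {
        TermDocumentMatrix m = new TermDocumentMatrix();
        m.addDocument(Map.of("lucene", 2, "index", 1));
        m.addDocument(Map.of("lucene", 1, "search", 3));
        System.out.println(Arrays.toString(m.termVector("lucene")));
    }
}
```

Storing only the non-zero entries keeps memory proportional to the number of
(term, document) pairs, which matters because a dense matrix over a large
index quickly becomes too big to hold.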

Now, regarding your comment about seeing a term as a document: if you
transpose the T x D matrix, you may think of a term as a document whose
vector representation contains one weight entry per document. Similar
vector space calculations (e.g. cosine-based similarity) can then be made
between terms. This only looks at first-degree co-occurrence, though (i.e.
how many documents share the terms), and does not capture semantic
transitivity (second or higher degree of co-occurrence), which is very
important for determining similarity between terms (for example, synonyms,
representing the same concept, may be used in different subsets of
documents and thus have a low first degree of co-occurrence).
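
To make the term-as-document idea concrete, here is a small sketch of cosine
similarity between two term vectors, each holding one weight per document.
The class name TermCosine and the example weights are made up for
illustration; nothing here is Lucene API.

```java
// Hypothetical helper: cosine similarity between two rows of a
// term-by-document matrix (one weight per document).
class TermCosine {
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // A term that occurs in no document has no direction; treat it as dissimilar.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up weights over four documents.
        double[] car = {2, 1, 0, 0};        // occurs only in documents 0 and 1
        double[] automobile = {0, 0, 1, 3}; // occurs only in documents 2 and 3
        double[] engine = {1, 1, 1, 1};     // occurs in all four documents
        System.out.println(cosine(car, automobile));
        System.out.println(cosine(car, engine));
        System.out.println(cosine(automobile, engine));
    }
}
```

In this made-up data "car" and "automobile" never share a document, so their
cosine is 0 even though both co-occur with "engine". That is the first-degree
limitation described above; capturing the indirect link needs a second-order
method such as LSA.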

-- Joaquin

