Posted to dev@lucene.apache.org by Sebastian Menge <se...@uni-dortmund.de> on 2005/10/18 17:08:30 UTC
Similarity between Terms
Hi all,
Given an index, how can I (if I can) get the similarity between _terms_?
I read somewhere (in an intro to IR) that a term can be seen as a
document. Can I do that with Lucene, and how would one proceed? (A code
snippet would be great.)
Thanks a lot, Sebastian.
BTW: I found Lucene when looking for an LSA component. I already asked
about that on the general list. Other people are also looking for this
(e.g. fidde andersson), and I am already being asked whether I got any
further. So it seems that there is demand for such a component. If I were
still a student I would try to extend Lucene to do something like that, but
today I don't have the resources. Perhaps another person has.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Similarity between Terms
Posted by nh <ni...@yahoo.com>.
Hi,
Could you please tell me how I can calculate the relevance of a document
to the query using a combination of cosine similarity and the positions of
the words?
--
View this message in context: http://lucene.472066.n3.nabble.com/SImilarity-between-Terms-tp566603p4323903.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
Re: Similarity between Terms
Posted by Joaquin Delgado <jo...@oracle.com>.
Sebastian,
There is no simple way of calculating similarity between terms in Lucene.
Normally documents are represented in the Vector Space Model (VSM), where
a weight is associated with each unique term in the document (e.g. the
term frequency, i.e. the number of times the term occurs within the
document). This representation is used internally to calculate the
similarity between documents, treating a query as a special case of a
short document. You can get these term vectors per document with the
Lucene API if the index was built with the term vectors option. You can
try building a Term vs. Documents matrix by accumulating document term
vectors and then applying some LSA or co-occurrence based calculation
as a similarity measure, but this may be computationally very expensive
with a huge matrix. Some sampling-based techniques have been developed
(please contact me directly if you wish to learn more about them).
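A minimal sketch of the accumulation step described above, in plain Java and without any Lucene calls. It assumes the per-document term frequencies have already been read out of the index into maps. The class name TermDocumentMatrix and the method build are made up for illustration, and raw term frequency stands in for whatever weight you prefer. It uses List.of and Map.of, so it assumes Java 9 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermDocumentMatrix {

    /**
     * Accumulates per-document term frequencies into one row per term.
     * docs.get(d) maps each term of document d to its frequency.
     * The result maps each term to a vector with one entry per document.
     */
    public static Map<String, double[]> build(List<Map<String, Integer>> docs) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (Map.Entry<String, Integer> e : docs.get(d).entrySet()) {
                // Create an all-zero row the first time a term is seen.
                double[] row = rows.computeIfAbsent(e.getKey(), t -> new double[docs.size()]);
                row[d] = e.getValue();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical term frequencies for three tiny documents.
        List<Map<String, Integer>> docs = List.of(
                Map.of("lucene", 2, "index", 1),
                Map.of("lucene", 1, "search", 3),
                Map.of("search", 1, "index", 2));
        build(docs).forEach((term, row) ->
                System.out.println(term + " " + Arrays.toString(row)));
    }
}
```

Memory grows with (number of terms) x (number of documents), which is the cost problem mentioned above; a sparse representation would be the obvious next step for a real index.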
Regarding your comment about seeing a term as a document: if you
transpose the T x D matrix, you can think of a term as a document whose
vector representation contains one entry per document, holding the weight
of the term in that document. Similar vector space calculations
(e.g. cosine-based similarity) can then be made between terms. This only
looks at first-degree co-occurrence though (i.e. how many documents
share the terms) and does not capture semantic transitivity (second or
higher degree co-occurrence), which is very important for determining
similarity between terms (e.g. synonyms, which represent the same concept,
may be used in different subsets of documents and thus have low
first-degree co-occurrence).
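To make the term-as-document idea concrete, here is a small sketch of cosine similarity between two term vectors, where each entry is the weight of the term in one document. It is plain Java with no Lucene calls; the class name TermCosine and the example weights are hypothetical. The example also illustrates the first-degree limitation: two terms that never appear in the same document get a similarity of zero, even if they are synonyms.

```java
public class TermCosine {

    /** Cosine between two equal-length vectors; returns 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical weights of three terms over four documents d0..d3.
        double[] car = {1, 1, 0, 0};
        double[] automobile = {0, 0, 1, 1};
        double[] engine = {1, 0, 1, 0};

        // "car" and "automobile" share no document, so the cosine is 0.
        System.out.println(cosine(car, automobile));
        // "car" and "engine" share one document, so the cosine is about 0.5.
        System.out.println(cosine(car, engine));
    }
}
```

In this toy data "engine" co-occurs with both "car" and "automobile", which is the kind of second-degree link that a plain cosine over document entries ignores and that LSA-style methods try to exploit.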
-- Joaquin