You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Lokeya <lo...@gmail.com> on 2007/08/16 20:20:16 UTC

Document Similarities lucene(particularly using doc id's)

Hi All,

I have the following set up: a) Indexed set of docs. b) Ran 1st query and
got tops docs  c) Fetched the id's from that and stored in a data structure.
d) Ran 2nd query , got top docs , fetched id's and stored in a data
structure.

Now i have 2 sets of doc ids (set 1) and (set 1).

I want to find out the document content similarity between these 2 sets(just
using doc ids information which i have).

Qn 1: Is it possible using any lucene api's. In that case can you point me
to the appropriate API's. I did a search at
:http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html
But couldn't find anything.

Qn 2: If this can't be done then whats the best way to do this using already
available Lucene Api's.

Thanks in Advance.
-- 
View this message in context: http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12186723
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Posted by Karl Wettin <ka...@gmail.com>.

20 aug 2007 kl. 05.19 skrev Lokeya:
> Grant Ingersoll-6 wrote:
>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>> I want to find out the document content similarity

>> A common way of doing this is by calculating the cosine of the angle
>> between the two vectors.

> I can use the getTermFreqVector() on Index Reader and get it. But I am
> wondering whats the API which has to be used to find the similarity  
> between
> 2 such vectors which would give a score (doc-doc similairty in   
> essence).

Bob Carpenter wrote an article on the subject for "Lucene in Action".  
He also
works on LingPipe, a semi-free peice of software that might be  
helpful if
your Greek kung fu is too weak.

<http://www.alias-i.com/lingpipe/docs/api/com/aliasi/spell/ 
TfIdfDistance.html>



-- 
karl





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Posted by Lokeya <lo...@gmail.com>.

Thanks for your reply. I will answer your questions below one by one:

What is the bigger goal you are trying to get at? 

I am trying to implement a "Search System" where we have basically 2 input
queries and output should be chain of documents which connects the 2 input
queries. These chain of documents should be ranked according to
relevance.This has following steps:

1.Create 2 end point document sets (for 2 user queries)
2. Expand queries (using centroid terms)
3. Run that expanded query and get an intermediate document set(which will
have documents that would connect end points)
4. Having 2 end points and an intermediate set, try to find the
connections(doc-doc similarities) and get the pathways(chain of documents
that connect 1st and 2nd query). So a chain can have maximum of 4 and
minimum of 1( doc same in all 3 sets which is the highly relevant for 2
input queries)
5. We have in mind specific applications which could use this set up and
right now doing a prototype system using citeseer documents as sample data
set which is around a million documents.

I am done with the steps 1, 2 and 3. And trying to get the 4th module done.

What do you need  the similarity score for? 

This will be used to rank documents and make pathways (which is the ultimate
goal)

Do you need to compare every item in set 1 against every item in set 2?

Yes I should do this . Like i said i have 3 sets :Left set for user query 1,
Right set for user query 2, and intermediate set for expanded query. First i
have to do n X n comparison for all unique docs in intermediate set and
store compared doc ids , score info in a data structure. Then take set 1
docs and compare it with intermediate set and look for docs having scores
more than a  fixed threshold, do this for Intermediate set and right side
set also. At the end we can come up with the relevant chain of documents.

In your initial reply there was a qn :

What do the doc ids have to do with the content? , the reason why i asked
this question was i was wondering if there is some implementation which
already exists(to do a doc-doc comparision based on doc-id's which would be
a cool one for users.)

Grant Ingersoll-6 wrote:
> 
> You would have write it, as it doesn't exist in Lucene (but could be  
> a useful contribution).  The easiest version is probably the cosine  
> similarity, described at http://en.wikipedia.org/wiki/Vector_space_model
> 
> Essentially, you have two vectors, and you need to calculate the  
> cosine of the angle between them.  Then you can do this for each one  
> in your set.
> 
> What is the bigger goal you are trying to get at?  What do you need  
> the similarity score for?  Do you need to compare every item in set 1  
> against every item in set 2?
> 
> On Aug 19, 2007, at 11:19 PM, Lokeya wrote:
> 
>>
>> Hi,
>>
>> Thanks for your reply.
>>
>> I can use the getTermFreqVector() on Index Reader and get it. But I am
>> wondering whats the API which has to be used to find the similarity  
>> between
>> 2 such vectors which would give a score (doc-doc similairty in   
>> essence).
>>
>> Thanks.
>>
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Hi,
>>>
>>>
>>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>>
>>>>
>>>> Hi All,
>>>>
>>>> I have the following set up: a) Indexed set of docs. b) Ran 1st
>>>> query and
>>>> got tops docs  c) Fetched the id's from that and stored in a data
>>>> structure.
>>>> d) Ran 2nd query , got top docs , fetched id's and stored in a data
>>>> structure.
>>>>
>>>> Now i have 2 sets of doc ids (set 1) and (set 1).
>>>>
>>>> I want to find out the document content similarity between these 2
>>>> sets(just
>>>> using doc ids information which i have).
>>>>
>>>
>>> Not sure what you mean here.  What do the doc ids have to do with the
>>> content?
>>>
>>>> Qn 1: Is it possible using any lucene api's. In that case can you
>>>> point me
>>>> to the appropriate API's. I did a search at
>>>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/
>>>> javadoc/index.html
>>>> But couldn't find anything.
>>>>
>>>
>>> It is possible if you use Term Vectors (see
>>> IndexReader.getTermFreqVector).  You will need to store (when you
>>> construct your Field) and load the term vectors and then calculate
>>> the similarity.  A common way of doing this is by calculating the
>>> cosine of the angle between the two vectors.
>>>
>>> -Grant
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>> -- 
>> View this message in context: http://www.nabble.com/Document- 
>> Similarities-lucene%28particularly-using-doc-id%27s%29- 
>> tf4281286.html#a12229492
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> 
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12238336
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Posted by Grant Ingersoll <gs...@apache.org>.

You would have write it, as it doesn't exist in Lucene (but could be  
a useful contribution).  The easiest version is probably the cosine  
similarity, described at http://en.wikipedia.org/wiki/Vector_space_model

Essentially, you have two vectors, and you need to calculate the  
cosine of the angle between them.  Then you can do this for each one  
in your set.

What is the bigger goal you are trying to get at?  What do you need  
the similarity score for?  Do you need to compare every item in set 1  
against every item in set 2?

On Aug 19, 2007, at 11:19 PM, Lokeya wrote:

>
> Hi,
>
> Thanks for your reply.
>
> I can use the getTermFreqVector() on Index Reader and get it. But I am
> wondering whats the API which has to be used to find the similarity  
> between
> 2 such vectors which would give a score (doc-doc similairty in   
> essence).
>
> Thanks.
>
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hi,
>>
>>
>> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
>>
>>>
>>> Hi All,
>>>
>>> I have the following set up: a) Indexed set of docs. b) Ran 1st
>>> query and
>>> got tops docs  c) Fetched the id's from that and stored in a data
>>> structure.
>>> d) Ran 2nd query , got top docs , fetched id's and stored in a data
>>> structure.
>>>
>>> Now i have 2 sets of doc ids (set 1) and (set 1).
>>>
>>> I want to find out the document content similarity between these 2
>>> sets(just
>>> using doc ids information which i have).
>>>
>>
>> Not sure what you mean here.  What do the doc ids have to do with the
>> content?
>>
>>> Qn 1: Is it possible using any lucene api's. In that case can you
>>> point me
>>> to the appropriate API's. I did a search at
>>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/
>>> javadoc/index.html
>>> But couldn't find anything.
>>>
>>
>> It is possible if you use Term Vectors (see
>> IndexReader.getTermFreqVector).  You will need to store (when you
>> construct your Field) and load the term vectors and then calculate
>> the similarity.  A common way of doing this is by calculating the
>> cosine of the angle between the two vectors.
>>
>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://lucene.grantingersoll.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> -- 
> View this message in context: http://www.nabble.com/Document- 
> Similarities-lucene%28particularly-using-doc-id%27s%29- 
> tf4281286.html#a12229492
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Posted by Lokeya <lo...@gmail.com>.

Hi,

Thanks for your reply.

I can use the getTermFreqVector() on Index Reader and get it. But I am
wondering whats the API which has to be used to find the similarity between
2 such vectors which would give a score (doc-doc similairty in  essence). 

Thanks.



Grant Ingersoll-6 wrote:
> 
> Hi,
> 
> 
> On Aug 16, 2007, at 2:20 PM, Lokeya wrote:
> 
>>
>> Hi All,
>>
>> I have the following set up: a) Indexed set of docs. b) Ran 1st  
>> query and
>> got tops docs  c) Fetched the id's from that and stored in a data  
>> structure.
>> d) Ran 2nd query , got top docs , fetched id's and stored in a data
>> structure.
>>
>> Now i have 2 sets of doc ids (set 1) and (set 1).
>>
>> I want to find out the document content similarity between these 2  
>> sets(just
>> using doc ids information which i have).
>>
> 
> Not sure what you mean here.  What do the doc ids have to do with the  
> content?
> 
>> Qn 1: Is it possible using any lucene api's. In that case can you  
>> point me
>> to the appropriate API's. I did a search at
>> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ 
>> javadoc/index.html
>> But couldn't find anything.
>>
> 
> It is possible if you use Term Vectors (see  
> IndexReader.getTermFreqVector).  You will need to store (when you  
> construct your Field) and load the term vectors and then calculate  
> the similarity.  A common way of doing this is by calculating the  
> cosine of the angle between the two vectors.
> 
> -Grant
> 
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12229492
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document Similarities lucene(particularly using doc id's)

Posted by Grant Ingersoll <gs...@apache.org>.

Hi,


On Aug 16, 2007, at 2:20 PM, Lokeya wrote:

>
> Hi All,
>
> I have the following set up: a) Indexed set of docs. b) Ran 1st  
> query and
> got tops docs  c) Fetched the id's from that and stored in a data  
> structure.
> d) Ran 2nd query , got top docs , fetched id's and stored in a data
> structure.
>
> Now i have 2 sets of doc ids (set 1) and (set 1).
>
> I want to find out the document content similarity between these 2  
> sets(just
> using doc ids information which i have).
>

Not sure what you mean here.  What do the doc ids have to do with the  
content?

> Qn 1: Is it possible using any lucene api's. In that case can you  
> point me
> to the appropriate API's. I did a search at
> :http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/ 
> javadoc/index.html
> But couldn't find anything.
>

It is possible if you use Term Vectors (see  
IndexReader.getTermFreqVector).  You will need to store (when you  
construct your Field) and load the term vectors and then calculate  
the similarity.  A common way of doing this is by calculating the  
cosine of the angle between the two vectors.

-Grant

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org