
Posted to java-user@lucene.apache.org by Aleksey Serba <as...@gmail.com> on 2006/01/20 14:49:36 UTC

Document similarity

Hello Lucene people!
First of all, I would like to thank all the community participants
(developers, users, and Erik and Otis for the "Lucene in Action" book)
for their great work.

As far as I understand, the two most popular approaches to document
similarity are:
1. a cosine metric computed over term vectors
2. constructing a MoreLikeThis query from a document

In my case, I need to filter similar documents out of search results,
so I want to determine document similarity during indexing, using
term vectors. Obviously, I can't compare the document currently being
indexed with every document in my collection. Should I narrow down the
candidates by constructing some kind of "LikeThis" query?
What are the best/common practices for this?

Thanks in advance,
Alex Serba

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Document similarity

Posted by Aleksey Serba <as...@gmail.com>.

Yonik, Klaus, thanks for your quick response.

Let me rephrase: I can't compare the document currently being processed
with all documents in my collection using the angle between documents in
term-vector space, because of performance issues. As far as I can see,
I can avoid the unnecessary comparisons: first build a query from the
document's terms, fetch the top N results, and compute the angle only
for those. Is that OK?
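
A minimal sketch of the second stage described above, in plain Java with
no Lucene dependency. It assumes the top N hits have already been fetched
and that each one is available as a term-frequency map; the class name,
the `candidates` map and the `threshold` parameter are hypothetical names
introduced here for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Second-stage check: exact cosine similarity over a short candidate list. */
final class CandidateRerank {

    /** Cosine of the angle between two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb; // only shared terms contribute
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // treat an empty document as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns the ids of candidates whose cosine with the new document is
     * at least the threshold. The candidates map (id -> term frequencies)
     * stands in for the top N hits of the first-stage query.
     */
    static List<String> nearDuplicates(Map<String, Integer> newDoc,
                                       Map<String, Map<String, Integer>> candidates,
                                       double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> c : candidates.entrySet()) {
            if (cosine(newDoc, c.getValue()) >= threshold) {
                result.add(c.getKey());
            }
        }
        return result;
    }
}
```

The cost is then N cosine computations per indexed document rather than
one per document in the collection. A suitable value for the threshold
depends on the data and would have to be tuned.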

My second question: how can I generate some information about document
similarity to store in the Lucene index? For example, a hash that has
the same value for similar documents, or something like that. That
would make it easy to filter out "supplemental" results.
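
One possible shape for such a hash is a SimHash-style fingerprint. This
is a swapped-in technique, not something proposed in this thread, and
the sketch below is plain Java with hypothetical names throughout. Note
that it does not give identical values for similar documents: similar
term distributions only tend to produce fingerprints that differ in few
bits, so filtering would compare Hamming distances rather than test for
equality.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** SimHash-style 64-bit fingerprint over a document's term frequencies. */
final class SimHashSketch {

    /** 64-bit FNV-1a hash of a term's UTF-8 bytes. */
    static long fnv1a64(String term) {
        long h = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV prime
        }
        return h;
    }

    /**
     * Each term votes on every bit position with its frequency as weight:
     * +weight if the term's hash has that bit set, -weight otherwise.
     * A fingerprint bit is 1 where the total vote is positive.
     */
    static long fingerprint(Map<String, Integer> termFreqs) {
        int[] votes = new int[64];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            long h = fnv1a64(e.getKey());
            int weight = e.getValue();
            for (int bit = 0; bit < 64; bit++) {
                if (((h >>> bit) & 1L) == 1L) {
                    votes[bit] += weight;
                } else {
                    votes[bit] -= weight;
                }
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fp |= (1L << bit);
            }
        }
        return fp;
    }

    /** Number of bit positions in which two fingerprints differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

The fingerprint could be stored as a field at indexing time and compared
between hits at search time. How many differing bits should still count
as "similar" is an open tuning question that this sketch does not answer.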

On 1/20/06, Yonik Seeley <ys...@gmail.com> wrote:
> If you don't want to store term vectors, you could also run the
> document fields through the analyzer yourself and collect the Tokens
> (you should still have the fields you just indexed, so there is no
> need to retrieve them again).
>
> -Yonik
>
> On 1/20/06, Klaus <kl...@vommond.de> wrote:
> >
> > >In my case, I need to filter similar documents out of search results,
> > >so I want to determine document similarity during indexing, using
> > >term vectors. Obviously, I can't compare the document currently being
> > >indexed with every document in my collection.
> >
> > Yes you can. Right after indexing the new document, fetch its term vector
> > from the index. Compute some kind of weight for each term and construct
> > a Boolean query from all the terms. You can use the term weights to
> > boost the term queries. The hits will be scored, and this score is a
> > measure of the similarity between the documents.
> >
> > peace

Re: Document similarity

Posted by Yonik Seeley <ys...@gmail.com>.

If you don't want to store term vectors, you could also run the
document fields through the analyzer yourself and collect the Tokens
(you should still have the fields you just indexed, so there is no
need to retrieve them again).
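
A rough sketch of the bookkeeping this implies, in plain Java. The
tokenizer here is only a stand-in (lower-casing and splitting on
non-alphanumeric characters) and is an assumption; in practice the tokens
should come from the same analyzer used for indexing, otherwise the
collected terms will not match the indexed ones. The class and method
names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Builds a term-frequency map from field values that are still in memory. */
final class TokenCollector {

    /**
     * Stand-in for a real analyzer: lower-cases each field value and splits
     * it on runs of characters that are neither letters nor digits.
     */
    static Map<String, Integer> termFrequencies(String... fieldValues) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String value : fieldValues) {
            String[] tokens = value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {          // split can yield a leading ""
                    freqs.merge(token, 1, Integer::sum);
                }
            }
        }
        return freqs;
    }
}
```

The resulting map plays the same role as a stored term vector for the
purpose of computing similarity, without the index storage cost.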

-Yonik

On 1/20/06, Klaus <kl...@vommond.de> wrote:
>
> >In my case, I need to filter similar documents out of search results,
> >so I want to determine document similarity during indexing, using
> >term vectors. Obviously, I can't compare the document currently being
> >indexed with every document in my collection.
>
> Yes you can. Right after indexing the new document, fetch its term vector
> from the index. Compute some kind of weight for each term and construct
> a Boolean query from all the terms. You can use the term weights to
> boost the term queries. The hits will be scored, and this score is a
> measure of the similarity between the documents.
>
> peace

Re: Document similarity

Posted by Klaus <kl...@vommond.de>.

>In my case, I need to filter similar documents out of search results,
>so I want to determine document similarity during indexing, using
>term vectors. Obviously, I can't compare the document currently being
>indexed with every document in my collection.

Yes you can. Right after indexing the new document, fetch its term vector
from the index. Compute some kind of weight for each term and construct
a Boolean query from all the terms. You can use the term weights to
boost the term queries. The hits will be scored, and this score is a
measure of the similarity between the documents.
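
As a sketch of the weighting and boosting steps, the plain-Java snippet
below turns term frequencies into a query string in Lucene's query
syntax, where `term^2.5` boosts a term. Several things here are
assumptions made for illustration: the tf-idf style weight formula, the
`maxTerms` cap (the suggestion above uses all terms), the caller-supplied
document frequencies and document count, and all class and method names.
Terms are assumed to contain no query-syntax special characters, since no
escaping is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Builds a boosted "term^weight" query string from a document's terms. */
final class BoostedQuerySketch {

    /** tf-idf style weight; this formula is an assumption. Expects numDocs >= 1. */
    static double weight(int termFreq, int docFreq, int numDocs) {
        return termFreq * (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    /** Keeps the maxTerms highest-weighted terms and joins them as term^boost. */
    static String buildQuery(Map<String, Integer> termFreqs,
                             Map<String, Integer> docFreqs,
                             int numDocs,
                             int maxTerms) {
        List<Map.Entry<String, Double>> weighted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            weighted.add(Map.entry(e.getKey(), weight(e.getValue(), df, numDocs)));
        }
        // Highest weight first; ties broken alphabetically for stable output.
        weighted.sort((x, y) -> {
            int byWeight = Double.compare(y.getValue(), x.getValue());
            return byWeight != 0 ? byWeight : x.getKey().compareTo(y.getKey());
        });
        StringBuilder query = new StringBuilder();
        int limit = Math.min(maxTerms, weighted.size());
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                query.append(' ');
            }
            query.append(weighted.get(i).getKey())
                 .append('^')
                 .append(String.format(Locale.ROOT, "%.2f", weighted.get(i).getValue()));
        }
        return query.toString();
    }
}
```

The string would then be handed to a query parser for the relevant field.
Building the Boolean query programmatically from term queries, as
suggested above, avoids the escaping issue entirely.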

peace 

