
Posted to solr-user@lucene.apache.org by Rishabh Joshi <ri...@gmail.com> on 2007/11/16 10:27:15 UTC

Near Duplicate Documents

Hi,

I am evaluating "Solr 1.2" for my project and wanted to know if it can
return near duplicate documents (near dups) and how do I go about it? I am
not sure, but is "MoreLikeThisHandler" the implementation for near dups?

Rishabh

Re: Near Duplicate Documents

Posted by Ryan McKinley <ry...@gmail.com>.

Eswar K wrote:
> We have a scenario where we want to find documents which are similar in
> content. To elaborate a little more on what we mean here, let's take an
> example.
> 
> The email chain in which we are interacting illustrates the concept of
> near dupes well (we are not confusing them with threads; they are two
> different things). Each email in this thread is treated as a document by
> the system. A reply to the original mail also includes the original mail,
> in which case it becomes a near duplicate of the original mail (depending
> on the percentage of similarity), and so on. Near dupes need not be
> limited to emails.
> 
> If we want such a capability using Solr, can we use MoreLikeThisHandler,
> or is there any other appropriate handler in Solr which we can use? What
> is the best way to achieve such functionality?
> 

Mess around with the MoreLikeThisHandler and see if it gives you what you
are looking for.

Check:
http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field 
("email") is in the mlt.fl param.  Perhaps: mlt.fl=type,content
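
As a rough illustration of the mlt.fl suggestion, here is a minimal Python
sketch that only builds a request URL. The host, port, the "/mlt" handler
path, the "id:12345" query, and the helper name mlt_url are assumptions for
illustration; check your own solrconfig.xml for where (and whether) the
handler is registered.

```python
from urllib.parse import urlencode


def mlt_url(base, query, fields, rows=5):
    """Build a MoreLikeThis request URL (hypothetical helper)."""
    params = {
        "q": query,                   # selects the source document
        "mlt.fl": ",".join(fields),   # fields used to judge similarity
        "mlt.mintf": 1,               # minimum term frequency in the source doc
        "mlt.mindf": 1,               # minimum document frequency in the index
        "rows": rows,                 # number of similar documents to return
    }
    return base + "?" + urlencode(params)


# Assumed base URL and document id; adjust to your installation.
url = mlt_url("http://localhost:8983/solr/mlt", "id:12345", ["type", "content"])
print(url)
```

The low mlt.mintf and mlt.mindf values are only a starting point for small
test indexes; larger values reduce noise on real data.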

Re: Near Duplicate Documents

Posted by Mike Klaas <mi...@gmail.com>.

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

> Are there any plans to implement that feature in upcoming releases?

Not currently.  Feel free to contribute something if you find a good  
solution <g>.

-Mike


> On Nov 18, 2007 9:35 PM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>
>> On Nov 18, 2007 10:50 AM, Eswar K <kj...@gmail.com> wrote:
>>> We have a scenario where we want to find documents which are similar
>>> in content. To elaborate a little more on what we mean here, let's
>>> take an example.
>>>
>>> The email chain in which we are interacting illustrates the concept
>>> of near dupes well (we are not confusing them with threads; they are
>>> two different things). Each email in this thread is treated as a
>>> document by the system. A reply to the original mail also includes
>>> the original mail, in which case it becomes a near duplicate of the
>>> original mail (depending on the percentage of similarity), and so on.
>>> Near dupes need not be limited to emails.
>>
>> I think this is what's known as "shingling."  See
>> http://en.wikipedia.org/wiki/W-shingling
>> Lucene (and therefore Solr) does not implement shingling.  The
>> "MoreLikeThis" query might be close enough, however.
>>
>> -Stuart
>>

Re: Near Duplicate Documents

Posted by Eswar K <kj...@gmail.com>.

Are there any plans to implement that feature in upcoming releases?

Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra <ma...@stuartsierra.com> wrote:

> On Nov 18, 2007 10:50 AM, Eswar K <kj...@gmail.com> wrote:
> > We have a scenario where we want to find documents which are similar
> > in content. To elaborate a little more on what we mean here, let's
> > take an example.
> >
> > The email chain in which we are interacting illustrates the concept
> > of near dupes well (we are not confusing them with threads; they are
> > two different things). Each email in this thread is treated as a
> > document by the system. A reply to the original mail also includes
> > the original mail, in which case it becomes a near duplicate of the
> > original mail (depending on the percentage of similarity), and so on.
> > Near dupes need not be limited to emails.
>
> I think this is what's known as "shingling."  See
> http://en.wikipedia.org/wiki/W-shingling
> Lucene (and therefore Solr) does not implement shingling.  The
> "MoreLikeThis" query might be close enough, however.
>
> -Stuart
>

Re: Near Duplicate Documents

Posted by Stuart Sierra <ma...@stuartsierra.com>.

On Nov 18, 2007 10:50 AM, Eswar K <kj...@gmail.com> wrote:
> We have a scenario where we want to find documents which are similar in
> content. To elaborate a little more on what we mean here, let's take an
> example.
>
> The email chain in which we are interacting illustrates the concept of
> near dupes well (we are not confusing them with threads; they are two
> different things). Each email in this thread is treated as a document by
> the system. A reply to the original mail also includes the original mail,
> in which case it becomes a near duplicate of the original mail (depending
> on the percentage of similarity), and so on. Near dupes need not be
> limited to emails.

I think this is what's known as "shingling."  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
"MoreLikeThis" query might be close enough, however.
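
For readers unfamiliar with the term, a minimal Python sketch of
w-shingling follows. It is not something Solr or Lucene provides here; the
function names, the word-level tokenization, and the choice of w=3 are
assumptions made for illustration. It compares two texts by the overlap of
their sets of w-word sequences.

```python
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word sequences)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than w words: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the two shingle sets, between 0 and 1."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def containment(a, b, w=3):
    """Fraction of a's shingles that also occur in b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 1.0


original = "can solr return near duplicate documents"
reply = "yes try more like this > can solr return near duplicate documents"

print(resemblance(original, reply))   # overlap relative to both texts
print(containment(original, reply))   # how much of the original the reply holds
```

For the email case described above, containment is arguably the more useful
measure: a reply that quotes the whole original contains all of its
shingles even when the reply adds a lot of new text.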

-Stuart

Re: Near Duplicate Documents

Posted by Eswar K <kj...@gmail.com>.

We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.

The email chain in which we are interacting illustrates the concept of
near dupes well (we are not confusing them with threads; they are two
different things). Each email in this thread is treated as a document by
the system. A reply to the original mail also includes the original mail,
in which case it becomes a near duplicate of the original mail (depending
on the percentage of similarity), and so on. Near dupes need not be
limited to emails.

If we want such a capability using Solr, can we use MoreLikeThisHandler,
or is there any other appropriate handler in Solr which we can use? What
is the best way to achieve such functionality?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley <ry...@gmail.com> wrote:

> I'm not sure I understand your question...
>
> A "near duplicate document" could mean a LOT of things depending on the
> context.
>
> Perhaps you just need "fuzzy searching"?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
>
> or "proximity searches"?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
>
> MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
> used to search for other similar documents based on the results of
> another query.
>
> ryan
>
> rishabh9 wrote:
> > Can anyone help me?
> >
> > Rishabh
> >
> > rishabh9 wrote:
> >> Hi,
> >>
> >> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> >> return near duplicate documents (near dups) and how do I go about it?
> >> I am not sure, but is "MoreLikeThisHandler" the implementation for
> >> near dups?
> >>
> >> Rishabh

Re: Near Duplicate Documents

Posted by Ryan McKinley <ry...@gmail.com>.

I'm not sure I understand your question...

A "near duplicate document" could mean a LOT of things depending on the 
context.

Perhaps you just need "fuzzy searching"?
http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches

or "proximity searches"?
http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
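
To make the two query forms concrete, here is a small Python sketch that
only formats query strings in the Lucene query parser syntax linked above.
The helper names and example terms are made up for illustration. It assumes
the older parser, where the number after a fuzzy term is a similarity
between 0 and 1; newer Lucene versions take an edit distance instead. It
does not escape special characters in the input.

```python
def fuzzy(term, min_similarity=0.5):
    """Fuzzy term query, e.g. roam~0.8 (older similarity-based syntax)."""
    return f"{term}~{min_similarity}"


def proximity(phrase, slop):
    """Proximity query: words of the phrase within `slop` positions."""
    return f'"{phrase}"~{slop}'


print(fuzzy("roam", 0.8))                 # roam~0.8
print(proximity("jakarta apache", 10))    # "jakarta apache"~10
```

Either string would be passed as the q parameter of a normal search
request. Both match individual terms or phrases, so they address spelling
variation and word distance rather than whole-document similarity.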

MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is 
used to search for other similar documents based on the results of 
another query.

ryan

rishabh9 wrote:
> Can anyone help me?
> 
> Rishabh
> 
> 
> rishabh9 wrote:
>> Hi,
>>
>> I am evaluating "Solr 1.2" for my project and wanted to know if it can
>> return near duplicate documents (near dups) and how do I go about it? I am
>> not sure, but is "MoreLikeThisHandler" the implementation for near dups?
>>
>> Rishabh
>>
>>
>

Re: Near Duplicate Documents

Posted by rishabh9 <ri...@gmail.com>.

Can anyone help me?

Rishabh


rishabh9 wrote:
> 
> Hi,
> 
> I am evaluating "Solr 1.2" for my project and wanted to know if it can
> return near duplicate documents (near dups) and how do I go about it? I am
> not sure, but is "MoreLikeThisHandler" the implementation for near dups?
> 
> Rishabh
> 
> 
