Posted to solr-user@lucene.apache.org by Ninad Raut <hb...@gmail.com> on 2009/09/23 11:44:29 UTC

Finding near duplicates which searching Documents

Hi,
When we crawl news content, we face the problem of the same content being
repeated in many documents. We want to add a near-duplicate document filter
to detect such documents. Is there a way to do that in SOLR?
Regards,
Ninad Raut.

Re: Finding near duplicates which searching Documents

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 23, 2009, at 2:55 PM, Jason Rutherglen wrote:

> I don't think this handles near duplicates, which would require some of
> the methods mentioned recently on the Mahout list.

The signature implementation is pluggable, and I believe TextProfileSignature
is a fuzzy implementation in Solr that was brought over from Nutch.
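
To illustrate what a "fuzzy" signature means here, the following is a minimal
Python sketch of the general idea: hash a coarse token-frequency profile of the
text instead of the exact bytes, so that small wording changes do not change
the hash. It is not Solr's or Nutch's TextProfileSignature code. The function
name fuzzy_signature, the parameters quant_rate and min_token_len, and their
defaults are hypothetical and chosen only for this example.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, quant_rate=0.01, min_token_len=2):
    """Return an MD5 hex digest of a quantized token-frequency profile.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Lowercase, split on anything that is not a letter or digit,
    # and drop tokens shorter than min_token_len.
    tokens = [
        t for t in re.split(r"[^a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    if not tokens:
        return hashlib.md5(b"").hexdigest()

    counts = Counter(tokens)
    max_freq = max(counts.values())

    # Pick a quantization step relative to the most frequent token.
    quant = max(1, round(max_freq * quant_rate))
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and discard
    # tokens that fall below one step (rare words are ignored).
    profile = []
    for token, freq in counts.items():
        rounded = (freq // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Sort deterministically so equal profiles give equal hashes.
    profile.sort(key=lambda p: (-p[1], p[0]))
    payload = "\n".join(f"{t} {f}" for t, f in profile)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "Solr adds dedupe support. Solr dedupe support ships in Solr 1.4."
    b = "Solr adds dedupe support! Solr dedupe support arrives with Solr 1.4"
    print(fuzzy_signature(a) == fuzzy_signature(b))
```

In this sketch, two texts that differ only in words occurring once end up with
the same signature, while an exact hash of the raw text would differ. How
aggressive the matching is depends on the quantization step and the minimum
token length, so those values would need tuning against real crawled content.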

Agree on the Mahout discussion, too, though: http://www.lucidimagination.com/search/document/9d7ad3a882e2a944/finding_the_similarity_of_documents_using_mahout_for_deduplication#b0321c0f25f835a0

>
> On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar
> <sh...@gmail.com> wrote:
>> On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hbase.user.ninad@gmail.com> wrote:
>>
>>> Hi,
>>> When we crawl news content, we face the problem of the same content being
>>> repeated in many documents. We want to add a near-duplicate document filter
>>> to detect such documents. Is there a way to do that in SOLR?
>>>
>>
>> Look at http://wiki.apache.org/solr/Deduplication
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Finding near duplicates which searching Documents

Posted by Jason Rutherglen <ja...@gmail.com>.
I don't think this handles near duplicates, which would require some of
the methods mentioned recently on the Mahout list.

On Wed, Sep 23, 2009 at 2:59 AM, Shalin Shekhar Mangar
<sh...@gmail.com> wrote:
> On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hb...@gmail.com>wrote:
>
>> Hi,
>> When we crawl news content, we face the problem of the same content being
>> repeated in many documents. We want to add a near-duplicate document filter
>> to detect such documents. Is there a way to do that in SOLR?
>>
>
> Look at http://wiki.apache.org/solr/Deduplication
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: Finding near duplicates which searching Documents

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Sep 23, 2009 at 3:50 PM, Ninad Raut <hb...@gmail.com>wrote:

> Is this feature included in SOLR 1.4??
>

Yep.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Finding near duplicates which searching Documents

Posted by Ninad Raut <hb...@gmail.com>.
Is this feature included in SOLR 1.4??

On Wed, Sep 23, 2009 at 3:29 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hbase.user.ninad@gmail.com> wrote:
>
> > Hi,
> > When we crawl news content, we face the problem of the same content being
> > repeated in many documents. We want to add a near-duplicate document filter
> > to detect such documents. Is there a way to do that in SOLR?
> >
>
> Look at http://wiki.apache.org/solr/Deduplication
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: Finding near duplicates which searching Documents

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Sep 23, 2009 at 3:14 PM, Ninad Raut <hb...@gmail.com>wrote:

> Hi,
> When we crawl news content, we face the problem of the same content being
> repeated in many documents. We want to add a near-duplicate document filter
> to detect such documents. Is there a way to do that in SOLR?
>

Look at http://wiki.apache.org/solr/Deduplication

-- 
Regards,
Shalin Shekhar Mangar.