Posted to solr-user@lucene.apache.org by Furkan KAMACI <fu...@gmail.com> on 2013/09/22 21:02:33 UTC

Near Duplicate Document Detection at Solr

I want to detect near-duplicate documents (web documents). I know that
there is an algorithm called Winnowing, and that Google uses another
technique. I also know that Solr has a component called MoreLikeThis.
Google's page explains that *mirroring and plagiarism* are easy to detect,
but that near-duplicate detection is much harder.

So my question is: what underlying algorithm does Solr's MoreLikeThis
component use, and can I use it for this purpose?

Otherwise, I will implement a near-duplicate document detection algorithm
within a few days, and I will be glad to contribute it to Solr.

Thanks;
Furkan KAMACI
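
For readers unfamiliar with the Winnowing idea mentioned above, here is a minimal sketch of it in Python. It is not Solr code and not anything proposed in this thread. The k-gram length, the window size, the use of truncated MD5 as the hash, and all function names are assumptions chosen only for illustration. The idea is to hash every k-character substring, slide a window over the hashes, and keep the smallest hash in each window as a fingerprint. Two documents can then be compared by the overlap of their fingerprint sets.

```python
import hashlib


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalized text."""
    # Normalization (an assumption): keep letters and digits only, lowercased.
    normalized = "".join(ch.lower() for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(normalized) - k + 1):
        digest = hashlib.md5(normalized[i:i + k].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))  # 32-bit hash
    return hashes


def winnow(text, k=5, window=4):
    """Return the set of (hash, position) fingerprints selected by winnowing."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    window = min(window, len(hashes))
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        current = hashes[start:start + window]
        smallest = min(current)
        # On ties, pick the rightmost occurrence of the minimum.
        offset = window - 1 - current[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def resemblance(a, b, k=5, window=4):
    """Jaccard overlap of the fingerprint hashes of two texts."""
    fa = {h for h, _ in winnow(a, k, window)}
    fb = {h for h, _ in winnow(b, k, window)}
    union = fa | fb
    if not union:
        return 0.0
    return len(fa & fb) / len(union)
```

With this sketch, texts that differ only in case, spacing, or punctuation get identical fingerprints, while unrelated texts share few or none. Choosing k and the window size for real web pages is a separate tuning question.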

RE: Near Duplicate Document Detection at Solr

Posted by Markus Jelsma <ma...@openindex.io>.
-----Original message-----
> From:Furkan KAMACI <fu...@gmail.com>
> Sent: Sunday 22nd September 2013 21:15
> To: solr-user@lucene.apache.org
> Subject: Re: Near Duplicate Document Detection at Solr
> 
> I also know that Solr has another mechanism:
> http://wiki.apache.org/solr/Deduplication
> I think I should add a custom signature, because that is the most usable
> one for me: http://wiki.apache.org/solr/TextProfileSignature

Keep in mind that its results are really bad for short documents, and that it does not work for languages that do not separate words with whitespace.
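
One way to see how both problems can arise is the simplified Python sketch below of a quantized term-frequency signature. It is only an illustration of the general idea behind a text-profile signature, not Solr's actual TextProfileSignature code. The whitespace tokenizer, the parameter names and defaults, and the exact quantization rule are assumptions made for the example.

```python
import hashlib
from collections import Counter


def profile_signature(text, quant_rate=0.01, min_token_len=2):
    """Simplified quantized term-frequency signature (illustrative only)."""
    # Assumption: tokens are split on whitespace and lowercased.
    tokens = [t for t in text.lower().split() if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Derive the quantization step from the most frequent token.
    max_freq = max(counts.values())
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop
    # tokens that fall below one full step.
    profile = []
    for token, count in counts.items():
        rounded = (count // quant) * quant
        if rounded >= quant:
            profile.append((token, rounded))

    # Sort so the signature does not depend on token order in the text.
    profile.sort(key=lambda item: (-item[1], item[0]))
    serialized = "\n".join(f"{token} {count}" for token, count in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


# Two unrelated short texts: only "the" occurs twice, so both profiles
# reduce to the same single entry and the signatures collide.
short_a = profile_signature("the cat sat on the mat")
short_b = profile_signature("the dog ran to the park")

# Text without whitespace is seen as a single token by str.split().
unsegmented_tokens = "这是一个没有空格的句子".split()
```

In this sketch the two short texts end up with the same signature because every token that occurs only once is discarded, and the unsegmented sentence yields exactly one token, so there is no useful frequency profile to hash.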

> On the other hand, are there any limitations on deduplication in
> SolrCloud?

Yes, it does not work:
https://issues.apache.org/jira/browse/SOLR-3473


Re: Near Duplicate Document Detection at Solr

Posted by Furkan KAMACI <fu...@gmail.com>.
I also know that Solr has another mechanism:
http://wiki.apache.org/solr/Deduplication
I think I should add a custom signature, because that is the most usable
one for me: http://wiki.apache.org/solr/TextProfileSignature
On the other hand, are there any limitations on deduplication in
SolrCloud?

What do you think?

