Posted to dev@nutch.apache.org by Otis Gospodnetic <og...@yahoo.com> on 2008/08/07 17:29:41 UTC

New algo: Near duplicate detection

This sounds simple and apparently it's effective...should anyone want to give it a try:

http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: New algo: Near duplicate detection

Posted by Andrzej Bialecki <ab...@getopt.org>.

Dennis Kubes wrote:
> I just saw that as well.  I think it is worth a go implementing this.
> 
> Dennis
> 
> Otis Gospodnetic wrote:
>> This sounds simple and apparently it's effective...should anyone want 
>> to give it a try:
>>
>> http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html

Interesting, I agree it's worth checking. The reference to the use of 
inverted indexes is intriguing - perhaps we could use the already 
existing Lucene index which is being de-duplicated.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: New algo: Near duplicate detection

Posted by Dennis Kubes <ku...@apache.org>.

I just saw that as well.  I think it is worth a go implementing this.

Dennis

Otis Gospodnetic wrote:
> This sounds simple and apparently it's effective...should anyone want to give it a try:
> 
> http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>