Posted to dev@nutch.apache.org by Otis Gospodnetic <og...@yahoo.com> on 2008/08/07 17:29:41 UTC
New algo: Near duplicate detection
This sounds simple and apparently it's effective...should anyone want to give it a try:
http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: New algo: Near duplicate detection
Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> I just saw that as well. I think it is worth a go implementing this.
>
> Dennis
>
> Otis Gospodnetic wrote:
>> This sounds simple and apparently it's effective...should anyone want
>> to give it a try:
>>
>> http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html
Interesting, I agree it's worth checking. The reference to the use of
inverted indexes is intriguing - perhaps we could use the already
existing Lucene index which is being de-duplicated.
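To make the inverted-index idea concrete, here is a minimal sketch. It is not the algorithm from the linked post and it does not touch the Lucene API. It assumes plain word shingles as signatures and Jaccard similarity as the measure, and the names shingles and near_duplicates are made up for illustration. The point is only that an inverted index from signature to documents restricts comparison to pairs that share at least one signature.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def near_duplicates(docs, k=3, threshold=0.5):
    """Return (id_a, id_b, similarity) for pairs with Jaccard similarity >= threshold.

    docs maps a document id to its text. Hypothetical helper, not a Nutch API.
    """
    sigs = {doc_id: shingles(text, k) for doc_id, text in docs.items()}

    # Inverted index: shingle -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, sig in sigs.items():
        for s in sig:
            index[s].add(doc_id)

    # Count shared shingles only for pairs that meet in some posting list.
    overlap = defaultdict(int)
    for posting in index.values():
        for a, b in combinations(sorted(posting), 2):
            overlap[(a, b)] += 1

    result = []
    for (a, b), shared in overlap.items():
        union = len(sigs[a]) + len(sigs[b]) - shared
        sim = shared / union
        if sim >= threshold:
            result.append((a, b, sim))
    return sorted(result)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely unrelated text about search engines",
    }
    print(near_duplicates(docs))
```

In a real index the posting lists would already exist, so the candidate-generation step would read them instead of building a dictionary in memory. Very common signatures produce long posting lists and would need pruning.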
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: New algo: Near duplicate detection
Posted by Dennis Kubes <ku...@apache.org>.
I just saw that as well. I think it is worth a go implementing this.
Dennis
Otis Gospodnetic wrote:
> This sounds simple and apparently it's effective...should anyone want to give it a try:
>
> http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>