Posted to dev@lucene.apache.org by Doug Cutting <cu...@lucene.com> on 2003/11/03 18:02:05 UTC

Re: subclassing of IndexReader

Christoph Goller wrote:
> That's quite interesting. I am currently involved in a small crawling
> project. We only crawl a very limited number of news pages, some of them
> several times per day. We found that there are often tiny changes on
> these pages (spelling corrections, banner changes) which we would like
> to ignore (classify as duplicates), while we want to recognize bigger
> changes. For such a setting MD5 keys are not very helpful. How do you
> detect duplicates in Nutch?

Nutch currently only does MD5-based duplicate elimination.  So only 
exact duplicates are eliminated.
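
To make the limitation concrete, here is a minimal sketch of MD5-keyed duplicate elimination. It is not Nutch's code; the function names and the (url, content) pair layout are assumptions made for illustration only.

```python
import hashlib


def md5_key(content: bytes) -> str:
    """Return the hex MD5 digest of the raw page bytes."""
    return hashlib.md5(content).hexdigest()


def drop_exact_duplicates(pages):
    """Keep the first page seen for each MD5 key.

    `pages` is an iterable of (url, content_bytes) pairs; this layout
    is hypothetical and chosen only to keep the sketch short.
    """
    seen = set()
    unique = []
    for url, content in pages:
        key = md5_key(content)
        if key in seen:
            continue  # byte-for-byte identical to an earlier page
        seen.add(key)
        unique.append((url, content))
    return unique
```

Because the digest changes completely when a single byte changes, a page with a corrected typo or a rotated banner gets a new key and is kept as if it were a new document.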

There's been a fair amount of work on better methods.  For example, 
there was Broder et al.'s "Syntactic Clustering" work 
(http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/).
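
The core idea in that line of work is to compare documents by their overlapping word sequences ("shingles") instead of by a single hash. A rough sketch follows, assuming whitespace tokenization, a shingle width of 4, and plain set overlap as the resemblance measure; these are illustrative choices, not parameters taken from the paper, and the sketch only compares one pair of documents.

```python
def shingles(text: str, w: int = 4) -> set:
    """Return the set of contiguous w-word sequences in `text`.

    Tokenization (lowercase, split on whitespace) and w=4 are
    assumptions made for this sketch.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        return {tuple(tokens)}  # short text: treat it as one shingle
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    """Shared shingles divided by all distinct shingles of the two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents count as identical
    return len(sa & sb) / len(sa | sb)
```

A small edit only disturbs the shingles that overlap it, so the score stays close to 1 for near-duplicates and drops toward 0 for unrelated pages. A crawler could then treat pairs above some chosen threshold as duplicates; picking that threshold is left open here.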

However, I've never seen anyone demonstrate how such methods can be 
efficiently applied to huge collections.  Perhaps they can, but it's not 
obvious to me.  I've also not followed this literature closely.

Doug

