Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2005/10/08 17:38:21 UTC

dedup between segments

Hi,

Is there a way to delete duplicate documents across
two segments?

I see that DeleteDuplicates.java can dedup a single
segment based on each document's content MD5 and
URL.

But if I have two segments fetched in two different
time periods, and I am not sure whether they contain
duplicate documents, should I run dedup across them?
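
To make the question concrete, here is a rough sketch of what I
mean by dedup across segments. It is not Nutch code: the Doc
class, the dedup method and the choice to keep the first copy
seen are my own assumptions for illustration, and each segment is
simplified to an in-memory list. A document is dropped if its URL
or the MD5 of its content was already seen in any segment.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrossSegmentDedup {

    /** Hypothetical stand-in for a fetched document; not a Nutch class. */
    static final class Doc {
        final String url;
        final String content;

        Doc(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Returns the hex-encoded MD5 digest of the given content. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Walks the segments in order and keeps a document only if neither its
     * URL nor its content hash has been seen before, in any segment.
     * Keeping the first copy is an arbitrary choice for this sketch.
     */
    @SafeVarargs
    static List<Doc> dedup(List<Doc>... segments) {
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<Doc> kept = new ArrayList<>();
        for (List<Doc> segment : segments) {
            for (Doc d : segment) {
                String hash = md5Hex(d.content);
                if (seenUrls.contains(d.url) || seenHashes.contains(hash)) {
                    continue; // duplicate by URL or by content
                }
                seenUrls.add(d.url);
                seenHashes.add(hash);
                kept.add(d);
            }
        }
        return kept;
    }
}
```

Calling dedup(segment1, segment2) would then return one combined
list with the cross-segment duplicates removed. For re-fetched
pages one would probably want to keep the newest copy rather
than the first one.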

Is that done in IndexMerger.java? I didn't see any
dedup logic there, or even in
org.apache.lucene.index.IndexWriter.

Does that mean duplicate documents are acceptable
across multiple segments?

Thanks,

Michael Ji


	
		