Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/17 01:57:33 UTC

[Nutch Wiki] Update of "bin/nutch dedup" by RobPettengill

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_dedup

New page:
dedup is an alias for net.nutch.indexer.!DeleteDuplicates

Deletes duplicate documents from a set of Lucene indexes. Two documents are considered duplicates if they have either the same content (compared by MD5 hash) or the same URL.
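The duplicate rule above can be illustrated with a minimal sketch. This is not the !DeleteDuplicates implementation, and it does not touch Lucene: the function name find_duplicates, the (url, content) tuple layout, and the example.com URLs are all hypothetical. It also assumes the first copy seen is the one kept; which copy the real tool retains is not stated on this page.

```python
import hashlib


def find_duplicates(docs):
    """Return the indices of documents that duplicate an earlier one.

    docs is a list of (url, content) tuples (a hypothetical layout).
    A document counts as a duplicate if an earlier kept document has
    the same URL or the same MD5 hash of its content.
    """
    seen_urls = set()
    seen_hashes = set()
    duplicates = []
    for index, (url, content) in enumerate(docs):
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            # Assumption: keep the first copy, flag later ones.
            duplicates.append(index)
        else:
            seen_urls.add(url)
            seen_hashes.add(digest)
    return duplicates


docs = [
    ("http://example.com/a", "hello world"),
    ("http://example.com/b", "hello world"),   # same content as index 0
    ("http://example.com/a", "changed text"),  # same URL as index 0
    ("http://example.com/c", "unique page"),
]
print(find_duplicates(docs))  # prints [1, 2]
```

Here index 1 is flagged because its content hash matches index 0, and index 2 is flagged because its URL matches index 0, even though its content differs.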

Usage: bin/nutch net.nutch.indexer.!DeleteDuplicates (-local | -ndfs <namenode:port>) [-workingdir <workingdir>] <segmentsDir>

[CommandLineOptions]