Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/17 02:28:37 UTC

[Nutch Wiki] Update of "bin/nutch mergesegs" by RobPettengill

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_mergesegs

New page:
mergesegs is an alias for net.nutch.tools.!SegmentMergeTool

This class cleans up accumulated segment data and merges it into a single segment (or, optionally, several segments) that contains no duplicates.

The only prerequisite for correct operation is a set of already fetched segments; they do not have to contain parsed content, since only the fetcher output is required. This tool does not use DeleteDuplicates, but creates its own "master" index of all pages in all segments. It then walks sequentially through this index and keeps only the most recent version of each page for every unique value of URL or hash.
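
The selection step described above can be illustrated with a minimal sketch. It is not the tool's actual code: the PageEntry type, its field names, and the in-memory maps are assumptions made for illustration (the real tool builds an on-disk "master" index from fetcher output), and applying the URL pass before the hash pass is one possible reading of "url or hash".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: keeps the newest entry per URL, then the newest per hash.
class MergeDedupSketch {

    // Hypothetical record of one fetched page; not a Nutch class.
    static final class PageEntry {
        final String url;
        final String hash;
        final long fetchTime;

        PageEntry(String url, String hash, long fetchTime) {
            this.url = url;
            this.hash = hash;
            this.fetchTime = fetchTime;
        }
    }

    // Returns the entry with the later fetch time (the first one on a tie).
    static PageEntry newer(PageEntry a, PageEntry b) {
        return a.fetchTime >= b.fetchTime ? a : b;
    }

    static List<PageEntry> dedup(List<PageEntry> allEntries) {
        // Pass 1: for every unique URL keep only the most recent version.
        Map<String, PageEntry> byUrl = new HashMap<>();
        for (PageEntry e : allEntries) {
            byUrl.merge(e.url, e, MergeDedupSketch::newer);
        }
        // Pass 2: among the survivors, keep the most recent page per hash.
        Map<String, PageEntry> byHash = new HashMap<>();
        for (PageEntry e : byUrl.values()) {
            byHash.merge(e.hash, e, MergeDedupSketch::newer);
        }
        // Sort by URL so the result does not depend on HashMap ordering.
        List<PageEntry> result = new ArrayList<>(byHash.values());
        result.sort(Comparator.comparing((PageEntry e) -> e.url));
        return result;
    }

    public static void main(String[] args) {
        List<PageEntry> input = new ArrayList<>();
        input.add(new PageEntry("http://a/", "h1", 100L));
        input.add(new PageEntry("http://a/", "h2", 200L)); // newer fetch of the same URL
        input.add(new PageEntry("http://b/", "h2", 150L)); // same hash, older fetch
        for (PageEntry e : dedup(input)) {
            System.out.println(e.url + " " + e.hash + " " + e.fetchTime);
        }
    }
}
```

In this sketch an older fetch of a URL is dropped in the first pass, and a page whose hash matches a more recently fetched page is dropped in the second.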

If some of the input segments are corrupted, this tool will attempt to repair them using the net.nutch.segment.!SegmentReader.fixSegment(!NutchFileSystem, File, boolean, boolean, boolean, boolean) method.

The output segment can optionally be split on the fly into several segments of fixed length.
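
As a rough illustration of splitting on the fly, the sketch below starts a new output group whenever the current one reaches a maximum entry count. The class and method names are hypothetical and plain in-memory lists stand in for output segments; the real tool writes segment directories rather than lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: groups a stream of entries into chunks of at most 'max' items.
class SegmentSplitSketch {

    static <T> List<List<T>> split(Iterable<T> entries, int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive");
        }
        List<List<T>> segments = new ArrayList<>();
        List<T> current = null;
        for (T entry : entries) {
            // Open a new output "segment" when none exists yet or the current one is full.
            if (current == null || current.size() == max) {
                current = new ArrayList<>();
                segments.add(current);
            }
            current.add(entry);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> entries = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            entries.add(i);
        }
        // Five entries with max = 2 yield three groups; the last one is not full.
        System.out.println(split(entries, 2));
    }
}
```

Because entries are consumed one at a time, nothing needs to be known about the total count in advance; in this sketch only the final group can hold fewer than 'max' entries.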

The newly created segment(s) can then optionally be indexed, so that they can either be merged with more new segments or used for searching as they are.

Old segments may optionally be removed, because all needed data has already been copied to the new merged segment. NOTE: this tool will also remove all corrupted input segments, which are not usable anyway. However, this option may be dangerous if you inadvertently included non-segment directories as input.

Instead of following the manual procedures, you may want to run SegmentMergeTool with all options turned on, i.e. to merge the segments into the output segment(s), index them, and then delete the original segment data.

Usage: bin/nutch net.nutch.tools.!SegmentMergeTool (-local | -nfs ...) (-dir <input_segments_dir> | seg1 seg2 ...) [-o <output_segments_dir>] [-max count] [-i] [-ds]
  -dir <input_segments_dir>
      path to the directory containing the input segments
  seg1 seg2 ...
      individual paths to input segments
  -o <output_segments_dir>
      (optional) path to the directory which will contain the output segment(s).
      NOTE: If not present, the original segments path will be used.
  -max count
      (optional) output multiple segments, each with at most 'count' entries
  -i
      (optional) index the output segment when finished merging
  -ds
      (optional) delete the original input segments when finished