Posted to dev@nutch.apache.org by radu mateescu <rb...@gmail.com> on 2005/10/19 04:18:14 UTC

Map-reduce based SegmentReader

Hello,
 Attached is the simplified version of SegmentReader, implemented with map-reduce.
 Syntax: ./nutch org.apache.nutch.crawl.SegmentReader segment
 It creates a segdump directory under the segment directory. segdump holds all
the individual dump files, along with one large file obtained by concatenating
the individual pieces. The large file takes its name from the
segment.dump.filename property (default: dump).
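
 For readers who want to picture the concatenation step, here is a minimal
sketch in Python. It is not the attached code: the function name
concatenate_dumps and the rule "every other file in the directory is a piece,
taken in sorted name order" are assumptions made for illustration only.

```python
import shutil
from pathlib import Path


def concatenate_dumps(segdump_dir, dump_name="dump"):
    """Join the individual dump pieces in segdump_dir into one file.

    Assumption: every regular file in the directory other than the
    output file is a piece, and pieces are joined in sorted name order.
    """
    segdump = Path(segdump_dir)
    target = segdump / dump_name
    # Collect the pieces before opening the target so it never reads itself.
    pieces = sorted(
        p for p in segdump.iterdir() if p.is_file() and p.name != dump_name
    )
    with target.open("wb") as out:
        for piece in pieces:
            with piece.open("rb") as src:
                shutil.copyfileobj(src, out)
    return target
```

 The dump_name argument stands in for the value of segment.dump.filename.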
 The structure of each dumped record is:
Recno::
CrawlDatum::
Content::
ParseData::
ParseText::
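
 As an illustration of the record layout above, the following Python sketch
writes one record with the five labels in the listed order. Only the labels
come from this message; the function name format_record, the single space
after "::", and the one-field-per-line layout are assumptions, and the real
dump may format the values differently.

```python
# Field labels in the order given in the record structure above.
FIELDS = ("Recno", "CrawlDatum", "Content", "ParseData", "ParseText")


def format_record(recno, crawl_datum, content, parse_data, parse_text):
    """Render one record as text, one 'Label:: value' line per field.

    Assumption: each value fits on the line after its label; the actual
    SegmentReader output may spread a value over several lines.
    """
    values = (recno, crawl_datum, content, parse_data, parse_text)
    lines = [f"{name}:: {value}" for name, value in zip(FIELDS, values)]
    return "\n".join(lines) + "\n"
```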
 Comments are welcome.
 Thanks,
Radu