Posted to user@nutch.apache.org by POIRIER David <DP...@cross-systems.com> on 2008/05/30 10:47:20 UTC

fetching and parsing

Hello,

I have a fairly big crawl (50,000 documents) that I'd like to re-parse
without actually having to re-fetch it. I tried going segment by
segment. Let's say we have the following segment:

nutch/crawl-xyz/20080516162726

It contains the following directories:

nutch/crawl-xyz/20080516162726/content
nutch/crawl-xyz/20080516162726/crawl_fetch
nutch/crawl-xyz/20080516162726/crawl_generate
nutch/crawl-xyz/20080516162726/crawl_parse
nutch/crawl-xyz/20080516162726/parse_data
nutch/crawl-xyz/20080516162726/parse_text

I first renamed the last 3 directories to something else, the idea being
to trick the ./bin/nutch parse command into regenerating them. I then
launched it:

./bin/nutch parse crawl-xyz/20080516162726
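
Spelled out, the full sequence I ran for that segment was roughly the
following (the .bak suffix is just an example; any name the parse
command won't recognize should do):

mv crawl-xyz/20080516162726/crawl_parse crawl-xyz/20080516162726/crawl_parse.bak
mv crawl-xyz/20080516162726/parse_data crawl-xyz/20080516162726/parse_data.bak
mv crawl-xyz/20080516162726/parse_text crawl-xyz/20080516162726/parse_text.bak
./bin/nutch parse crawl-xyz/20080516162726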

It seemed to work: the 3 directories were recreated. But when I
checked the content of the crawl_parse directory, the size of the
generated part-00000 file was ridiculous: 1 KB, compared to the
original one's 350 KB.
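
To compare more than raw file sizes, I suppose the segment reader could
also list how many entries were actually parsed; I'm assuming here that
the readseg command is available in this Nutch version:

./bin/nutch readseg -list crawl-xyz/20080516162726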

I guess I did something wrong...

My objective is actually fairly simple: force the execution of one
homemade parse plugin (it implements HtmlParseFilter) on all the stored
fetched data without, as I said, re-fetching everything. I know how to
take care of the rest to rebuild the index.
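
For reference, the shape of my plugin is roughly the following (the
class name and metadata key are made up; I'm on a 0.9-era API where
filter() returns a Parse, while later versions return a ParseResult
instead):

package com.example.nutch; // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Sketch of the filter; the real plugin extracts something more useful.
public class MyHtmlParseFilter implements HtmlParseFilter {

  private Configuration conf;

  // Invoked for every page handled by the HTML parser, after the core
  // parse has produced the DOM fragment.
  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Walk the DOM (doc) or inspect the raw bytes (content), then record
    // the result in the parse metadata so it survives into the index step.
    parse.getData().getParseMeta().set("myfilter.processed", "true");
    return parse;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The plugin is listed in the plugin.includes property of
conf/nutch-site.xml, so the parse job should be picking it up.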

Is this actually possible?

Thank you, and have a good weekend!

David