Posted to user@nutch.apache.org by POIRIER David <DP...@cross-systems.com> on 2008/05/30 10:47:20 UTC
fetching and parsing
Hello,
I have a fairly big crawl (50,000 documents) that I'd like to re-parse
without having to re-fetch it. I tried going segment by segment. Let's
say we have the following segment:
nutch/crawl-xyz/20080516162726
It contains the following directories:
nutch/crawl-xyz/20080516162726/content
nutch/crawl-xyz/20080516162726/crawl_fetch
nutch/crawl-xyz/20080516162726/crawl_generate
nutch/crawl-xyz/20080516162726/crawl_parse
nutch/crawl-xyz/20080516162726/parse_data
nutch/crawl-xyz/20080516162726/parse_text
I first renamed the last 3 directories to something else, the idea being
to trick the ./bin/nutch parse command into regenerating them. I then
launched it:
./bin/nutch parse ./crawl-xyz/segments/20080516162752
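To make the renaming step concrete, here is a sketch of what I did, run
against a throwaway directory layout mirroring the segment above (the names
are illustrative, and the actual nutch invocation is left commented out since
it needs a real crawl):

```shell
# Build a throwaway segment layout like the one listed above.
seg=$(mktemp -d)/20080516162726
for d in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
  mkdir -p "$seg/$d"
done

# Move the three parse outputs aside so the parse job would regenerate them:
for d in crawl_parse parse_data parse_text; do
  mv "$seg/$d" "$seg/$d.bak"
done

# Then re-run parsing over the stored content:
# ./bin/nutch parse "$seg"
```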
It seemed to work: the 3 directories were reconstructed. But when I
checked the contents of the crawl_parse directory, the generated
part-00000 file was ridiculously small (1k) compared to the original
(350k).
I guess I did something wrong...
My objective is actually fairly simple: to force the execution of one
homemade parse plugin (it implements HtmlParseFilter) on all the stored
fetched data without, as I said, re-fetching everything. I know how to
take care of the rest to reconstruct the index.
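If this is possible, I imagine looping over every segment rather than doing
it by hand. A dry-run sketch, assuming the standard crawl/segments layout
(segment names here are illustrative; drop the leading echo to actually
launch the parse job):

```shell
# Throwaway crawl layout with two fake segments (names illustrative).
crawl=$(mktemp -d)
mkdir -p "$crawl/segments/20080516162726" "$crawl/segments/20080516162752"

# Dry run: print the command for each segment instead of executing it.
for seg in "$crawl"/segments/*; do
  echo ./bin/nutch parse "$seg"
done
```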
Is this actually possible?
Thank you, and have a good weekend!
David