Posted to user@nutch.apache.org by Euan Clark <eu...@nzs.com> on 2009/01/22 04:33:43 UTC
Extracting homepage content
Hi All,
I have a list of around 100k {URL, segment} pairs that I am certain exist
in the crawl set. I want to extract only the page content for those URLs.
Looked at:

nutch readseg -get <segment dir> "<url>" -nofetch -nogenerate -noparse
  -noparsedata -noparsetext

(all the other segment parts suppressed, so only the content entry gets
dumped)
This is taking roughly 5 seconds per URL against a content dir of around
250MB. It takes the same time whether the segment sits in /dev/shm (RAM)
or on disk, so it looks like per-invocation process overhead; I'm
guessing JVM initialisation.
At 5 seconds each, 100k URLs extrapolates out to a run time of around
139 hours.
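
To amortise the JVM startup I'm considering opening the segment's
content MapFiles once and doing all the lookups in a single process.
A rough, untested sketch (the BatchContentReader class name and the
UTF8 key type are my guesses for the 0.9 segment layout; later Nutch
versions use Text keys):

// Rough, untested sketch: amortise JVM startup by opening the segment's
// content MapFiles once and looking up all URLs in a single process.
// Class name, args, and the UTF8 key type are my assumptions for the
// Nutch 0.9 layout (<segment>/content/part-*).
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.mapred.lib.MapFileOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class BatchContentReader {
  public static void main(String[] args) throws Exception {
    // args[0] = segment dir, args[1] = file with one URL per line.
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path contentDir = new Path(args[0], Content.DIR_NAME);

    // Open every part file once and reuse the readers for all lookups.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, contentDir, conf);

    BufferedReader urls = new BufferedReader(new FileReader(args[1]));
    for (String url; (url = urls.readLine()) != null; ) {
      UTF8 key = new UTF8(url);
      Content content = new Content();
      // Try each part in turn; MapFile.Reader.get is an index-assisted
      // random-access seek, so nothing close to the full 250MB is read.
      for (MapFile.Reader reader : readers) {
        if (reader.get(key, content) != null) {
          System.out.println(url + "\t" + content.getContent().length + " bytes");
          break;
        }
      }
    }
    urls.close();
    for (MapFile.Reader r : readers) r.close();
  }
}

That should cut the per-URL cost to a few MapFile seeks instead of a
full process launch.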
Is there a faster way to do random-access retrieval of page content?
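Or would one sequential pass over the whole content dir beat 100k random
seeks, given that 250MB streams quickly? Something like this (again
untested, same hypothetical setup and key-type assumption as above):

// Rough, untested sketch: one streaming pass over <segment>/content,
// keeping only the URLs in the wanted set.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.mapred.lib.MapFileOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ScanContent {
  public static void main(String[] args) throws Exception {
    // Load the 100k wanted URLs into memory for O(1) membership tests.
    Set<String> wanted = new HashSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[1]));
    for (String line; (line = in.readLine()) != null; ) wanted.add(line);
    in.close();

    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(
        fs, new Path(args[0], Content.DIR_NAME), conf);

    UTF8 key = new UTF8();
    Content value = new Content();
    for (MapFile.Reader reader : readers) {
      // Stream every record; 100k hash probes cost far less than
      // 100k process launches.
      while (reader.next(key, value)) {
        if (wanted.contains(key.toString())) {
          System.out.println(key + "\t" + value.getContent().length + " bytes");
        }
      }
      reader.close();
    }
  }
}
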
(Nutch 0.9)