Posted to user@nutch.apache.org by Euan Clark <eu...@nzs.com> on 2009/01/22 04:33:43 UTC

Extracting homepage content

Hi All,

I have a list of around 100k {URL, segment} pairs that I am certain exist in 
the crawlset.

I want to extract only the page content of particular URLs.

Looked at:

nutch readseg -get <segment dir> "<url>" -nofetch -nogenerate -noparse 
-noparsedata -noparsetext

This is taking roughly 5 seconds per URL within a content dir of around 
250MB.

It takes the same amount of time whether the segment is sitting in /dev/shm 
(RAM) or on disk, so it looks like per-invocation process overhead.
I'm guessing this is mostly JVM start-up.

At roughly 5 seconds each, 100k URLs extrapolates out to around 500,000 
seconds, i.e. close to 140 hours of run time.

Is there a faster way to do random access grabbing of page content?
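
For what it's worth, something along the lines of the sketch below is what I 
have in mind: open the segment's content MapFiles once and look up each URL 
inside a single JVM, so the start-up cost is paid only once. This is untested 
and just illustrative; it assumes the Nutch 0.9 on-disk layout 
(<segment>/content as MapFiles keyed by the page URL) with the Nutch and 
Hadoop jars on the classpath, and the class and argument names are mine.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Untested sketch: look up many URLs in one JVM instead of running
// "nutch readseg -get" once per URL.
public class ContentLookup {                        // class name is mine
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // args[0] = segment dir, args[1] = file with one URL per line
    Path contentDir = new Path(args[0], "content");

    // Open every part-XXXXX MapFile under <segment>/content just once.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, contentDir, conf);

    BufferedReader urls = new BufferedReader(new FileReader(args[1]));
    String line;
    while ((line = urls.readLine()) != null) {
      Text key = new Text(line.trim());
      Content value = new Content();
      // The entry may live in any part depending on how the fetch was
      // partitioned, so try each reader until the key is found.
      for (int i = 0; i < readers.length; i++) {
        if (readers[i].get(key, value) != null) {
          System.out.println(key + "\t" + new String(value.getContent()));
          break;
        }
      }
    }
    urls.close();
    for (int i = 0; i < readers.length; i++) {
      readers[i].close();
    }
  }
}

Compiled against the jars in Nutch's lib/ directory and run with the segment 
dir and the URL list file as arguments, this should turn the per-URL cost 
into a MapFile lookup rather than a full JVM start.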

(Nutch 0.9)