Posted to user@nutch.apache.org by Zhen Zhen <zh...@cs.dal.ca> on 2006/09/18 19:07:54 UTC

Speed of reading local files

Hi

I have changed the protocol-http plugin so that Nutch reads already-crawled
pages from the local file system instead of from the Internet. (I tried the
FILE:// protocol, but it seemed to me that the interconnection information
among pages was lost.) I have it working now, but it is very slow: the
"fetch" command took 10 minutes on 400 pages, on a 4-CPU box with 4 threads.
I am wondering whether this is normal, because at that rate (1.5 seconds per
page) reading 1 million pages would take about 417 hours on one box, which
is more than 17 days.
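
For reference, the extrapolation above can be checked with a few lines of
Python. This is only a back-of-the-envelope sketch: it assumes the rate
measured on the 400-page run stays constant up to 1 million pages, and the
variable names are just illustrative.

```python
# Measured figures from the fetch run described above.
pages_fetched = 400
elapsed_minutes = 10

# Assumption: throughput stays constant as the page count grows.
seconds_per_page = elapsed_minutes * 60 / pages_fetched

target_pages = 1_000_000
total_hours = target_pages * seconds_per_page / 3600
total_days = total_hours / 24

print(f"{seconds_per_page:.2f} s/page, "
      f"{total_hours:.0f} hours, {total_days:.1f} days")
# prints: 1.50 s/page, 417 hours, 17.4 days
```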

Any suggestions would be appreciated.

Zhen