Posted to user@nutch.apache.org by ashish vyas <ma...@gmail.com> on 2012/03/30 11:58:10 UTC

Nutch on Hadoop cluster

Hi,

I have set up a Hadoop cluster (2 nodes) and am trying to run a Nutch crawl
on it. Currently our application runs the Nutch crawl without Hadoop, and it
takes a long time to crawl. I am trying to improve performance by running the
crawl on a Hadoop cluster. However, I found that the processing time increases
when I run on the cluster compared with pseudo-distributed mode. Also, the
standalone Nutch crawl (without Hadoop) takes less time than the Hadoop run.
Please let me know if there is any benchmark report or other information for
Nutch on a small Hadoop cluster. I have found reports, but they cover
1000-node clusters; I need a performance report for a smaller cluster. It
would be great if anybody could help me with the number of URLs vs. the
number of Hadoop nodes needed to get a performance improvement over a
standalone Nutch crawl.

Regards,
Ashish Vyas