Posted to user@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2013/07/02 04:37:49 UTC

RE: Multiple Nutch jobs on a Hadoop cluster simultaneously

Hi Markus,

I have run into the same problem. I am trying to crawl millions of pages
from a single site with Nutch 2.0. Since Nutch uses only one reducer task
to fetch all pages from the same domain/host, I tried launching multiple
Nutch jobs on the same Hadoop cluster to speed up the crawl. But it seems
that the different jobs generate the same fetchlist. How can I configure
the crawl parameters to achieve my goal? For example, suppose 10 million
pages from the same site are already stored in the table, and I want to
launch two jobs to fetch them in parallel. How can I configure things so
that the first job fetches the first 5 million pages and the second job
fetches the other 5 million?
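
Concretely, I am hoping for something along these lines (a rough sketch
only, assuming the Nutch 2.x GeneratorJob/FetcherJob command-line options
-topN, -batchId, -crawlId and -threads; the batch ids batch-A/batch-B and
the crawl id mycrawl are just placeholders):

  # Generate two disjoint batches; URLs already marked for batch-A should
  # be skipped by the second generate, so each batch would hold ~5M pages.
  bin/nutch generate -topN 5000000 -batchId batch-A -crawlId mycrawl
  bin/nutch generate -topN 5000000 -batchId batch-B -crawlId mycrawl

  # Fetch the two batches as two separate Hadoop jobs, running in parallel.
  bin/nutch fetch batch-A -crawlId mycrawl -threads 50 &
  bin/nutch fetch batch-B -crawlId mycrawl -threads 50 &
  wait

Is something like this the intended way to split the work, or is there a
configuration setting I am missing?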



--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-nutch-jobs-on-a-Hadoop-cluster-simultaneosuly-tp3985889p4074517.html
Sent from the Nutch - User mailing list archive at Nabble.com.