Posted to user@nutch.apache.org by weishenyun <wl...@yahoo.com.cn> on 2013/07/02 04:44:04 UTC

Running multiple nutch jobs to fetch a same site with millions of pages

Hi,

I am trying to crawl millions of pages from a single site with Nutch 2.0. Since
Nutch uses only one reducer task to fetch all the pages from the same
domain/host, I tried launching multiple Nutch jobs on the same Hadoop
cluster to speed up the crawl. But the different jobs all generated the same
fetchlist. How can I configure the crawl to achieve my goal? For example, say
there are 10 million pages from the same site, already stored in the table, and
I want to launch two jobs to fetch them in parallel. How can I configure Nutch
so that the first job fetches the first 5 million pages and the second job
fetches the other 5 million?
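One conceivable way to keep two jobs from overlapping (a sketch only — this is not a built-in Nutch feature, and the `job_index` helper and hashing scheme are hypothetical, e.g. something a per-job custom URLFilter could implement) is to assign each stored URL to exactly one job by a deterministic hash:

```python
import hashlib

def job_index(url: str, num_jobs: int = 2) -> int:
    """Deterministically map a URL to one of num_jobs crawl jobs.

    Each job would then keep only the URLs whose index matches
    its own, so the jobs' fetchlists never overlap.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_jobs

# Ten sample URLs standing in for the stored table.
urls = ["http://example.com/page-%d" % i for i in range(10)]

# Partition them into two disjoint batches, one per job.
batches = {i: [u for u in urls if job_index(u) == i] for i in range(2)}
```

Because the hash depends only on the URL, every job computes the same assignment independently, without any coordination between jobs.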



--
View this message in context: http://lucene.472066.n3.nabble.com/Running-multiple-nutch-jobs-to-fetch-a-same-site-with-millions-of-pages-tp4074523.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Running multiple nutch jobs to fetch a same site with millions of pages

Posted by al...@aim.com.
You can decrease fetcher.server.delay. Another way is to split the storage table and run several instances of Nutch. However, if you do not own the server that hosts the crawled domain, you could be blocked, since frequent requests might be treated as a DoS attack.
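For reference, fetcher.server.delay is set in conf/nutch-site.xml. The value below is only an illustration (lower than the stock 5-second default) and should be tuned to what the target server tolerates:

```xml
<!-- conf/nutch-site.xml: illustrative value only -->
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>Seconds the fetcher waits between successive
  requests to the same server.</description>
</property>
```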

hth.
Alex.
-----Original Message-----
From: weishenyun <wl...@yahoo.com.cn>
To: user <us...@nutch.apache.org>
Sent: Mon, Jul 1, 2013 8:17 pm
Subject: Re: Running multiple nutch jobs to fetch a same site with millions of pages



 

Re: Running multiple nutch jobs to fetch a same site with millions of pages

Posted by weishenyun <wl...@yahoo.com.cn>.
Hi alxsss,

I have tried that. I set -numTasks > 1 and mapred.reduce.tasks > 1, but
still only one reducer task fetches all the pages from the same site.
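That behaviour is expected: when Nutch generates a fetchlist it partitions URLs by host (or by domain or IP, depending on partition.url.mode) so that politeness limits can be enforced per server, which means every URL from one host lands in the same reducer no matter how many reducers run. A rough sketch of that partitioning logic (hypothetical illustration, not Nutch's own code):

```python
import zlib
from urllib.parse import urlparse

def partition_by_host(url: str, num_reducers: int) -> int:
    """Assign a URL to a reducer by hashing only its host, the way
    a by-host partitioner would: every URL from the same host gets
    the same reducer index, regardless of num_reducers."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode("utf-8")) % num_reducers

# 1000 URLs from one site, spread across 10 reducers...
same_site = ["http://example.com/p%d" % i for i in range(1000)]
partitions = {partition_by_host(u, 10) for u in same_site}
# ...all collapse into a single partition, because the hash key is
# the host, not the full URL.
```

So raising mapred.reduce.tasks adds reducers, but for a single-host crawl all of them except one stay idle.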




Re: Running multiple nutch jobs to fetch a same site with millions of pages

Posted by al...@aim.com.
Hi,

Try running more than one reducer by adding the -numTasks option to the fetch command.

hth,
Alex.
-----Original Message-----
From: weishenyun <wl...@yahoo.com.cn>
To: user <us...@nutch.apache.org>
Sent: Mon, Jul 1, 2013 7:44 pm
Subject: Running multiple nutch jobs to fetch a same site with millions of pages

