Posted to user@nutch.apache.org by Dustine Rene Bernasor <du...@thecyberguardian.com> on 2012/05/24 12:57:25 UTC

Multiple nutch jobs on a Hadoop cluster simultaneously

Hello

I was wondering, would it be possible to run multiple Nutch jobs on a
single Hadoop cluster at the same time? For example, I would like to run
two crawls at the same time. Is this possible, provided there is enough
memory, or would it result in conflicts?

RE: Multiple nutch jobs on a Hadoop cluster simultaneously

Posted by weishenyun <wl...@yahoo.com.cn>.
Hi Markus,

I have run into the same problem. I tried to crawl millions of pages from a
single site with Nutch 2.0. Since Nutch will only use one reducer task to
fetch all the pages from the same domain/host, I tried to launch multiple
Nutch jobs on the same Hadoop cluster to speed up the crawl. But it
seems that the different jobs generated the same fetchlist. How can I
configure the crawl parameters to achieve my goal? For example, there are 10
million pages from the same site and they are already stored in the table. I
want to launch two jobs to fetch them in parallel. How can I configure them so
that the first job fetches the first 5 million pages and the second job
fetches the other 5 million?
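
To make the goal concrete: what is being asked for is a stable, disjoint split of the URL set across jobs. The sketch below only illustrates that idea in plain Python; it is not a Nutch API and not a Nutch setting. The names partition_for and split_urls and the example.com URLs are hypothetical.

```python
import hashlib


def partition_for(url: str, num_jobs: int) -> int:
    """Map a URL to a stable job index in [0, num_jobs)."""
    # A content hash keeps the assignment identical across runs and machines.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_jobs


def split_urls(urls, num_jobs):
    """Split URLs into num_jobs disjoint lists, one per job."""
    buckets = [[] for _ in range(num_jobs)]
    for url in urls:
        buckets[partition_for(url, num_jobs)].append(url)
    return buckets


# Hypothetical sample data standing in for the stored pages.
urls = [f"http://example.com/page/{i}" for i in range(10)]
job_a, job_b = split_urls(urls, 2)
```

Because each URL always maps to the same index, two jobs that each take one bucket never fetch the same page. The halves are only roughly equal in size, not exactly 5 million each.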



--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-nutch-jobs-on-a-Hadoop-cluster-simultaneosuly-tp3985889p4074517.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Multiple nutch jobs on a Hadoop cluster simultaneously

Posted by Behnam Nikbakht <be...@gmail.com>.
I had the same problem. You can search for how to run multiple jobs on
Hadoop. I am using multiple pools with the fair scheduler, so that every job
belongs to one pool (this needs some changes in the Nutch code).
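
As a rough illustration of the pool setup described above, here is a small Python sketch that builds a fair-scheduler allocation file with one pool per crawl. It assumes the Hadoop 1.x fair scheduler, where the allocation file has an allocations root with pool elements, and where a job can be assigned to a pool through the mapred.fairscheduler.pool job property. The pool names crawl_a and crawl_b and the limits are made up; check the documentation for your Hadoop version before relying on any of this.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Return fair-scheduler allocation XML for a dict of pool limits."""
    root = ET.Element("allocations")
    for name, limits in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        # Each limit (e.g. minMaps, minReduces) becomes a child element.
        for key, value in limits.items():
            ET.SubElement(pool, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools, one per concurrent crawl.
pools = {
    "crawl_a": {"minMaps": 2, "minReduces": 1},
    "crawl_b": {"minMaps": 2, "minReduces": 1},
}
allocations_xml = build_allocations(pools)
```

Each Nutch job would then need to set its pool when it is submitted, which is presumably the code change mentioned above.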

Re: Multiple nutch jobs on a Hadoop cluster simultaneously

Posted by Dustine Rene Bernasor <du...@thecyberguardian.com>.
Hello

I tried running solrindex on a 3-slave cluster that is currently performing a
fetch. Only 1 slave is crawling; the other 2 seem to be idle (no activity in
the Hadoop log), but the solrindex task is pending. When multiple Nutch jobs
are submitted, does a new job wait until the current job is done?

On 5/24/2012 9:24 PM, Markus Jelsma wrote:
> Hi,
>
> Yes, this is no problem.
>
> Cheers


RE: Multiple nutch jobs on a Hadoop cluster simultaneously

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

Yes, this is no problem.

Cheers
 