Posted to user@nutch.apache.org by Vishal Shah <vi...@rediff.co.in> on 2006/09/04 10:43:00 UTC

# of tasks executed in parallel

Hi,
 
  I am using Nutch 0.9 for crawling. I recollect that
mapred.tasktracker.tasks.maximum can be used to control the max # of
tasks executed in parallel by a tasktracker.
 
  I am running a fetch with the following config:
 
3 machines
 
My mapred-default.xml contains:
 
mapred.map.tasks=13
mapred.reduce.tasks=7
mapred.tasktracker.tasks.maximum=4
 
I ran generate using -numFetchers=12, however while fetching I see that
only 2 tasks are running at a time on each machine (instead of 4).
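
For reference, the settings above can be written in Hadoop's property-file form and sanity-checked with a short script. This is only a minimal sketch: the XML string mirrors the three values quoted in the message, and the helper name `read_props` and the variable names are illustrative, not part of Nutch or Hadoop.

```python
import xml.etree.ElementTree as ET

# Hadoop-style configuration holding the three values quoted in the message.
CONF_XML = """
<configuration>
  <property><name>mapred.map.tasks</name><value>13</value></property>
  <property><name>mapred.reduce.tasks</name><value>7</value></property>
  <property><name>mapred.tasktracker.tasks.maximum</name><value>4</value></property>
</configuration>
"""


def read_props(xml_text):
    """Return {name: value} for every <property> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = read_props(CONF_XML)
machines = 3  # from the message
per_tracker = int(props["mapred.tasktracker.tasks.maximum"])

# Parallel tasks across the cluster if the setting were honoured,
# versus what was actually observed (2 per machine).
expected_parallel = machines * per_tracker
observed_parallel = machines * 2
print(expected_parallel, observed_parallel)
```

The gap between the two printed numbers is the discrepancy being asked about.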
 
Any pointers?
 
-vishal.

RE: # of tasks executed in parallel

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Dennis,

   I am trying to fetch 1M urls at a time. Each machine has similar
settings. I am pretty sure the problems are happening during the parse
phase. I tried using -noParsing option during fetch, and then parsing
using the parse command. The fetch works fine, but the parse stalls and
fails sometimes.

-vishal.

-----Original Message-----
From: Dennis Kubes [mailto:nutch-dev@dragonflymc.com] 
Sent: Friday, September 08, 2006 11:24 PM
To: nutch-user@lucene.apache.org
Subject: Re: # of tasks executed in parallel

How many urls are you fetching and does each machine have the same 
settings as below?

Remember that the number of fetchers is the number of fetcher threads per task
per machine.  So you would be running 2 tasks per machine * 12 threads *
3 machines = 72 fetchers.

Dennis

Vishal Shah wrote:
> Hi,
>  
>   I am using Nutch 0.9 for crawling. I recollect that
> mapred.tasktracker.tasks.maximum can be used to control the max # of
> tasks executed in parallel by a tasktracker.
>  
>   I am running a fetch with the following config:
>  
> 3 machines
>  
> My mapred-default.xml contains:
>  
> mapred.map.tasks=13
> mapred.reduce.tasks=7
> mapred.tasktracker.tasks.maximum=4
>  
> I ran generate using -numFetchers=12, however while fetching I see that
> only 2 tasks are running at a time on each machine (instead of 4).
>  
> Any pointers?
>  
> -vishal.
>
>   


Re: # of tasks executed in parallel

Posted by Dennis Kubes <nu...@dragonflymc.com>.
How many urls are you fetching and does each machine have the same 
settings as below?

Remember that the number of fetchers is the number of fetcher threads per task
per machine.  So you would be running 2 tasks per machine * 12 threads *
3 machines = 72 fetchers.
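
A quick check of the multiplication above, as a minimal sketch. It takes the figures from the thread and follows the reading given here, that the -numFetchers value counts threads per task; that reading is an assumption of this message rather than something confirmed elsewhere in the thread, and the variable names are illustrative.

```python
# Figures taken from the thread; variable names are illustrative.
tasks_per_machine = 2   # tasks observed running at once on each machine
threads_per_task = 12   # the -numFetchers value, read here as threads per task
machines = 3

# Total concurrent fetchers under that reading.
total_fetchers = tasks_per_machine * threads_per_task * machines
print(total_fetchers)  # 72
```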

Dennis

Vishal Shah wrote:
> Hi,
>  
>   I am using Nutch 0.9 for crawling. I recollect that
> mapred.tasktracker.tasks.maximum can be used to control the max # of
> tasks executed in parallel by a tasktracker.
>  
>   I am running a fetch with the following config:
>  
> 3 machines
>  
> My mapred-default.xml contains:
>  
> mapred.map.tasks=13
> mapred.reduce.tasks=7
> mapred.tasktracker.tasks.maximum=4
>  
> I ran generate using -numFetchers=12, however while fetching I see that
> only 2 tasks are running at a time on each machine (instead of 4).
>  
> Any pointers?
>  
> -vishal.
>
>