Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2012/02/10 20:24:23 UTC

number of map tasks for a fetch job

During a fetch job, the same situation keeps happening: the majority of 
my map tasks (and by that I mean 99% of them) finish in, let's say, 
2 hours, and then the entire cluster waits another 2 hours for the 1 or 
2 remaining map tasks to finish, because they still happen to have URLs 
in their queues. The only way I see out of this situation is to have a 
lot more map tasks for the fetch job than currently exist. Now I have 
2 questions.
First, is this a reasonable solution, or are there other ways to go 
about dealing with this situation?

Second, is there a way to increase the number of map tasks specifically 
for the fetch job without affecting other tasks (parse, updatedb, ...)? 
I mean without increasing the value of mapred.map.tasks.

Thanks,

-- 
Kaveh Minooie

www.plutoz.com

Re: number of map tasks for a fetch job

Posted by kaveh minooie <ka...@plutoz.com>.
I understand that it is a matter of the distribution of URLs, but I was 
hoping that increasing the number of map tasks would give each one of 
them a smaller queue, and that this would result in better parallelism 
than what I have right now.

Also, limiting the number of URLs per host is not really an option for 
me, since I need to get them all.

On 02/10/2012 11:29 AM, Julien Nioche wrote:
> On 10 February 2012 19:24, kaveh minooie<ka...@plutoz.com>  wrote:
>
>> So I during a fetch job the situation that keeps happening is that the
>> majority of my map tasks (and by that I mean 99% of them)finish lets say
>> for example in 2 hours and then the entire cluster waits another 2 hours
>> for this 1 or 2 remaining map task to finish cause they still happened to
>> have stuff in their queue. the only way I see out of this situation is to
>> have a lot more map tasks for fetch job than currently exist.
>
>
> the number of tasks is irrelevant. It's the distribution per domain /
> hostname that usually matters
>
>
>
>> Now I have 2 questions.
>> first is this a reasonable solution or is there other ways to go about
>> dealing with this situation?
>>
>
> limit the number of URLS per hosts / domain and use fetcher.timelimit.mins
> to make sure that it finishes within a set timeframe
>
>
> Julien
>

-- 
Kaveh Minooie

www.plutoz.com

Re: number of map tasks for a fetch job

Posted by Julien Nioche <li...@gmail.com>.
On 10 February 2012 19:24, kaveh minooie <ka...@plutoz.com> wrote:

> So I during a fetch job the situation that keeps happening is that the
> majority of my map tasks (and by that I mean 99% of them)finish lets say
> for example in 2 hours and then the entire cluster waits another 2 hours
> for this 1 or 2 remaining map task to finish cause they still happened to
> have stuff in their queue. the only way I see out of this situation is to
> have a lot more map tasks for fetch job than currently exist.


The number of tasks is irrelevant. It's the distribution of URLs per
domain / hostname that usually matters.
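
A rough way to see why more map tasks may not help is the toy model below. It is only a sketch, not Nutch code. It assumes that all URLs of a host end up in the same fetch task, that each host is fetched politely with a fixed delay between requests, and that different hosts are fetched in parallel. The function name, the host names, and the numbers (5 second delay, 10 threads, 1 second per request) are all made up for illustration.

```python
from collections import defaultdict
import zlib


def task_durations(urls_per_host, num_tasks, delay_s=5.0, threads=10, fetch_s=1.0):
    """Estimate how long each fetch map task runs, in seconds.

    Toy model (assumptions, not Nutch code):
      - every URL of a host lands in the same task (hash of the host name);
      - a host is fetched politely, one URL every `delay_s` seconds;
      - different hosts are fetched in parallel by `threads` threads,
        each request taking `fetch_s` seconds.
    """
    hosts_in_task = defaultdict(list)
    for host, n_urls in urls_per_host.items():
        task = zlib.crc32(host.encode("utf-8")) % num_tasks
        hosts_in_task[task].append(n_urls)

    durations = []
    for task in range(num_tasks):
        counts = hosts_in_task.get(task, [])
        if not counts:
            durations.append(0.0)
            continue
        # The largest host queue is drained sequentially.
        politeness_bound = max(counts) * delay_s
        # All URLs of the task share the available threads.
        throughput_bound = sum(counts) * fetch_s / threads
        durations.append(max(politeness_bound, throughput_bound))
    return durations


# Hypothetical fetchlist: 200 small hosts and one host with 1440 URLs.
hosts = {f"small{i}.example": 10 for i in range(200)}
hosts["big.example"] = 1440

for num_tasks in (10, 100, 1000):
    print(num_tasks, max(task_durations(hosts, num_tasks)))
```

Under these assumptions the slowest task is always the one holding the large host, whatever the number of tasks, because that host's queue cannot be split across tasks.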



> Now I have 2 questions.
> first is this a reasonable solution or is there other ways to go about
> dealing with this situation?
>

Limit the number of URLs per host / domain and use fetcher.timelimit.mins
to make sure that it finishes within a set timeframe.
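
The effect of these two settings can be sketched with the same kind of toy arithmetic. This is an illustration under assumptions, not Nutch code: the function names, host names, cap of 100 URLs, and 5 second delay are hypothetical. In Nutch 1.x the per-host cap is, as far as I know, set with generate.max.count together with generate.count.mode, but please check nutch-default.xml for your version. The sketch also assumes that URLs left out of a round stay in the crawldb and can be generated again in a later round, so a cap spreads a large host over several rounds rather than dropping its URLs.

```python
def cap_per_host(urls_per_host, max_count):
    """Split a fetchlist into what is fetched now and what is left for later.

    Returns (selected, deferred), both mapping host -> number of URLs.
    """
    selected = {h: min(n, max_count) for h, n in urls_per_host.items()}
    deferred = {h: n - max_count for h, n in urls_per_host.items() if n > max_count}
    return selected, deferred


def longest_queue_seconds(urls_per_host, delay_s=5.0, timelimit_mins=None):
    """Time to drain the largest host queue, optionally cut off by a time limit.

    `delay_s` is an assumed politeness delay between requests to one host.
    `timelimit_mins` plays the role of fetcher.timelimit.mins in this model.
    """
    longest = max(urls_per_host.values()) * delay_s
    if timelimit_mins is not None:
        longest = min(longest, timelimit_mins * 60.0)
    return longest


# Hypothetical fetchlist with one dominant host.
fetchlist = {"big.example": 1440, "small.example": 10}

selected, deferred = cap_per_host(fetchlist, max_count=100)
print(selected, deferred)
print(longest_queue_seconds(fetchlist))                      # no cap, no limit
print(longest_queue_seconds(selected))                       # capped per host
print(longest_queue_seconds(fetchlist, timelimit_mins=60))   # time limit only
```

Either setting bounds the length of a round. The deferred URLs are then picked up over more generate / fetch rounds, which is the trade-off for someone who needs every URL.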


Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble