Posted to user@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2012/02/10 20:24:23 UTC
number of map tasks for a fetch job
During a fetch job, the same situation keeps happening: the
majority of my map tasks (and by that I mean 99% of them) finish in, let's
say, 2 hours, and then the entire cluster waits another 2 hours
for the 1 or 2 remaining map tasks to finish because they still happen
to have URLs in their queues. The only way I see out of this situation
is to have a lot more map tasks for the fetch job than there currently are. Now
I have 2 questions.
First, is this a reasonable solution, or are there other ways to go about
dealing with this situation?
Second, is there a way to increase the number of map tasks
specifically for the fetch job without affecting other jobs (parse,
updatedb, ...)? I mean, without increasing the value of mapred.map.tasks?
Thanks,
--
Kaveh Minooie
www.plutoz.com
Re: number of map tasks for a fetch job
Posted by kaveh minooie <ka...@plutoz.com>.
I understand that it is a matter of the distribution of URLs, but I was
hoping that increasing the number of map tasks would give each one of
them a smaller queue, and that this would result in better
parallelism than what I have right now.
Limiting the number of URLs per host is not really an option for me,
since I need to get them all.
On 02/10/2012 11:29 AM, Julien Nioche wrote:
> On 10 February 2012 19:24, kaveh minooie <ka...@plutoz.com> wrote:
>
>> During a fetch job, the same situation keeps happening: the
>> majority of my map tasks (and by that I mean 99% of them) finish in, let's
>> say, 2 hours, and then the entire cluster waits another 2 hours
>> for the 1 or 2 remaining map tasks to finish because they still happen to
>> have URLs in their queues. The only way I see out of this situation is to
>> have a lot more map tasks for the fetch job than there currently are.
>
>
> The number of tasks is irrelevant. It's the distribution of URLs per domain /
> hostname that usually matters.
>
>
>
>> Now I have 2 questions.
>> First, is this a reasonable solution, or are there other ways to go about
>> dealing with this situation?
>>
>
> Limit the number of URLs per host / domain and use fetcher.timelimit.mins
> to make sure that it finishes within a set timeframe.
>
>
> Julien
>
--
Kaveh Minooie
www.plutoz.com
Re: number of map tasks for a fetch job
Posted by Julien Nioche <li...@gmail.com>.
On 10 February 2012 19:24, kaveh minooie <ka...@plutoz.com> wrote:
> During a fetch job, the same situation keeps happening: the
> majority of my map tasks (and by that I mean 99% of them) finish in, let's
> say, 2 hours, and then the entire cluster waits another 2 hours
> for the 1 or 2 remaining map tasks to finish because they still happen to
> have URLs in their queues. The only way I see out of this situation is to
> have a lot more map tasks for the fetch job than there currently are.
The number of tasks is irrelevant. It's the distribution of URLs per domain /
hostname that usually matters.
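
A rough way to see why the task count does not help is sketched below. It is
an illustration, not Nutch code. It assumes that all URLs of a host end up in
the same map task and that a host is fetched one URL at a time with a fixed
politeness delay. The function name, the example hosts and the 5 second delay
are made up for the example.

```python
import zlib


def slowest_task_lower_bound(host_counts, num_tasks, delay_s=5.0):
    """Lower bound, in seconds, on the duration of the slowest map task.

    Assumptions (not taken from Nutch source): every URL of a host lands in
    the same task, and a host is fetched sequentially with delay_s seconds
    between requests, so a task cannot finish before its largest host does.
    """
    per_task = [0.0] * num_tasks
    for host, count in host_counts.items():
        # Stand-in for hash partitioning by host name.
        task = zlib.crc32(host.encode("utf-8")) % num_tasks
        per_task[task] = max(per_task[task], count * delay_s)
    return max(per_task)


# Hypothetical fetch list: one large host and 200 small ones.
hosts = {"big.example.com": 2880}
hosts.update({f"small{i}.example.com": 20 for i in range(200)})

for n in (10, 100, 1000):
    # Prints 4.0 hours for every task count.
    print(n, slowest_task_lower_bound(hosts, n) / 3600.0)
```

Under these assumptions the bound is set by the largest host alone (2880 URLs
at 5 seconds each is 4 hours), so adding map tasks only shortens the tasks
that were already finishing early.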
> Now I have 2 questions.
> First, is this a reasonable solution, or are there other ways to go about
> dealing with this situation?
>
Limit the number of URLs per host / domain and use fetcher.timelimit.mins
to make sure that it finishes within a set timeframe.
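
The effect of a per-host cap can be sketched as follows. This is a standalone
illustration, not Nutch code, and the function name and example URLs are
hypothetical. It assumes that URLs left out of one fetch list because of the
cap remain eligible for a later generate/fetch round, so the cap spreads a
large host over several rounds rather than dropping its URLs.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rounds_with_host_cap(urls, max_per_host):
    """Split urls into successive rounds with at most max_per_host per host.

    Hypothetical helper: round k receives the k-th batch of max_per_host
    URLs of each host, in input order.
    """
    seen = defaultdict(int)  # URLs already assigned, per host
    rounds = []
    for url in urls:
        host = urlparse(url).netloc
        idx = seen[host] // max_per_host
        seen[host] += 1
        while len(rounds) <= idx:
            rounds.append([])
        rounds[idx].append(url)
    return rounds


# Hypothetical input: 5 URLs on one host, 1 on another, cap of 2 per host.
urls = [f"http://big.example.com/page{i}" for i in range(5)]
urls.append("http://small.example.com/")
rounds = rounds_with_host_cap(urls, 2)
print([len(r) for r in rounds])
```

With a cap in place, no single round contains more than max_per_host URLs of
any host, which bounds how long the slowest task of that round can run. The
time limit then acts as a safety net for hosts that respond slowly.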
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble