Posted to common-user@hadoop.apache.org by Johan Oskarsson <jo...@oskarsson.nu> on 2008/01/15 18:15:50 UTC

Hadoop overhead

Hi.

I believe someone posted about this a while back, but it's worth 
mentioning again.

I just ran a job on our 10-node cluster where the input data was
~70 empty sequence files; with our default settings this ran ~200
mappers and ~70 reducers.

The job took almost exactly two minutes to finish.

How can we reduce this overhead?

* Pick the number of mappers and reducers in a more dynamic way,
   depending on the size of the input? (See the rough sketch below.)
* JVM reuse: one JVM per job instead of one per task?
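
Something like this is what I mean by the first point. It is only a rough,
untested sketch against the old mapred API (and it assumes a Hadoop version
where FileSystem.listStatus is available); the class name and the
bytes-per-reducer constant are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ReducerSizing {

    // Arbitrary target: roughly one reducer per GB of input.
    private static final long BYTES_PER_REDUCER = 1L << 30;

    public static void configure(JobConf conf, Path inputDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        long totalBytes = 0;
        for (FileStatus file : fs.listStatus(inputDir)) {
            totalBytes += file.getLen();
        }
        // Empty input (like our ~70 empty sequence files) collapses to a
        // single reducer instead of ~70.
        int reducers = (int) Math.max(1, totalBytes / BYTES_PER_REDUCER);
        conf.setNumReduceTasks(reducers);
    }
}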

Any other ideas?

/Johan

Re: Hadoop overhead

Posted by Ted Dunning <td...@veoh.com>.

There is considerable and very understandable confusion about map
tasks, mappers and input splits.

It is true that for large inputs the input should ultimately be split into
chunks so that each core you have processes 10-100 pieces of data.  To get
there, however, you only need one map task per core, plus perhaps an extra
one or so to fill in any scheduling cracks.  The input splits are good
enough at dividing up the input data that you don't usually have to worry
about this unless your data is for some reason unsplittable.  In that case,
you have to make sure you have enough input files to provide sufficient
parallelism.  Then you just set the limit on the number of map tasks per
machine in hadoop-site.xml to a good level (I would recommend about 10 for
you).
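
Concretely, for your cluster that would look something like the sketch
below. The numbers are just placeholders taken from this thread, the
per-machine cap itself lives in hadoop-site.xml on each tasktracker, and
the JobConf calls are only the per-job side of it:

import org.apache.hadoop.mapred.JobConf;

public class TaskCounts {

    // In hadoop-site.xml on every tasktracker (the hard per-machine caps;
    // the reduce value is my assumption, not something from this thread):
    //
    //   mapred.tasktracker.map.tasks.maximum    -> 10
    //   mapred.tasktracker.reduce.tasks.maximum -> 8
    //
    // Per job you then only hand out a hint and the reduce count:
    public static void configure(JobConf conf) {
        // Hint only: the real number of map tasks comes from the input splits.
        conf.setNumMapTasks(100);
        // The reduce count is taken as given.
        conf.setNumReduceTasks(70);
    }
}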

When you drill into the data on the web control panel for the map-reduce
system, you will eventually find data for individual map "tips".  These are
units of work, and you can see when they are assigned to tasks on nodes.
Sometimes a tip will be assigned to more than one node if the job tracker
thinks that might be a good idea for fault tolerance (speculative
execution).  Whenever a tip is completed, any other tasks working on the
same tip will be killed.
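
If those duplicate attempts bother you on tiny jobs, you can switch them
off per job. Again only a sketch; the property name is the one from the
current configs and may differ in other versions:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {

    public static void configure(JobConf conf) {
        // Stop the job tracker from scheduling backup attempts of a tip on
        // other nodes while the first attempt is still running.
        conf.setBoolean("mapred.speculative.execution", false);
    }
}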

Does that help a bit?




On 1/16/08 2:50 AM, "Johan Oskarsson" <jo...@oskarsson.nu> wrote:

> I simply followed the wiki "The right level of parallelism for maps
> seems to be around 10-100 maps/node",
> http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
> 
> We have 8 cores in each machine, so perhaps 100 mappers ought to be
> right. It's set to 157 in the config, but Hadoop used ~200 for the job;
> I don't know why. Fewer mappers would of course help in this case, but
> what about when we process large datasets? Especially if a mapper fails.
> 
> Reducers I also set up to use ~1 per core, slightly less.
> 
> /Johan
> 
> Ted Dunning wrote:
>> Why so many mappers and reducers relative to the number of machines you
>> have?  This just causes excess heartache when running the job.
>> 
>> My standard practice is to run with a small factor larger than the number of
>> cores that I have (for instance 3 tasks on a 2 core machine).  In fact, I
>> find it most helpful to have the cluster defaults rule the choice except in
>> a few cases where I want one reducer or a few more than the standard 4
>> reducers.
>> 
>> 
>> On 1/15/08 9:15 AM, "Johan Oskarsson" <jo...@oskarsson.nu> wrote:
>> 
>>> Hi.
>>> 
>>> I believe someone posted about this a while back, but it's worth
>>> mentioning again.
>>> 
>>> I just ran a job on our 10-node cluster where the input data was
>>> ~70 empty sequence files; with our default settings this ran ~200
>>> mappers and ~70 reducers.
>>> 
>>> The job took almost exactly two minutes to finish.
>>> 
>>> How can we reduce this overhead?
>>> 
>>> * Pick number of mappers and reducers in a more dynamic way,
>>>    depending on the size of the input?
>>> * JVM reuse, one jvm per job instead of one per task?
>>> 
>>> Any other ideas?
>>> 
>>> /Johan
>> 
> 


Re: Hadoop overhead

Posted by Johan Oskarsson <jo...@oskarsson.nu>.
I simply followed the wiki "The right level of parallelism for maps 
seems to be around 10-100 maps/node", 
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces

We have 8 cores in each machine, so perhaps 100 mappers ought to be
right. It's set to 157 in the config, but Hadoop used ~200 for the job;
I don't know why. Fewer mappers would of course help in this case, but
what about when we process large datasets? Especially if a mapper fails.

Reducers I also set up to use ~1 per core, slightly less.
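
For what it's worth, this is roughly how I arrive at that number per job.
Sketch only; the node and core counts are hard-coded here and the 0.9
factor is just an illustration:

import org.apache.hadoop.mapred.JobConf;

public class ReduceSlots {

    public static void configure(JobConf conf) {
        int nodes = 10;        // our cluster
        int coresPerNode = 8;
        // Slightly fewer reducers than cores, so a failed or speculative
        // task can be rerun without waiting for a free slot.
        int reducers = (int) (nodes * coresPerNode * 0.9);
        conf.setNumReduceTasks(reducers);
    }
}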

/Johan

Ted Dunning wrote:
> Why so many mappers and reducers relative to the number of machines you
> have?  This just causes excess heartache when running the job.
> 
> My standard practice is to run with a small factor larger than the number of
> cores that I have (for instance 3 tasks on a 2 core machine).  In fact, I
> find it most helpful to have the cluster defaults rule the choice except in
> a few cases where I want one reducer or a few more than the standard 4
> reducers.
> 
> 
> On 1/15/08 9:15 AM, "Johan Oskarsson" <jo...@oskarsson.nu> wrote:
> 
>> Hi.
>>
>> I believe someone posted about this a while back, but it's worth
>> mentioning again.
>>
>> I just ran a job on our 10-node cluster where the input data was
>> ~70 empty sequence files; with our default settings this ran ~200
>> mappers and ~70 reducers.
>>
>> The job took almost exactly two minutes to finish.
>>
>> How can we reduce this overhead?
>>
>> * Pick number of mappers and reducers in a more dynamic way,
>>    depending on the size of the input?
>> * JVM reuse, one jvm per job instead of one per task?
>>
>> Any other ideas?
>>
>> /Johan
> 


Re: Hadoop overhead

Posted by Ted Dunning <td...@veoh.com>.
Why so many mappers and reducers relative to the number of machines you
have?  This just causes excess heartache when running the job.

My standard practice is to run with a small factor larger than the number of
cores that I have (for instance, 3 tasks on a 2-core machine).  In fact, I
find it most helpful to have the cluster defaults rule the choice except in
a few cases where I want one reducer or a few more than the standard 4
reducers.
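
In code that usually means not setting the counts at all and only
overriding the odd job, along these lines (sketch only; the flag name is
made up):

import org.apache.hadoop.mapred.JobConf;

public class MostlyDefaults {

    public static void configure(JobConf conf, boolean wantSingleOutputFile) {
        // Normally: leave mapred.map.tasks and mapred.reduce.tasks alone and
        // let the cluster defaults (a small factor over the core count) apply.
        if (wantSingleOutputFile) {
            // The occasional job that has to produce one combined output file.
            conf.setNumReduceTasks(1);
        }
    }
}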


On 1/15/08 9:15 AM, "Johan Oskarsson" <jo...@oskarsson.nu> wrote:

> Hi.
> 
> I believe someone posted about this a while back, but it's worth
> mentioning again.
> 
> I just ran a job on our 10-node cluster where the input data was
> ~70 empty sequence files; with our default settings this ran ~200
> mappers and ~70 reducers.
> 
> The job took almost exactly two minutes to finish.
> 
> How can we reduce this overhead?
> 
> * Pick number of mappers and reducers in a more dynamic way,
>    depending on the size of the input?
> * JVM reuse, one jvm per job instead of one per task?
> 
> Any other ideas?
> 
> /Johan