Posted to general@hadoop.apache.org by Eric Sammer <es...@cloudera.com> on 2010/05/11 18:59:58 UTC

Re: Problem with Hadoop Streaming and -D mapred.tasktracker.map.tasks.maximum option

The short answer is that with Hadoop, you generally do not decide the
exact number of map tasks that are spawned. The number of map tasks
spawned is usually a function of the number of blocks in the input
data set. Task trackers are configured with a number of slots for map
and reduce tasks. Tasks are assigned to slots on task trackers. By
default, each task tracker has 2 map slots and 2 reduce slots.
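
If you want a quick way to see how many blocks (and therefore, roughly,
how many map tasks) a given input will produce, you can ask HDFS
directly. A sketch using your input path (adjust it to wherever the data
actually lives in HDFS):

  /opt/hadoop/bin/hadoop fsck /user/<you>/input/test553short -files -blocks

Each block listed will normally become one map task, assuming the input
format splits at block boundaries the way the default one does.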

The manner in which Hadoop assigns tasks to task trackers depends on a
number of factors, including data locality and which task trackers
currently have free slots.

You can attempt to control parallelization at a micro level (as you're
doing), but it's generally a bad idea. Not only are you not taking full
advantage of your cluster, you are not taking advantage of what
Hadoop is actually good at. In fact, you cannot control it this way at
all (more on that below). Is there a reason why you need to control
things so strictly? Do you need exactly a multiple of the number of
nodes, or an approximation thereof? What is the rationale for wanting
to run only one task per node?
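
That said, if you do need to cap concurrent map tasks per node, the knob
lives in each task tracker's own configuration, not in the job. A minimal
mapred-site.xml sketch (property name as in 0.20; it must be set on every
worker node):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>

The task tracker reads this value once, at daemon startup, which is why
passing it with -D on a job submission has no effect.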

On Mon, May 10, 2010 at 10:07 AM, Corneliu-Tudor Vlad
<co...@ens-lyon.fr> wrote:
>
> Hello
>
> I am a new user of Hadoop and I have some trouble using Hadoop Streaming and
> the "-D mapred.tasktracker.map.tasks.maximum" option.
>
> I'm experimenting with an unmanaged application (C++) which I want to run
> over several nodes in 2 scenarios:
> 1) the number of maps (input splits) is equal to the number of nodes
> 2) the number of maps is a multiple of the number of nodes (5, 10, 20, ...)
>
> Initially, when running the tests in scenario 1, I would sometimes get 2
> processes/node on half the nodes. However, I fixed this by adding the
> directive -D mapred.tasktracker.map.tasks.maximum=1, and everything worked
> fine.
>
> In the case of scenario 2 (more maps than nodes) this directive no longer
> works; I always obtain 2 processes/node. I tested even with maximum=5 and
> I still get 2 processes/node.
>
> The entire command I use is:
>
> /usr/bin/time --format="-duration:\t%e |\t-MFaults:\t%F
> |\t-ContxtSwitch:\t%w" \
>  /opt/hadoop/bin/hadoop jar
> /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
>  -D mapred.tasktracker.map.tasks.maximum=1 \
>  -D mapred.map.tasks=30 \
>  -D mapred.reduce.tasks=0 \
>  -D io.file.buffer.size=5242880 \
>  -libjars "/opt/hadoop/contrib/streaming/hadoop-7debug.jar" \
>  -input input/test553short \
>  -output out1 \
>  -mapper "/opt/jobdata/script_1k" \
>  -inputformat "me.MyInputFormat"
>
> I'm using Debian Lenny x64 and Hadoop 0.20.2.
>
> My question is: why is this happening, and how can I make it work properly
> (i.e. be able to limit exactly how many mappers can run at one time per
> node)?
>
> Thank you in advance,
> T
>
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

Re: Problem with Hadoop Streaming and -D mapred.tasktracker.map.tasks.maximum option

Posted by Corneliu-Tudor Vlad <co...@ens-lyon.fr>.
Hello

Thank you for the answer; it is a little clearer now. If you could
point me to some additional reading, I would be grateful.

In the meantime I created an issue in Jira [ MAPREDUCE-1781 ], where I
both received an answer and provided more insight into my objective.

As I understand from both your answer and Hemanth's on Jira, I am
using Hadoop in a non-standard way. The reason is that I am running
some tests on the feasibility of using Hadoop as a parallelization
framework for a highly CPU- and memory-bound application, but only from
the point of view of distributed computing, not multicore. That is why
I only want 1 process at a time.

Additionally, I will test it on a heterogeneous datacenter, possibly
with both dual-core and quad-core machines, so even if I use 2 mappers
at once I won't fully use the power of the cluster (from what I
understand).

From what I tested today, my intended approach works when the
tasks.maximum option is set in the config file at startup.
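
Concretely, a sketch of what that looks like (assuming a standard 0.20
tarball layout under /opt/hadoop; paths may differ): set
mapred.tasktracker.map.tasks.maximum=1 in conf/mapred-site.xml on every
worker node, then restart the MapReduce daemons so the task trackers
pick it up:

  /opt/hadoop/bin/stop-mapred.sh
  /opt/hadoop/bin/start-mapred.sh

Note that this restarts the jobtracker as well as the task trackers.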

Thank you,
T

