Posted to user@mahout.apache.org by Lokendra Singh <ls...@gmail.com> on 2011/01/19 18:49:45 UTC

How does mahout decide upon the number of map-reduce tasks to be launched (utilising multi-core nodes)

Hi all,

I am running the KMeans algorithm from mahout-0.4 on a Hadoop (0.20.2) cluster.

Each node in my cluster has a quad-core processor, so I want to launch
3 map and 3 reduce tasks on each node (leaving 1 core for the DataNode and
TaskTracker services).
I therefore set the properties:
mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum to 3,
and
mapred.map.tasks and mapred.reduce.tasks to 3*(no. of nodes).

I tested this on a 2-node and a 6-node cluster, but in both cases only
5 map tasks and 2 reduce tasks are launched in total. On the 2-node
cluster this uses ~3 cores on each node, but on the 6-node cluster it
underutilizes resources, since only ~1 core of each node is used.
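
As a rough check of those per-node numbers, here is a minimal sketch. It assumes, purely for illustration, that all 7 tasks run at the same time, are spread evenly across nodes, and each keep about one core busy; the function name tasks_per_node is made up for this example.

```python
def tasks_per_node(map_tasks, reduce_tasks, nodes):
    """Average number of concurrently running tasks per node.

    Assumption (illustration only): every task runs at once, tasks are
    spread evenly, and each task keeps roughly one core busy.
    """
    return (map_tasks + reduce_tasks) / nodes


# 3 map slots + 3 reduce slots per node, as configured in this post.
SLOTS_PER_NODE = 3 + 3

for nodes in (2, 6):
    busy = tasks_per_node(5, 2, nodes)
    print(f"{nodes} nodes: ~{busy:.1f} tasks per node, "
          f"{SLOTS_PER_NODE} slots per node available")
```

Under those assumptions the average is about 3.5 tasks per node on 2 nodes and about 1.2 on 6 nodes, which is in line with the ~3 and ~1 busy cores observed.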

Please explain why the same fixed number of tasks (5 map, 2 reduce) is
launched in both cases.
I am guessing that the number of map-reduce tasks depends on the input
data for the KMeans algorithm (sorry, I did not test with different input
data). In that case, how can I properly utilize the 6-node cluster?


Regards
Lokendra

Re: How does mahout decide upon the number of map-reduce tasks to be launched (utilising multi-core nodes)

Posted by james q <ja...@gmail.com>.
Hey,

I'm a bit of a mahout / hadoop newbie myself, but from what I know, the
number of map tasks is determined solely by the input. You can give it a
hint via mapred.map.tasks, but it's only a hint. To change the number of map
tasks, you need to set dfs.block.size (64M by default) and
mapred.max.split.size to something smaller (but a multiple of 512).

So it seems that 64M generated only 5 map tasks, when you want a total of 18
(3 map tasks on each of 6 machines). A block size of roughly 1/4 of that,
around 17M, would get you about 18 map tasks ( -Ddfs.block.size=17825792
-Dmapred.max.split.size=17825792 ). I don't know if this is generally
advised by Mahout users, but it should help.
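
To make that arithmetic concrete, here is a minimal sketch of how I understand the split calculation in Hadoop's newer FileInputFormat: the split size is max(minSize, min(maxSize, blockSize)), and the final split is allowed to run up to 10% over. The thread does not state the input size, so the 300 MB single input file below is a hypothetical value picked only because it is consistent with 5 map tasks at 64M; the helper names are also made up for this example.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size as max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Number of input splits (one map task each) for a single file."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than 1.1 splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    # Whatever is left over becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


MB = 1024 * 1024
input_size = 300 * MB  # hypothetical: the real input size is not given

default_split = compute_split_size(block_size=64 * MB)
smaller_split = compute_split_size(block_size=17825792, max_size=17825792)

print(count_splits(input_size, default_split))  # map tasks at 64M
print(count_splits(input_size, smaller_split))  # map tasks at ~17M
```

With that assumed input size, the sketch gives 5 splits at 64M and 18 at 17825792 bytes. A different input size, or input spread over several files, would give different counts, so treat the 18 as approximate.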

The number of reducers can be set explicitly to 18:
-Dmapred.reduce.tasks=18. However, you said you set mapred.reduce.tasks to
3*(no of nodes) ... are you sure that value is in every node's conf files?

-- james
