Posted to mapreduce-user@hadoop.apache.org by Juwei Shi <sh...@gmail.com> on 2011/04/19 11:58:21 UTC
Questions about MultithreadedMapper

Hi,



I am looking at multithreaded map tasks. The new API provides the
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper class to run
multiple threads within each map task. The number of threads in the pool
that runs the map function can be set with the setNumberOfThreads API.
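
To make the idea concrete, here is a minimal sketch in plain Java of several
threads inside one task pulling records from a shared reader and running the
same map function. It is not Hadoop code and does not claim to show how
MultithreadedMapper is implemented; the class MultithreadedMapSketch, the
method run, and the word-count map function are hypothetical names chosen
only for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java sketch (NOT Hadoop code): several threads inside one "task"
// pull records from a shared reader and run the same map function.
class MultithreadedMapSketch {

    // Counts words across all records using numThreads worker threads.
    static Map<String, Integer> run(List<String> records, int numThreads) {
        Iterator<String> reader = records.iterator();              // shared input
        Map<String, Integer> output = new ConcurrentHashMap<>();   // shared output

        Runnable worker = () -> {
            while (true) {
                String record;
                // Only one thread at a time may advance the shared reader.
                synchronized (reader) {
                    if (!reader.hasNext()) {
                        return;
                    }
                    record = reader.next();
                }
                // The "map function": runs concurrently in every thread.
                for (String word : record.split("\\s+")) {
                    if (!word.isEmpty()) {
                        output.merge(word, 1, Integer::sum); // atomic per-key update
                    }
                }
            }
        };

        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // the task is done only when every thread is done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a b a", "b c", "a");
        System.out.println(run(records, 4));
    }
}
```

In this sketch the numThreads argument plays the role that
setNumberOfThreads plays for the real class: it fixes how many threads share
the records of a single task.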



Here I want to clarify the scenarios in which multithreaded map tasks should
be enabled. Hadoop MapReduce provides the
mapred.tasktracker.map.tasks.maximum parameter to control how many map tasks
run concurrently on each TaskTracker (there is a corresponding parameter for
reduce tasks). We can start more child task JVMs to increase CPU
utilization, so multithreaded tasks are not needed in most scenarios.
However, they may be useful in the following specific scenarios:

1) When the workload is bound by memory or I/O, not CPU. For example,
suppose each running map task loads its input into memory and the cluster
can hold at most 50 GB of input at a time, yet the CPU of the cluster is not
fully utilized. Then we can enable multithreaded tasks to increase CPU
utilization.

2) When the tasks are unbalanced. I have encountered this problem when
processing very large social graphs. When I assign 200 map tasks (7 nodes in
total, on average 8 concurrent map tasks per node), 99% of the tasks
complete within 1 hour, but the remaining 1% take more than 10 hours. This
is caused by the unbalanced degree distribution of the social graph. Once
most tasks have completed, the CPU utilization of the nodes still running is
lower than 20%. I think we can enable multithreaded tasks in this case to
increase CPU utilization.



My questions:

1. Is the above understanding right?

2. Why is there no multithreaded reducer interface?

3. How should the right number of threads be set? (Is it the number that
keeps all cores utilized?)

4. Some earlier articles point out that we should pay attention to thread
safety when using the multithreaded mapper. I do not quite understand this.
The basic MapReduce model naturally isolates each key. I would guess that a
key is processed within a single thread even when the multithreaded mapper
is enabled, so how could multiple threads interact with each other?
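
As an illustration of question 4, here is a minimal plain-Java sketch of one
possible way threads could interact even when every key is handled by exactly
one thread: through state that lives outside the key, such as a shared
counter. Whether this is what those articles had in mind is an assumption. It
is not Hadoop code; the names SharedStateSketch, Totals, map, and run are
hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Plain-Java sketch (NOT Hadoop code): each thread handles its own keys,
// yet all threads touch the same counters.
class SharedStateSketch {

    // State shared by every thread, independent of any key.
    static class Totals {
        int unsafe = 0;                            // plain field, no synchronization
        final AtomicLong safe = new AtomicLong();  // thread-safe counter
    }

    // Hypothetical map function: the key itself is never shared between
    // threads, but the counters are.
    static void map(Totals totals, String key) {
        totals.unsafe++;                 // read-modify-write; updates can be lost
        totals.safe.incrementAndGet();   // atomic; no update is lost
    }

    // Returns { unsafe total, safe total } after all threads have finished.
    static long[] run(int numThreads, int keysPerThread) {
        Totals totals = new Totals();
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int k = 0; k < keysPerThread; k++) {
                    map(totals, "key-" + id + "-" + k); // keys are disjoint per thread
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new long[] { totals.unsafe, totals.safe.get() };
    }

    public static void main(String[] args) {
        long[] totals = run(4, 100000);
        System.out.println("unsafe=" + totals[0] + " safe=" + totals[1]);
    }
}
```

The safe total always equals numThreads * keysPerThread. The unsafe total
may come out lower on some runs, because concurrent increments of a plain
field can overwrite each other.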



Discussion and comments are welcome!

-- 
- Juwei