Posted to mapreduce-user@hadoop.apache.org by juber patel <ju...@gmail.com> on 2010/05/17 03:55:43 UTC

MultithreadedMapRunner or MultithreadedMapper?

Hello,


I am a bit confused about the MultithreadedMapRunner and
MultithreadedMapper classes. Basically I have huge "side data" (4GB)
for the map part and I want it in memory. I don't want each mapper to
load its own copy of that data. So I decided to limit each machine to
one mapper and make it multithreaded so that all the cores are
utilized. The side data is read-only and can be shared by all threads.

My question is: which of the MultithreadedMapRunner and
MultithreadedMapper classes should I be using? Or do they have to be
used together (choose MultithreadedMapRunner in the config file and
then extend MultithreadedMapper for map tasks)? I notice that one is in
the mapred package and the other is in the mapreduce package, but
neither is deprecated. I can use the latest version of Hadoop since I
am just starting out.


thanks in advance,


Juber

Re: MultithreadedMapRunner or MultithreadedMapper?

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
Hi Juber,

MultithreadedMapper uses the new API that was introduced in branch 0.20, whereas MultithreadedMapRunner uses the old interface.
MultithreadedMapRunner is deprecated in branch 0.21 through https://issues.apache.org/jira/browse/MAPREDUCE-465.
If you are using branch 0.20, you can use either one of them, but do not use them together.
I would prefer MultithreadedMapper, because the other will be deprecated in subsequent versions.

Thanks
Amareshwari
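
To make the idea in the question concrete, here is a minimal plain-Java sketch of the pattern being discussed: one process loads the read-only side data once and several worker threads share it. It deliberately uses no Hadoop classes so that it runs on its own; the class name SharedSideDataSketch, the mapRecord method, and the key/value lookup are all hypothetical stand-ins, not part of Hadoop. With the new API, the equivalent wiring would presumably go through MultithreadedMapper.setMapperClass and MultithreadedMapper.setNumberOfThreads on the job, but that is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: one JVM, one copy of the side data, many threads.
public class SharedSideDataSketch {

    // Stand-in for the large side data; wrapped as unmodifiable so that
    // concurrent reads from many threads are safe.
    private final Map<String, String> sideData;

    public SharedSideDataSketch(Map<String, String> sideData) {
        this.sideData = Collections.unmodifiableMap(new HashMap<>(sideData));
    }

    // Hypothetical per-record "map" step: look the record up in the shared data.
    String mapRecord(String record) {
        String value = sideData.get(record);
        return record + "=" + (value != null ? value : "?");
    }

    // Processes all records on a fixed pool of threads that share sideData.
    // Results are returned in the same order as the input records.
    public List<String> run(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final String record : records) {
                Callable<String> task = () -> mapRecord(record);
                futures.add(pool.submit(task));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get()); // blocks until that record has been processed
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The side data is copied once in the constructor and never mutated afterwards, which is what makes sharing it across threads without locking reasonable in this sketch.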