Posted to mapreduce-user@hadoop.apache.org by Jyothish Soman <jy...@gmail.com> on 2010/06/11 07:30:51 UTC

Multithreaded Mapper and Map runner

Hi,

I am a newbie to Hadoop. I want to use the multithreaded map runner by
default, so I tried to change the MapTask.java code. It failed to compile
with ant because of a conflict between the mapreduce and mapred libraries.
Can you please suggest a way to make the multithreaded runner the default?

Regards,
Jyothish Soman

Re: Multithreaded Mapper and Map runner

Posted by Ted Yu <yu...@gmail.com>.

If only a thread were created to run each mapper/reducer, how would
mapred.child.java.opts take effect?

Please refer to src/mapred/org/apache/hadoop/mapred/TaskRunner.java, which
is not very long.
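
To make the point concrete, here is a sketch of how the two settings
discussed in this thread might sit together in mapred-site.xml. It assumes
the 0.20-era property names used in the thread; the values 8 and -Xmx512m
are illustrative only. mapred.child.java.opts holds command-line options
for a java process, which only makes sense if each task slot runs its task
in a separate child JVM rather than as a thread inside the TaskTracker.

```xml
<!-- mapred-site.xml fragment; values are illustrative assumptions -->
<property>
  <!-- number of map task slots (child JVMs) per TaskTracker -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <!-- JVM options passed to each child task process -->
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```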

On Wed, Jun 16, 2010 at 9:10 PM, Jyothish Soman <jy...@gmail.com> wrote:

>
> I have another doubt, for cross checking. The number set in
> mapred.tasktracker.map/reduce.tasks.maximum creates that many JVM instances,
> or does it just create that many threads. Though I could not see any
> explicit statement about it, it was pointed everywhere as if it is a JVM
> instance.
> Please do clarify

Re: Multithreaded Mapper and Map runner

Posted by Jyothish Soman <jy...@gmail.com>.

I have another question, as a cross-check. Does the number set in
mapred.tasktracker.map.tasks.maximum or
mapred.tasktracker.reduce.tasks.maximum create that many JVM instances, or
just that many threads? I could not find an explicit statement about it,
but everywhere I looked it was described as if each one is a JVM instance.
Please clarify.

On Mon, Jun 14, 2010 at 2:04 AM, Jyothish Soman <jy...@gmail.com> wrote:

> Ok, understood this part, even though the architecture of hadoop is
> designed for thread safety, the actual implementation level details make it
> thread unsafe.
>
> Thank you for the comments, did a good background check and figured out
> that staying within the hadoop framework, best way to manage multicore is
> virtualization. Not just simple multithreading.
>
> Regards,
> Jyothish Soman

Re: Multithreaded Mapper and Map runner

Posted by Jyothish Soman <jy...@gmail.com>.

OK, I understand this part: even though the architecture of Hadoop is
designed for thread safety, implementation-level details make it
thread-unsafe.

Thank you for the comments. I did a background check and figured out that,
staying within the Hadoop framework, the best way to manage multicore
machines is virtualization, not just simple multithreading.

Regards,
Jyothish Soman


On Fri, Jun 11, 2010 at 7:09 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> This will likely break most programs you try to run. Many mapper
> implementations are not thread safe.
>
> That having been said, if you want to force all programs using the old API
> (org.apache.hadoop.mapred.*) to run on the multithreaded maprunner, you can
> do this by setting mapred.map.runner.class to
> org.apache.hadoop.mapred.lib.MultithreadedMapRunner in mapred-site.xml.
>
> Rather than do this in mapred-site.xml, it is far preferable to explicitly
> call jobConf.setMapRunnerClass() in the applications that require the
> multithreaded map runner.
>
> In the new API, the MapRunnable interface is not used. Instead the
> Mapper.run() method controls the execution of the map() method. For your own
> applications, you should subclass
> org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of
> o.a.h.mapreduce.Mapper. This will provide a multithreaded run() method. I am
> pretty sure that you cannot independently switch out the run() layer of an
> existing application except by modifying its source to subclass the
> MultithreadedMapper.
>
> Finally, you should really ask yourself why you're doing this. If you have
> multi-core machines, the best way to manage parallelism is to configure
> Hadoop to use multiple task slots per machine. Set
> mapred.tasktracker.map.tasks.maximum to '8' to use eight map tasks per node
> (This is changed to mapreduce.tasktracker.map.tasks.maximum in 0.21+). This
> allows single-threaded mapper code to efficiently process multiple input
> splits in parallel. The only time when it's better to use multithreaded
> maprunners is when a specific map() process is high-latency; e.g., you're
> running a web crawler in a mapper, and you want to overlap requests to
> foreign sites. But since this is not the norm, you should generally leave
> things singlethreaded.
>
> Hope this helps
> Cheers
> - Aaron

Re: Multithreaded Mapper and Map runner

Posted by Aaron Kimball <aa...@cloudera.com>.

This will likely break most programs you try to run. Many mapper
implementations are not thread-safe.
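
As an illustration of the thread-safety concern, here is a minimal sketch
in plain Java. The two "mapper" classes are hypothetical stand-ins, not
Hadoop classes: one keeps a plain mutable field, as much single-threaded
mapper code does, and the other uses an AtomicLong. Under concurrent map()
calls the plain field can lose updates, while the atomic version cannot.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins for mapper classes; not part of the Hadoop API.
class SharedStateSketch {

    // Typical single-threaded style: a plain field updated inside map().
    static class UnsafeCountingMapper {
        long records = 0;
        void map(String value) { records++; } // read-modify-write, not atomic
    }

    // Same logic, safe for concurrent map() calls.
    static class SafeCountingMapper {
        final AtomicLong records = new AtomicLong();
        void map(String value) { records.incrementAndGet(); }
    }

    // Runs `task` callsPerThread times on each of `threads` threads, then waits.
    static void runConcurrently(int threads, int callsPerThread, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) task.run();
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static long safeCount(int threads, int callsPerThread) {
        SafeCountingMapper m = new SafeCountingMapper();
        runConcurrently(threads, callsPerThread, () -> m.map("x"));
        return m.records.get(); // always threads * callsPerThread
    }

    static long unsafeCount(int threads, int callsPerThread) {
        UnsafeCountingMapper m = new UnsafeCountingMapper();
        runConcurrently(threads, callsPerThread, () -> m.map("x"));
        return m.records; // may be lower: concurrent increments can be lost
    }

    public static void main(String[] args) {
        System.out.println("safe:   " + safeCount(8, 100_000));
        System.out.println("unsafe: " + unsafeCount(8, 100_000));
    }
}
```

The unsafe result varies from run to run, which is exactly what makes this
kind of bug hard to spot in a mapper that was only ever tested on a single
thread.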

That having been said, if you want to force all programs using the old API
(org.apache.hadoop.mapred.*) to run on the multithreaded maprunner, you can
do this by setting mapred.map.runner.class to
org.apache.hadoop.mapred.lib.MultithreadedMapRunner in mapred-site.xml.
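
For reference, the cluster-wide setting described above would look roughly
like this in mapred-site.xml. The property name and class name are the ones
given in the paragraph above; the surrounding layout is the standard Hadoop
configuration format.

```xml
<!-- mapred-site.xml fragment (old org.apache.hadoop.mapred API only) -->
<property>
  <name>mapred.map.runner.class</name>
  <value>org.apache.hadoop.mapred.lib.MultithreadedMapRunner</value>
</property>
```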

Rather than do this in mapred-site.xml, it is far preferable to explicitly
call jobConf.setMapRunnerClass() in the applications that require the
multithreaded map runner.

In the new API, the MapRunnable interface is not used. Instead, the
Mapper.run() method controls the execution of the map() method. For your own
applications, you should subclass
org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of
org.apache.hadoop.mapreduce.Mapper. This will provide a multithreaded run()
method. I am pretty sure that you cannot independently switch out the run()
layer of an existing application except by modifying its source to subclass
MultithreadedMapper.
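
The relationship between run() and map() can be sketched in plain Java.
The classes below are simplified, hypothetical stand-ins and not the Hadoop
classes: SimpleMapper plays the role of a new-API Mapper whose run() loops
over the input and calls map(), and ThreadedMapper overrides only run() so
that several threads pull records from the shared input. Hadoop's own
MultithreadedMapper differs in detail; for example, it is normally
configured with the real mapper class through its static setMapperClass()
and setNumberOfThreads() methods rather than sharing a single instance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Simplified, hypothetical stand-ins; these are NOT the Hadoop classes.
class RunLoopSketch {

    // Plays the role of a new-API Mapper: run() drives map() over the input.
    static class SimpleMapper {
        void map(String record, List<String> out) {
            out.add("mapped:" + record);
        }

        void run(Iterator<String> input, List<String> out) {
            while (input.hasNext()) {
                map(input.next(), out);
            }
        }
    }

    // Plays the role of a multithreaded mapper: only run() changes.
    static class ThreadedMapper extends SimpleMapper {
        final int threads;

        ThreadedMapper(int threads) { this.threads = threads; }

        @Override
        void run(Iterator<String> input, List<String> out) {
            List<String> safeOut = Collections.synchronizedList(out);
            List<Thread> workers = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Thread t = new Thread(() -> {
                    while (true) {
                        String record;
                        synchronized (input) { // the record source is shared
                            if (!input.hasNext()) return;
                            record = input.next();
                        }
                        map(record, safeOut);  // map() itself runs concurrently
                    }
                });
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
        }
    }

    // Runs a mapper over the records and returns its output, sorted because
    // output order is not defined when several threads emit records.
    static List<String> mapAll(SimpleMapper mapper, List<String> records) {
        List<String> out = new ArrayList<>();
        mapper.run(records.iterator(), out);
        Collections.sort(out);
        return out;
    }
}
```

Note that map() must be thread-safe for the threaded variant to be correct,
which ties back to the warning at the top of this message.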

Finally, you should really ask yourself why you're doing this. If you have
multi-core machines, the best way to manage parallelism is to configure
Hadoop to use multiple task slots per machine. Set
mapred.tasktracker.map.tasks.maximum to '8' to use eight map tasks per node
(this is renamed to mapreduce.tasktracker.map.tasks.maximum in 0.21+). This
allows single-threaded mapper code to efficiently process multiple input
splits in parallel. The only time it's better to use multithreaded
map runners is when a specific map() call is high-latency; e.g., you're
running a web crawler in a mapper, and you want to overlap requests to
foreign sites. But since this is not the norm, you should generally leave
things single-threaded.
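
The high-latency case can be sketched in plain Java as follows. This is an
illustration under stated assumptions, not Hadoop code: slowFetch() is a
hypothetical stand-in for a remote request and simulates latency with
Thread.sleep(). With a pool of several threads the waits overlap, so the
total wall-clock time is roughly the per-request latency times the number
of URLs divided by the number of threads, rather than the sum of all waits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of overlapping slow requests; not Hadoop code.
class LatencyOverlapSketch {

    // Stand-in for a slow remote call such as fetching a web page.
    static String slowFetch(String url, long latencyMillis) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "fetched:" + url;
    }

    // Issues all fetches on a fixed pool and returns results in input order.
    static List<String> fetchAll(List<String> urls, int threads, long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> slowFetch(url, latencyMillis);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that fetch completes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a", "b", "c", "d");
        long start = System.nanoTime();
        fetchAll(urls, 4, 200);
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

For CPU-bound map() code there is no waiting to overlap, which is why extra
task slots are the better tool in the common case.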

Hope this helps
Cheers
- Aaron

On Fri, Jun 11, 2010 at 7:30 AM, Jyothish Soman <jy...@gmail.com> wrote:

> Hi,
>
> I am a newbie to Hadoop. I want to use the Multi threaded runner by
> default, so I tried to change the MapTask.java code. it failed to compile
> using ant, as mapreduce - mapred library conflict was there, Can you please
> suggest a way through, so that  I can use the same.
>
> Regards,
> Jyothish Soman
>