You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-user@hadoop.apache.org by Rob Stewart <ro...@gmail.com> on 2012/01/31 16:51:12 UTC

Hybrid Hadoop with fork/join ?

Hi,

I'm investigating the feasibility of a hybrid approach to parallel
programming, by fusing together the concurrent Java fork/join
libraries with Hadoop... MapReduce, a paradigm suited for scalable
execution over distributed memory + fork/join, a paradigm for optimal
multi-threaded shared memory execution.

I am aware that, to a degree, Hadoop can take advantage of multiple
core on a compute node, by setting the
mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As
it states in the "Hadoop: The definitive guide" book, each task runs
in a separate JVM. So setting maximum map tasks to 3, will allow the
possibility of 3 JVMs running on the Operating System, right?

As an alternative to this approach, I am looking to introduce
fork/join into map tasks, and perhaps maybe too, throttling down the
maximum number of map tasks per node to 1. I would implement a
RecordReader that produces map tasks for more coarse granularity, and
that fork/join would subdivide to map task using multiple threads -
where the output of the join would be the output of the map task.

The motivation for this is that threads are a lot more lightweight
than initializing JVMs, which as the Hadoop book points out, takes a
second or so each time, unless mapred.job.resuse.jvm.num.tasks is set
higher than 1 (the default is 1). So for example, opting for small
granular maps for a particular job for a given input generates 1,000
map tasks, which will mean that 1,000 JVMs will be created on the
Hadoop cluster within the execution of the program. If I were to write
a bespoke RecordReader to increase the granularity of each map, so
much so that only 100 map tasks are created. The embedded fork/join
code would further split each map into 10 threads, to evaluate within
one JVM concurrently. I would expect this latter approach of multiple
threads to have better performance, than the clunkier multple-JVM
approach.

Has such a hybrid approach combining MapReduce with ForkJoin been
investigated for feasibility, or similar studies published? Are there
any important technical limitations that I should consider? Any
thoughts on the proposed multi-threaded distributed-shared memory
architecture are much appreciated from the Hadoop community!

--
Rob Stewart

Re: Hybrid Hadoop with fork/join ?

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Rob,

Hadoop has as a way to run Map tasks in multithreading mode, look for the
MultithreadedMapRunner & MultithreadedMapper.

Thanks.

Alejandro.

On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart <ro...@gmail.com> wrote:

> Hi,
>
> I'm investigating the feasibility of a hybrid approach to parallel
> programming, by fusing together the concurrent Java fork/join
> libraries with Hadoop... MapReduce, a paradigm suited for scalable
> execution over distributed memory + fork/join, a paradigm for optimal
> multi-threaded shared memory execution.
>
> I am aware that, to a degree, Hadoop can take advantage of multiple
> core on a compute node, by setting the
> mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As
> it states in the "Hadoop: The definitive guide" book, each task runs
> in a separate JVM. So setting maximum map tasks to 3, will allow the
> possibility of 3 JVMs running on the Operating System, right?
>
> As an alternative to this approach, I am looking to introduce
> fork/join into map tasks, and perhaps maybe too, throttling down the
> maximum number of map tasks per node to 1. I would implement a
> RecordReader that produces map tasks for more coarse granularity, and
> that fork/join would subdivide to map task using multiple threads -
> where the output of the join would be the output of the map task.
>
> The motivation for this is that threads are a lot more lightweight
> than initializing JVMs, which as the Hadoop book points out, takes a
> second or so each time, unless mapred.job.resuse.jvm.num.tasks is set
> higher than 1 (the default is 1). So for example, opting for small
> granular maps for a particular job for a given input generates 1,000
> map tasks, which will mean that 1,000 JVMs will be created on the
> Hadoop cluster within the execution of the program. If I were to write
> a bespoke RecordReader to increase the granularity of each map, so
> much so that only 100 map tasks are created. The embedded fork/join
> code would further split each map into 10 threads, to evaluate within
> one JVM concurrently. I would expect this latter approach of multiple
> threads to have better performance, than the clunkier multple-JVM
> approach.
>
> Has such a hybrid approach combining MapReduce with ForkJoin been
> investigated for feasibility, or similar studies published? Are there
> any important technical limitations that I should consider? Any
> thoughts on the proposed multi-threaded distributed-shared memory
> architecture are much appreciated from the Hadoop community!
>
> --
> Rob Stewart
>