Posted to common-user@hadoop.apache.org by Martin Jaggi <m....@gmail.com> on 2008/06/01 04:51:40 UTC

Realtime Map Reduce = Supercomputing for the Masses?

Concerning real-time Map Reduce within (and not only between) machines  
(multi-core & GPU), e.g. the Phoenix and Mars frameworks:

I'm really interested in very fast Map Reduce tasks, i.e. without much
disk access. With the rise of multi-core systems, this could get more
and more interesting, and could maybe even lead to something like
'super-computing for everyone', or is that a bit of an overstatement?
Anyway, I was pleasantly surprised to see the recent Phoenix
(http://csl.stanford.edu/~christos/sw/phoenix/) implementation of Map
Reduce for multi-core CPUs (the paper won the best paper award at
HPCA'07).

Recently GPU computing was in the news again, pushed by Nvidia (see
CUDA: http://www.nvidia.com/object/cuda_showcase.html ), and there too a
Map Reduce implementation, called Mars, has become available:
http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
The Mars people say at the end of their paper: "We are also interested
in integrating Mars into the existing Map Reduce implementations such
as Hadoop so that the Map Reduce framework can take the advantage of
the parallelism among different machines as well as the parallelism
within each machine."

What do you think of this, especially the multi-core approach?
Do you think these needs are already served by Hadoop's current
InMemoryFileSystem, or not? Are there any plans to 'integrate' one of
the two frameworks above?
Or would it already be addressed by reducing the significant overhead
of intermediate data pairs
(https://issues.apache.org/jira/browse/HADOOP-3366)?
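
To make concrete what I mean by Map Reduce 'within a machine', here is a
tiny, purely illustrative Java sketch of an in-memory, multi-threaded
word count (plain java.util.concurrent, nothing to do with Phoenix, Mars
or Hadoop itself):

import java.util.*;
import java.util.concurrent.*;

// Minimal in-memory, multi-core "map reduce": count words across a list
// of documents with one map task per document and a single merge (reduce)
// step. Purely illustrative -- no framework, no disk access.
public class InMemoryWordCount {

    public static void main(String[] args) throws Exception {
        List<String> documents = Arrays.asList(
                "the quick brown fox", "the lazy dog", "quick quick fox");

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Map phase: each document becomes a partial word->count map.
        List<Future<Map<String, Integer>>> partials =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String doc : documents) {
            partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    Map<String, Integer> counts = new HashMap<String, Integer>();
                    for (String word : doc.split("\\s+")) {
                        Integer c = counts.get(word);
                        counts.put(word, c == null ? 1 : c + 1);
                    }
                    return counts;
                }
            }));
        }

        // Reduce phase: merge the partial maps; everything stays in RAM.
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> f : partials) {
            for (Map.Entry<String, Integer> e : f.get().entrySet()) {
                Integer c = totals.get(e.getKey());
                totals.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        pool.shutdown();
        System.out.println(totals);
    }
}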

Any comments?

Re: In memory Map Reduce

Posted by Martin Jaggi <m....@gmail.com>.
Are there any statistics available to monitor what percentage of the
pairs remains in memory and what percentage is written to disk?

And what are these exceptional cases that you mention?


> Hadoop goes to some lengths to make sure that things can stay in  
> memory as
> much as possible.  There are still cases, however, where intermediate
> results are  normally written to disk.  That means that implementors  
> will
> have those time scales in their head as they do things which will  
> inevitably
> make the trade-offs somewhat poor compared to a system that never  
> envisions
> intermediate data being written to disk.
>
> But other than guessing like this, I couldn't actually say how it  
> would turn
> out except that for very short jobs, moving jar files around and other
> startup costs can be the dominant cost.
>
> On Sun, Jun 1, 2008 at 5:05 AM, Martin Jaggi <m....@gmail.com>  
> wrote:
>
>>
>> So in the case that all intermediate pairs fit into the RAM of the  
>> cluster,
>> does the InMemoryFileSystem already allow the intermediate phase to  
>> be done
>> without much disk access? Or what would be the current bottleneck  
>> in Hadoop
>> in this scenario (huge computational load, not so much data in/out)
>> according to your opinion?


Re: In memory Map Reduce

Posted by Ted Dunning <te...@gmail.com>.
Hadoop goes to some lengths to make sure that things can stay in memory as
much as possible.  There are still cases, however, where intermediate
results are normally written to disk.  That means that implementors will
have those time scales in their heads as they do things, which will inevitably
make the trade-offs somewhat poor compared to a system that never envisions
intermediate data being written to disk.

But other than guessing like this, I couldn't actually say how it would turn
out except that for very short jobs, moving jar files around and other
startup costs can be the dominant cost.
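
One standard lever against the intermediate-data cost mentioned above is
a combiner: aggregating map output on the map side means far fewer
key/value pairs get spilled and shuffled. A minimal sketch against the
old org.apache.hadoop.mapred API of that era follows; the class names
and input/output paths are illustrative.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Word count with a combiner: partial sums are computed map-side, so far
// fewer intermediate pairs are written to disk and shuffled.
public class WordCountWithCombiner {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter r)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter r)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCountWithCombiner.class);
        conf.setJobName("wordcount-with-combiner");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class); // the reducer doubles as combiner
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}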

On Sun, Jun 1, 2008 at 5:05 AM, Martin Jaggi <m....@gmail.com> wrote:

>
> So in the case that all intermediate pairs fit into the RAM of the cluster,
> does the InMemoryFileSystem already allow the intermediate phase to be done
> without much disk access? Or what would be the current bottleneck in Hadoop
> in this scenario (huge computational load, not so much data in/out)
> according to your opinion?
>
>
>

Re: Qt 4.4 / QtConcurrent

Posted by Martin Jaggi <m....@gmail.com>.
Thanks, it's very nice to see that they integrated Map Reduce.

But as I understand it, this does not (yet) work for distributed
systems, only on a single machine.


On 01.06.2008 at 14:33, Brice Arnould wrote:

> Hi !
> With Qt 4.4, Trolltech provides a GPLed implementation of an in-memory
> map/reduce for many languages (at least C++ and Java) as a part of
> QtConcurrent.
> I have not used it yet, but in general their APIs are well thought out
> and their code is very slick. You might want to have a look at this.
>
> Code sample :
> | QImage scaled(const QImage &image) {
> |    return image.scaled(100, 100);
> | }
> | QList<QImage> images = ...;
> | QFuture<QImage> thumbnails = QtConcurrent::mapped(images, scaled);
> Doc :
> http://doc.trolltech.com/4.4/qtconcurrentmap.html#map
> Qt 4.4 GPL :
> http://trolltech.com/downloads/opensource
> Qt 4.4 Commercial :
> http://trolltech.com/downloads/commercial
>
> Brice
>
> On Sunday, 1 June 2008, Martin Jaggi wrote:
>> Thanks for your comments!
>>
>> So in the case that all intermediate pairs fit into the RAM of the
>> cluster, does the InMemoryFileSystem already allow the intermediate
>> phase to be done without much disk access? Or what would be the
>> current bottleneck in Hadoop in this scenario (huge computational
>> load, not so much data in/out) according to your opinion?


Re: In memory Map Reduce

Posted by Brice Arnould <un...@vleu.net>.
Hi !
With Qt 4.4, Trolltech provides a GPLed implementation of an in-memory
map/reduce for many languages (at least C++ and Java) as a part of
QtConcurrent.
I have not used it yet, but in general their APIs are well thought out
and their code is very slick. You might want to have a look at this.

Code sample :
| QImage scaled(const QImage &image) {
|    return image.scaled(100, 100);
| }
| QList<QImage> images = ...;
| QFuture<QImage> thumbnails = QtConcurrent::mapped(images, scaled);
Doc :
http://doc.trolltech.com/4.4/qtconcurrentmap.html#map
Qt 4.4 GPL :
http://trolltech.com/downloads/opensource
Qt 4.4 Commercial :
http://trolltech.com/downloads/commercial

Brice

On Sunday, 1 June 2008, Martin Jaggi wrote:
> Thanks for your comments!
>
> So in the case that all intermediate pairs fit into the RAM of the
> cluster, does the InMemoryFileSystem already allow the intermediate
> phase to be done without much disk access? Or what would be the
> current bottleneck in Hadoop in this scenario (huge computational
> load, not so much data in/out) according to your opinion?

Re: In memory Map Reduce

Posted by Martin Jaggi <m....@gmail.com>.
Thanks for your comments!

So in the case that all intermediate pairs fit into the RAM of the  
cluster, does the InMemoryFileSystem already allow the intermediate  
phase to be done without much disk access? Or what would be the  
current bottleneck in Hadoop in this scenario (huge computational  
load, not so much data in/out) according to your opinion?


On 01.06.2008 at 10:08, Ted Dunning wrote:

> Hadoop is highly optimized towards handling datasets that are much  
> too large
> to fit into memory.  That means that there are many trade-offs that  
> have
> been made that make it much less useful for very short jobs or jobs  
> that
> would fit into memory easily.
>
> Multi-core implementations of map-reduce are very interesting for a  
> number
> of applications as are in-memory implementations for distributed
> architectures.  I don't think that anybody really knows yet how well  
> these
> other implementations will play with Hadoop.  The regimes that they  
> are
> designed to optimize are very different in terms of data scale,  
> number of
> machines and networking speed.  All of these constraints drive the  
> design in
> innumerable ways.
>
> On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com>  
> wrote:
>
>> Concerning real-time Map Reduce within (and not only between)  
>> machines
>> (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
>>
>> I'm really interested in very fast Map Reduce tasks, i.e. without  
>> much disk
>> access. With the rise of multi-core systems, this could get more  
>> and more
>> interesting, and could maybe even lead to something like 'super- 
>> computing
>> for everyone', or is that a bit overwhelming? Anyway I was nicely  
>> surprised
>> to see the recent Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/ 
>>  )
>> implementation of Map Reduce for multi-core CPUs (they won the best  
>> paper
>> award at HPCA'07).
>>
>> Recently also GPU computing was in the news again, pushed by Nvidia  
>> (check
>> CUDA  http://www.nvidia.com/object/cuda_showcase.html ), and now also
>> there a Map Reduce implementation called Mars became available:
>> http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
>> The Mars people say at the end of their paper "We are also
>> interested in
>> integrating Mars into the existing Map Reduce implementations such  
>> as Hadoop
>> so that the Map Reduce framework can take the advantage of the  
>> parallelism
>> among different machines as well as the parallelism within each  
>> machine."
>>
>> What do you think of this, especially about the multi-core  
>> approach? Do you
>> think these needs are already served by the current  
>> InMemoryFileSystem of
>> Hadoop or not? Are there any plans of 'integrating' one of the two  
>> above
>> frameworks?
>> Or would it already be done by improving the significant  
>> intermediate data
>> pairs overhead (https://issues.apache.org/jira/browse/HADOOP-3366 )?
>>
>> Any comments?
>>
>
>
>
> -- 
> ted


Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Steve Loughran <st...@apache.org>.
Alejandro Abdelnur wrote:
> Yes you would have to do it with classloaders (not 'hello world' but not
> 'rocket science' either).

That's where we differ.

I do actually think that classloaders are incredibly hard to get right,
and I say that as someone who has single-stepped through the Axis2 code
in terror and helped soak-test Ant so that it doesn't leak even over
extended builds; but even there we draw the line at saying "this lets
you run forever". In fact, I think Apache should only allow people to
write classloader code if they have passed some special
malicious-classloader competence test devised by all the teams.

> You'll be limited in using native libraries; even if you use classloaders
> properly, native libs can be loaded only once.

Plus there's that mess called "endorsed libraries", and you have to
worry about native library leakage.

> You will have to ensure you get rid of the task classloader once the task is
> over (thus removing all singleton stuff that may be in it).

Which you can only do by getting rid of every single instance of every
single class... and that is very, very hard to do.
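
To illustrate why: a minimal sketch (the jar and task class are
hypothetical) of the classic way a per-task classloader gets pinned.
Any live object, static field, thread or ThreadLocal that still
references a class from that loader keeps the whole loader, and every
class it ever loaded, from being collected.

import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Illustration only: why "getting rid of the task classloader" is hard.
// example.TaskCode and its jar are hypothetical.
public class ClassLoaderLeakSketch {
    // A tracker-level cache that outlives the task...
    static final List<Object> TRACKER_CACHE = new ArrayList<Object>();

    public static void main(String[] args) throws Exception {
        URLClassLoader taskLoader =
                new URLClassLoader(new URL[] { new URL("file:///tmp/task-job.jar") });

        Class<?> taskClass = Class.forName("example.TaskCode", true, taskLoader);
        Object task = taskClass.newInstance();

        // The leak: anything the task registers in a long-lived structure
        // (a cache, a static singleton, a ThreadLocal, a shutdown hook...)
        // keeps a strong reference to taskClass, which references taskLoader,
        // which references every class it ever loaded.
        TRACKER_CACHE.add(task);

        taskLoader = null;  // the loader is still reachable via TRACKER_CACHE
        System.gc();        // so none of the task's classes can be unloaded
    }
}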

> 
> You will have to put in place a security manager for the code running out
> of the task classloader.
> 
> You'll end up doing something similar to the servlet containers' webapp
> classloading model, with the extra burden of hot-loading for each task run.
> In the end this may have an overhead similar to bootstrapping a JVM for the
> task; it should be measured to see what the time delta is and whether it is
> worth the effort.


It really comes down to start time and total memory footprint. If you 
can afford the startup delay and the memory, then separate processes 
give you best isolation and robustness.

FWIW, in SmartFrog, we let you deploy components (as of last week,
Hadoop server components) in their own processes, by declaring the
process name to use. These are pooled; you can deploy lots of things
into a single process, and once they are all terminated that process
halts itself and all is well... there's a single root process on every
box that can take others. Bringing up a nearly-empty child process in
advance is good for making later deployment of other stuff faster. One
issue here is always JVM options: should child processes have different
parameters (like max heap size) from the root process?

For long lived apps, deploying things into child processes is the most 
robust; it keeps the root process lighter. Deploying into the root 
process is better for debugging (you just start that one process), and 
for simplifying liveness. When the root process dies, everything inside 
is guaranteed 100% dead. But the child processes have to wait to
discover that they have lost their links to the parent in that process
(if they care about such things) and then start timing out themselves.

-steve

Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Alejandro Abdelnur <tu...@gmail.com>.
Yes you would have to do it with classloaders (not 'hello world' but not
'rocket science' either).

You'll be limited in using native libraries; even if you use classloaders
properly, native libs can be loaded only once.

You will have to ensure you get rid of the task classloader once the task is
over (thus removing all singleton stuff that may be in it).

You will have to put in place a security manager for the code running out
of the task classloader (see the sketch below).

You'll end up doing something similar to the servlet containers' webapp
classloading model, with the extra burden of hot-loading for each task run.
In the end this may have an overhead similar to bootstrapping a JVM for the
task; it should be measured to see what the time delta is and whether it is
worth the effort.
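
As a concrete example of the security manager point above, here is a
minimal sketch of a manager that, at the very least, stops in-JVM task
code from calling System.exit() and taking the whole task tracker down
with it. Illustrative only, not a complete policy:

import java.security.Permission;

// Forbid in-process task code from calling System.exit(), which would
// otherwise kill the entire task tracker JVM.
public class NoExitSecurityManager extends SecurityManager {
    @Override
    public void checkExit(int status) {
        throw new SecurityException(
                "System.exit(" + status + ") blocked for in-JVM task");
    }

    @Override
    public void checkPermission(Permission perm) {
        // Allow everything else in this sketch; a real policy would be
        // much stricter for untrusted task code.
    }

    public static void main(String[] args) {
        System.setSecurityManager(new NoExitSecurityManager());
        try {
            System.exit(1);  // what a misbehaving in-JVM task might do
        } catch (SecurityException expected) {
            System.out.println("Blocked: " + expected.getMessage());
        }
    }
}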

A


On Mon, Jun 2, 2008 at 3:53 PM, Steve Loughran <st...@apache.org> wrote:

> Christophe Taton wrote:
>
>> Actually Hadoop could be made more friendly to such realtime Map/Reduce
>> jobs.
>> For instance, we could consider running all tasks inside the task tracker
>> jvm as separate threads, which could be implemented as another personality
>> of the TaskRunner.
>> I have been looking into this a couple of weeks ago...
>> Would you be interested in such a feature?
>>
>
> Why does that have benefits? So that you can share stuff via local data
> structures? Because you'd better be sharing classloaders if you are going to
> play that game. And that is very hard to get right (to the extent that I
> don't think any Apache project other than Felix does it well)
>

Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Christophe Taton <ta...@apache.org>.
Hi Steve,

On Mon, Jun 2, 2008 at 12:23 PM, Steve Loughran <st...@apache.org> wrote:

> Christophe Taton wrote:
>
>> Actually Hadoop could be made more friendly to such realtime Map/Reduce
>> jobs.
>> For instance, we could consider running all tasks inside the task tracker
>> jvm as separate threads, which could be implemented as another personality
>> of the TaskRunner.
>> I have been looking into this a couple of weeks ago...
>> Would you be interested in such a feature?
>>
>
> Why does that have benefits? So that you can share stuff via local data
> structures? Because you'd better be sharing classloaders if you are going to
> play that game. And that is very hard to get right (to the extent that I
> don't think any Apache project other than Felix does it well)
>

The most obvious improvement to my mind concerns the memory footprint of the
infrastructure. Running jobs leads to at least 3 JVMs per machine (the data
node, the task tracker and the task), if you ignore parallelism and accept
running only one task per node at a time. This is problematic if you have
machines with low memory capacity.

That said, I agree with your concerns about classloading.
I have actually been thinking that we might try to rely on OSGi to do the
job, and package Hadoop daemons, jobs and tasks as OSGi bundles and
services; but I faced many tricky issues in doing that (the last one being
the resolution of configuration files by the classloaders).
To my mind, one short-term and minimal way of achieving this would be to use
a URLClassLoader in conjunction with the HDFS URLStreamHandler, to let the
task tracker run tasks directly...
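
A minimal sketch of that short-term approach (the jar location and task
class name are hypothetical, and loading the job jar straight from HDFS
would additionally require registering Hadoop's URL stream handler
factory so that hdfs:// URLs resolve):

import java.net.URL;
import java.net.URLClassLoader;

// Sketch only: run a task inside the task tracker JVM on its own thread,
// with its own classloader. The jar path and class name are hypothetical.
public class InJvmTaskLaunchSketch {

    public static void main(String[] args) throws Exception {
        // For an hdfs:// URL this would first need something like
        //   URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory());
        // Here we just use a local copy of the job jar.
        URL jobJar = new URL("file:///tmp/job.jar");

        final URLClassLoader taskLoader = new URLClassLoader(
                new URL[] { jobJar }, InJvmTaskLaunchSketch.class.getClassLoader());

        Thread taskThread = new Thread(new Runnable() {
            public void run() {
                try {
                    // Hypothetical task entry point implementing Runnable.
                    Class<?> taskClass =
                            Class.forName("example.MyTask", true, taskLoader);
                    ((Runnable) taskClass.newInstance()).run();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, "in-jvm-task");
        // Make sure the task sees its own classloader as the context loader.
        taskThread.setContextClassLoader(taskLoader);
        taskThread.start();
        taskThread.join();
        // Dropping every reference to taskLoader (and to everything loaded
        // from it) is what allows the task's classes to be unloaded later.
    }
}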

Cheers,
Christophe T.

Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Steve Loughran <st...@apache.org>.
Christophe Taton wrote:
> Actually Hadoop could be made more friendly to such realtime Map/Reduce
> jobs.
> For instance, we could consider running all tasks inside the task tracker
> jvm as separate threads, which could be implemented as another personality
> of the TaskRunner.
> I have been looking into this a couple of weeks ago...
> Would you be interested in such a feature?

Why does that have benefits? So that you can share stuff via local data 
structures? Because you'd better be sharing classloaders if you are 
going to play that game. And that is very hard to get right (to the 
extent that I don't think any Apache project other than Felix does it well)

Re: other implementations of TaskRunner

Posted by Martin Jaggi <m....@gmail.com>.
That would indeed be a nice idea: there could be other
implementations of TaskRunner suited for special hardware, or for
in-memory systems.

But if the communication remains the same (HDFS with disk access),
this would not necessarily make things faster in the shuffling phase,
etc.


On 01.06.2008 at 10:26, Christophe Taton wrote:

> Actually Hadoop could be made more friendly to such realtime Map/ 
> Reduce
> jobs.
> For instance, we could consider running all tasks inside the task  
> tracker
> jvm as separate threads, which could be implemented as another  
> personality
> of the TaskRunner.
> I have been looking into this a couple of weeks ago...
> Would you be interested in such a feature?
>
> Christophe T.
>
>
> On Sun, Jun 1, 2008 at 10:08 AM, Ted Dunning <te...@gmail.com>  
> wrote:
>
>> Hadoop is highly optimized towards handling datasets that are much  
>> too large
>> to fit into memory.  That means that there are many trade-offs that  
>> have
>> been made that make it much less useful for very short jobs or jobs  
>> that
>> would fit into memory easily.
>>
>> Multi-core implementations of map-reduce are very interesting for a  
>> number
>> of applications as are in-memory implementations for distributed
>> architectures.  I don't think that anybody really knows yet how  
>> well these
>> other implementations will play with Hadoop.  The regimes that they  
>> are
>> designed to optimize are very different in terms of data scale,  
>> number of
>> machines and networking speed.  All of these constraints drive the  
>> design
>> in innumerable ways.
>>
>> On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com>  
>> wrote:
>>
>>> Concerning real-time Map Reduce within (and not only between)  
>>> machines
>>> (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
>>>
>>> I'm really interested in very fast Map Reduce tasks, i.e. without  
>>> much disk
>>> access. With the rise of multi-core systems, this could get more  
>>> and more
>>> interesting, and could maybe even lead to something like 'super- 
>>> computing
>>> for everyone', or is that a bit overwhelming? Anyway I was nicely  
>>> surprised
>>> to see the recent Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/ 
>>>  )
>>> implementation of Map Reduce for multi-core CPUs (they won the  
>>> best paper
>>> award at HPCA'07).
>>>
>>> Recently also GPU computing was in the news again, pushed by  
>>> Nvidia (check
>>> CUDA  http://www.nvidia.com/object/cuda_showcase.html ), and now  
>>> also
>>> there a Map Reduce implementation called Mars became available:
>>> http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
>>> The Mars people say at the end of their paper "We are also
>>> interested in
>>> integrating Mars into the existing Map Reduce implementations such  
>>> as Hadoop
>>> so that the Map Reduce framework can take the advantage of the  
>>> parallelism
>>> among different machines as well as the parallelism within each  
>>> machine."
>>>
>>> What do you think of this, especially about the multi-core  
>>> approach? Do you
>>> think these needs are already served by the current  
>>> InMemoryFileSystem of
>>> Hadoop or not? Are there any plans of 'integrating' one of the two  
>>> above
>>> frameworks?
>>> Or would it already be done by improving the significant  
>>> intermediate data
>>> pairs overhead (https://issues.apache.org/jira/browse/HADOOP-3366 )?
>>>
>>> Any comments?


Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Edward Capriolo <ed...@gmail.com>.
I think that feature makes sense, because starting a JVM has overhead.

On Sun, Jun 1, 2008 at 4:26 AM, Christophe Taton <ta...@apache.org> wrote:
> Actually Hadoop could be made more friendly to such realtime Map/Reduce
> jobs.
> For instance, we could consider running all tasks inside the task tracker
> jvm as separate threads, which could be implemented as another personality
> of the TaskRunner.
> I have been looking into this a couple of weeks ago...
> Would you be interested in such a feature?
>
> Christophe T.
>
>
> On Sun, Jun 1, 2008 at 10:08 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Hadoop is highly optimized towards handling datasets that are much too
>> large
>> to fit into memory.  That means that there are many trade-offs that have
>> been made that make it much less useful for very short jobs or jobs that
>> would fit into memory easily.
>>
>> Multi-core implementations of map-reduce are very interesting for a number
>> of applications as are in-memory implementations for distributed
>> architectures.  I don't think that anybody really knows yet how well these
>> other implementations will play with Hadoop.  The regimes that they are
>> designed to optimize are very different in terms of data scale, number of
>> machines and networking speed.  All of these constraints drive the design
>> in
>> innumerable ways.
>>
>> On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com> wrote:
>>
>> > Concerning real-time Map Reduce within (and not only between) machines
>> > (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
>> >
>> > I'm really interested in very fast Map Reduce tasks, i.e. without much
>> disk
>> > access. With the rise of multi-core systems, this could get more and more
>> > interesting, and could maybe even lead to something like 'super-computing
>> > for everyone', or is that a bit overwhelming? Anyway I was nicely
>> surprised
>> > to see the recent Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/ )
>> > implementation of Map Reduce for multi-core CPUs (they won the best paper
>> > award at HPCA'07).
>> >
>> > Recently also GPU computing was in the news again, pushed by Nvidia
>> (check
>> > CUDA  http://www.nvidia.com/object/cuda_showcase.html ), and now also
>> > there a Map Reduce implementation called Mars became available:
>> > http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
>> > The Mars people say at the end of their paper "We are also interested in
>> > integrating Mars into the existing Map Reduce implementations such as
>> Hadoop
>> > so that the Map Reduce framework can take the advantage of the
>> parallelism
>> > among different machines as well as the parallelism within each machine."
>> >
>> > What do you think of this, especially about the multi-core approach? Do
>> you
>> > think these needs are already served by the current InMemoryFileSystem of
>> > Hadoop or not? Are there any plans of 'integrating' one of the two above
>> > frameworks?
>> > Or would it already be done by improving the significant intermediate
>> data
>> > pairs overhead (https://issues.apache.org/jira/browse/HADOOP-3366 )?
>> >
>> > Any comments?
>> >
>>
>

Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Christophe Taton <ta...@apache.org>.
Actually Hadoop could be made more friendly to such realtime Map/Reduce
jobs.
For instance, we could consider running all tasks inside the task tracker
jvm as separate threads, which could be implemented as another personality
of the TaskRunner.
I have been looking into this a couple of weeks ago...
Would you be interested in such a feature?

Christophe T.


On Sun, Jun 1, 2008 at 10:08 AM, Ted Dunning <te...@gmail.com> wrote:

> Hadoop is highly optimized towards handling datasets that are much too
> large
> to fit into memory.  That means that there are many trade-offs that have
> been made that make it much less useful for very short jobs or jobs that
> would fit into memory easily.
>
> Multi-core implementations of map-reduce are very interesting for a number
> of applications as are in-memory implementations for distributed
> architectures.  I don't think that anybody really knows yet how well these
> other implementations will play with Hadoop.  The regimes that they are
> designed to optimize are very different in terms of data scale, number of
> machines and networking speed.  All of these constraints drive the design
> in
> innumerable ways.
>
> On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com> wrote:
>
> > Concerning real-time Map Reduce within (and not only between) machines
> > (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
> >
> > I'm really interested in very fast Map Reduce tasks, i.e. without much
> disk
> > access. With the rise of multi-core systems, this could get more and more
> > interesting, and could maybe even lead to something like 'super-computing
> > for everyone', or is that a bit overwhelming? Anyway I was nicely
> surprised
> > to see the recent Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/ )
> > implementation of Map Reduce for multi-core CPUs (they won the best paper
> > award at HPCA'07).
> >
> > Recently also GPU computing was in the news again, pushed by Nvidia
> (check
> > CUDA  http://www.nvidia.com/object/cuda_showcase.html ), and now also
> > there a Map Reduce implementation called Mars became available:
> > http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
> > The Mars people say at the end of their paper "We are also interested in
> > integrating Mars into the existing Map Reduce implementations such as
> Hadoop
> > so that the Map Reduce framework can take the advantage of the
> parallelism
> > among different machines as well as the parallelism within each machine."
> >
> > What do you think of this, especially about the multi-core approach? Do
> you
> > think these needs are already served by the current InMemoryFileSystem of
> > Hadoop or not? Are there any plans of 'integrating' one of the two above
> > frameworks?
> > Or would it already be done by improving the significant intermediate
> data
> > pairs overhead (https://issues.apache.org/jira/browse/HADOOP-3366 )?
> >
> > Any comments?
> >
>

Re: Realtime Map Reduce = Supercomputing for the Masses?

Posted by Ted Dunning <te...@gmail.com>.
Hadoop is highly optimized towards handling datasets that are much too large
to fit into memory.  That means that there are many trade-offs that have
been made that make it much less useful for very short jobs or jobs that
would fit into memory easily.

Multi-core implementations of map-reduce are very interesting for a number
of applications as are in-memory implementations for distributed
architectures.  I don't think that anybody really knows yet how well these
other implementations will play with Hadoop.  The regimes that they are
designed to optimize are very different in terms of data scale, number of
machines and networking speed.  All of these constraints drive the design in
innumerable ways.

On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com> wrote:

> Concerning real-time Map Reduce within (and not only between) machines
> (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
>
> I'm really interested in very fast Map Reduce tasks, i.e. without much disk
> access. With the rise of multi-core systems, this could get more and more
> interesting, and could maybe even lead to something like 'super-computing
> for everyone', or is that a bit overwhelming? Anyway I was nicely surprised
> to see the recent Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/ )
> implementation of Map Reduce for multi-core CPUs (they won the best paper
> award at HPCA'07).
>
> Recently also GPU computing was in the news again, pushed by Nvidia (check
> CUDA  http://www.nvidia.com/object/cuda_showcase.html ), and now also
> there a Map Reduce implementation called Mars became available:
> http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
> The Mars people say at the end of their paper "We are also interested in
> integrating Mars into the existing Map Reduce implementations such as Hadoop
> so that the Map Reduce framework can take the advantage of the parallelism
> among different machines as well as the parallelism within each machine."
>
> What do you think of this, especially about the multi-core approach? Do you
> think these needs are already served by the current InMemoryFileSystem of
> Hadoop or not? Are there any plans of 'integrating' one of the two above
> frameworks?
> Or would it already be done by improving the significant intermediate data
> pairs overhead (https://issues.apache.org/jira/browse/HADOOP-3366 )?
>
> Any comments?
>



-- 
ted