Posted to common-user@hadoop.apache.org by Martin Jaggi <m....@gmail.com> on 2008/06/01 14:12:23 UTC

Re: other implementations of TaskRunner

That would indeed be a nice idea: there could be other implementations
of TaskRunner suited to special hardware or to in-memory systems.

But if the communication remains the same (HDFS with disk access),
this would not necessarily make things like the shuffle phase any
faster.
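
To make Christophe's suggestion below a bit more concrete, here is a
minimal sketch of what such a pluggable execution "personality" could
look like. The names TaskExecutor and ThreadedTaskExecutor are made up
purely for illustration; this is not Hadoop's actual TaskRunner API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical seam: how the TaskTracker launches a task. Hadoop's
    // real TaskRunner forks a separate child JVM per task; an in-memory
    // personality could run tasks as threads inside the TaskTracker JVM.
    interface TaskExecutor {
        void launch(Runnable task);
        void shutdown() throws InterruptedException;
    }

    // Thread-based personality: no fork/exec, no per-task JVM startup
    // cost, and tasks can share in-memory state. The trade-off is
    // isolation: a runaway task can take the whole TaskTracker down.
    class ThreadedTaskExecutor implements TaskExecutor {
        private final ExecutorService pool;

        ThreadedTaskExecutor(int slots) {
            pool = Executors.newFixedThreadPool(slots);
        }

        public void launch(Runnable task) {
            pool.submit(task);
        }

        public void shutdown() throws InterruptedException {
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
        }
    }

Swapping in such a personality would remove the per-task JVM fork, but
the shuffle path above would be untouched.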


On 01.06.2008, at 10:26, Christophe Taton wrote:

> Actually, Hadoop could be made more friendly to such real-time
> Map/Reduce jobs. For instance, we could consider running all tasks
> inside the TaskTracker JVM as separate threads, which could be
> implemented as another personality of the TaskRunner.
> I looked into this a couple of weeks ago...
> Would you be interested in such a feature?
>
> Christophe T.
>
>
> On Sun, Jun 1, 2008 at 10:08 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Hadoop is highly optimized towards handling datasets that are much
>> too large to fit into memory. That means that many trade-offs have
>> been made which make it much less useful for very short jobs or for
>> jobs that would fit into memory easily.
>>
>> Multi-core implementations of map-reduce are very interesting for a
>> number of applications, as are in-memory implementations for
>> distributed architectures. I don't think that anybody really knows
>> yet how well these other implementations will play with Hadoop. The
>> regimes that they are designed to optimize are very different in
>> terms of data scale, number of machines, and networking speed. All
>> of these constraints drive the design in innumerable ways.
>>
>> On Sat, May 31, 2008 at 7:51 PM, Martin Jaggi <m....@gmail.com> wrote:
>>
>>> Concerning real-time Map Reduce within (and not only between)
>>> machines (multi-core & GPU), e.g. the Phoenix and Mars frameworks:
>>>
>>> I'm really interested in very fast Map Reduce tasks, i.e. without
>>> much disk access. With the rise of multi-core systems, this could
>>> get more and more interesting, and could maybe even lead to
>>> something like 'super-computing for everyone', or is that a bit
>>> overblown? Anyway, I was pleasantly surprised to see the recent
>>> Phoenix (http://csl.stanford.edu/~christos/sw/phoenix/)
>>> implementation of Map Reduce for multi-core CPUs (they won the
>>> best paper award at HPCA'07).
>>>
>>> Recently GPU computing was in the news again too, pushed by Nvidia
>>> (check out CUDA: http://www.nvidia.com/object/cuda_showcase.html),
>>> and now a Map Reduce implementation called Mars has become
>>> available there as well: http://www.cse.ust.hk/gpuqp/Mars_tr.pdf
>>> The Mars people say at the end of their paper: "We are also
>>> interested in integrating Mars into the existing Map Reduce
>>> implementations such as Hadoop so that the Map Reduce framework
>>> can take the advantage of the parallelism among different machines
>>> as well as the parallelism within each machine."
>>>
>>> What do you think of this, especially about the multi-core
>>> approach? Do you think these needs are already served by the
>>> current InMemoryFileSystem of Hadoop, or not? Are there any plans
>>> to 'integrate' one of the two frameworks above?
>>> Or would it already be addressed by reducing the significant
>>> overhead of intermediate data pairs
>>> (https://issues.apache.org/jira/browse/HADOOP-3366)?
>>>
>>> Any comments?
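
To illustrate what I meant by Map Reduce within a machine: below is a
rough, Phoenix-style sketch of an entirely in-memory word count in
plain Java. All names are mine and purely illustrative; the thread
pool plays the role of the map workers, a shared concurrent map plays
the role of the shuffle, and no disk is touched at all:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Phoenix-style sketch: map-reduce within one machine, entirely
    // in memory, with one map task per pooled thread.
    public class InMemoryWordCount {

        public static Map<String, Long> count(List<String> lines)
                throws InterruptedException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // The "shuffle" is a shared concurrent map; combining is an
            // atomic increment instead of a sort/spill/merge over disk.
            final ConcurrentHashMap<String, AtomicLong> counts =
                    new ConcurrentHashMap<String, AtomicLong>();

            // Map phase: each input line is one task; the emitted
            // (word, 1) pairs never leave the heap.
            for (final String line : lines) {
                pool.submit(new Runnable() {
                    public void run() {
                        for (String word : line.split("\\s+")) {
                            if (word.length() == 0) continue;
                            AtomicLong c = counts.get(word);
                            if (c == null) {
                                AtomicLong fresh = new AtomicLong();
                                c = counts.putIfAbsent(word, fresh);
                                if (c == null) c = fresh;
                            }
                            c.incrementAndGet();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);

            // "Reduce" just materializes the combined counts.
            Map<String, Long> result = new HashMap<String, Long>();
            for (Map.Entry<String, AtomicLong> e : counts.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        }
    }

Of course the constraints here are completely different from Hadoop's
(shared memory instead of a network, data must fit in RAM, no fault
tolerance), which is exactly the design tension Ted describes above.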