Posted to mapreduce-dev@hadoop.apache.org by Arun C Murthy <ac...@hortonworks.com> on 2011/09/09 12:43:28 UTC

Re: Research projects for hadoop

Saikat,

 As Robert pointed out, performance is a primary criterion - maybe you can come back with benchmarks? Try sorts with >100G data.

 Also, MRv2 makes it easy to experiment with these; you might want to try that.

Arun

On Sep 9, 2011, at 10:34 AM, Saikat Kanjilal wrote:

> 
> How about using VirtualBox with 64-bit CentOS as a Linux container for isolating map/reduce processes?  I have set this up in the past; it's really easy.
> 
> 
>> From: evans@yahoo-inc.com
>> To: mapreduce-dev@hadoop.apache.org
>> Date: Fri, 9 Sep 2011 10:30:37 -0700
>> Subject: Re: Research projects for hadoop
>> 
>> The biggest issue with Xen and other virtualization technologies is that there is often an I/O penalty involved in using them.  For many jobs this is not an acceptable trade-off.  I do know, however, that there has been some discussion about using Linux Containers to isolate Map/Reduce processes.  I don't know whether a JIRA has been filed for it, but they are much lighter weight than Xen and other virtualization tech, because they are concerned only with resource isolation, not with virtualizing an entire operating system.
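Bobby's point about weight can be made concrete. Below is a minimal sketch, assuming a POSIX shell is available, of launching a child task under a per-process virtual-memory ulimit, the shell-level precursor to the container/cgroup isolation being discussed. The class and method names here are hypothetical illustrations, not part of any Hadoop API.

```java
import java.io.IOException;

public class LimitedTaskLauncher {
    // Hypothetical sketch: run a child task under a virtual-memory cap.
    // ulimit applies to the wrapper shell and is inherited by the
    // exec'd command, so the task cannot exceed the address-space limit.
    static Process launchWithVmemLimitKb(long limitKb, String command)
            throws IOException {
        String wrapped = "ulimit -v " + limitKb + "; exec " + command;
        return new ProcessBuilder("bash", "-c", wrapped)
                .inheritIO()
                .start();
    }

    public static void main(String[] args) throws Exception {
        // Run a trivial command under a ~1 GB address-space cap.
        Process p = launchWithVmemLimitKb(1_048_576, "echo task-ran");
        p.waitFor();
    }
}
```

Unlike cgroups or Linux Containers, this only caps one resource per process and offers no CPU or I/O isolation, which is exactly the gap containers fill.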
>> 
>> --Bobby Evans
>> 
>> On 9/9/11 10:58 AM, "Saikat Kanjilal" <sx...@hotmail.com> wrote:
>> 
>> 
>> 
>> Hi Folks,
>> 
>> I was looking through the following wiki page: http://wiki.apache.org/hadoop/HadoopResearchProjects and was wondering if there's been any work done (or any interest in doing work) on the following topics:
>> Integration of Virtualization (such as Xen) with Hadoop tools
>> 
>> How does one integrate sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the Hadoop framework? How does this interact with SGE, Torque, Condor? As each individual machine has more and more cores/CPUs, it makes sense to partition each machine into multiple virtual machines. That gives us a number of benefits:
>> - By assigning a virtual machine to a datanode, we effectively isolate the datanode from the load on the machine caused by other processes, making the datanode more responsive/reliable.
>> - With multiple virtual machines on each machine, we can lower the granularity of HOD scheduling units, making it possible to schedule multiple tasktrackers on the same machine and improving the overall utilization of the whole cluster.
>> - With virtualization, we can easily snapshot a virtual cluster before releasing it, making it possible to re-activate the same cluster in the future and work from the snapshot.
>> 
>> Provisioning of long-running services via HOD
>> 
>> Work on a computation model for services on the grid. The model would include:
>> - Various tools for defining clients and servers of the service, and at the least C++ and Java instantiations of the abstractions
>> - Logical definitions of how to partition work onto a set of servers, i.e. a generalized shard implementation
>> - A few useful abstractions like locks (exclusive and RW, fairness), leader election, and transactions
>> - Various communication models for groups of servers belonging to a service, such as broadcast, unicast, etc.
>> - Tools for assuring QoS, reliability, managing pools of servers for a service with spares, etc.
>> - Integration with HDFS for persistence, as well as access to local filesystems
>> - Integration with ZooKeeper so that applications can use the namespace
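The "generalized shard implementation" idea above is commonly realized with consistent hashing, so that adding or removing a server remaps only the keys near it on the ring rather than reshuffling everything. A minimal sketch follows; the class and method names are hypothetical, and a real implementation would use a stronger hash than the JDK's.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ShardRing {
    // Map keys onto a ring of server shards via consistent hashing.
    // Each server owns several virtual nodes to smooth the distribution.
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int VNODES = 16;

    private static int hash(String s) {
        // Spread the JDK hash a little and force it non-negative;
        // production code would use MD5 or MurmurHash instead.
        int h = s.hashCode();
        h ^= (h >>> 16);
        return h & 0x7fffffff;
    }

    public void addServer(String server) {
        for (int i = 0; i < VNODES; i++)
            ring.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < VNODES; i++)
            ring.remove(hash(server + "#" + i));
    }

    public String serverFor(String key) {
        // First ring entry at or after the key's hash, wrapping around.
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        Integer slot = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(slot);
    }
}
```

In a service framework like the one sketched above, the ring membership itself would typically live in ZooKeeper so that all clients agree on the current shard assignment.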
>> I would like to help out with either a design for the above or prototype code; please let me know what the process would be to move forward with this.
>> Regards
>> 
>