Posted to common-dev@hadoop.apache.org by Eric Baldeschwieler <er...@yahoo-inc.com> on 2007/09/19 21:54:19 UTC

Re: HELP: Need a list of Grid systems research projects

Hi Folks,

In a couple of different contexts I've been asked for a list of projects that we would like groups who want to participate in Hadoop to think about tackling.  I've been asked for research project ideas, engineering ideas for new participants, and areas where domain experts from other fields might add a lot of value by bringing their perspective into the Hadoop discussion.

Below I've included a quick first pass I made at such topics.

I'd love anyone and everyone's input.  Please respond.  I'll  
aggregate and hopefully eventually post it on the wiki.

E14


---+ Modeling of block placement and replication policies in HDFS

    * Modeling of the expected time to data loss for a given HDFS cluster, given Hadoop's replication policy and protocols (a back-of-envelope sketch follows below).

    * Modeling of erasure codes and other approaches to replication that offer different space-performance-reliability tradeoffs.
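
As a starting point for the first item, here is a very crude back-of-envelope estimate using the standard Markov-chain approximation for 3-way replication, MTTDL ~ MTTF^3 / (6 * B * MTTR^2).  It assumes independent node failures and ignores rack-aware placement, so the inputs below are made-up illustrations, not measurements:

    // Back-of-envelope MTTDL estimate for 3-way replication, using the
    // standard Markov-chain approximation MTTF^3 / (6 * B * MTTR^2).
    // Assumes independent node failures and ignores rack-aware placement;
    // all inputs are assumed values for illustration only.
    public class MttdlEstimate {
        public static void main(String[] args) {
            double mttfHours = 3.0 * 365 * 24; // assumed mean time to node failure
            double mttrHours = 10.0;           // assumed time to re-replicate a lost replica
            double blocks = 1.0e7;             // assumed independent block groups
            double mttdl = Math.pow(mttfHours, 3)
                / (6.0 * blocks * mttrHours * mttrHours);
            System.out.printf("Estimated MTTDL: %.0f hours (%.1f days)%n",
                              mttdl, mttdl / 24.0);
        }
    }

Even this toy model makes the leverage points visible: MTTDL falls linearly with block count but quadratically with repair time, which is why re-replication speed matters so much.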


---+ HDFS Namespace Expansion

Prototyping approaches to scaling the HDFS namespace.  Goals: keep it simple; preserve or increase metadata operations per second; support very large numbers of files (billions to trillions) and blocks.
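
One strawman, purely to make the problem concrete (all names here are hypothetical, and this is only one of several possible approaches): statically partition the namespace across several namespace volumes by hashing the parent directory, so a directory's entries stay together:

    // Strawman namespace partitioning: route each path to one of k
    // namespace volumes by hashing its parent directory.  Everything
    // here (including the class name) is hypothetical.
    public class NamespaceRouter {
        private final int numVolumes;

        public NamespaceRouter(int numVolumes) {
            this.numVolumes = numVolumes;
        }

        // Pick the namespace volume responsible for a path.  Hashing the
        // parent keeps a directory's entries on one volume, so readdir
        // and rename-within-a-directory stay local.
        public int volumeFor(String path) {
            int slash = path.lastIndexOf('/');
            String parent = (slash <= 0) ? "/" : path.substring(0, slash);
            return (parent.hashCode() & 0x7fffffff) % numVolumes;
        }
    }

The obvious weakness is that renames across directories now span volumes, which is exactly the kind of tradeoff a prototype should measure.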

---+ Hadoop Security Design

An end-to-end proposal for how to support authentication and client-side data encryption/decryption, so that large data sets can be stored in a public HDFS and only jobs launched by authenticated users can map-reduce or browse the data.  See HADOOP-xxx
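
To illustrate just the data path (not the hard part, which is key management and its binding to authenticated users), a minimal sketch of client-side encryption wrapping an HDFS output stream.  A real design would also choose a cipher mode and IV scheme explicitly rather than take the JCE default used here:

    // Sketch of the client-side encryption data path: wrap the stream
    // HDFS hands back in a CipherOutputStream so plaintext never leaves
    // the client.  Key management -- who holds the key and how it maps
    // to authenticated users -- is the actual design problem.
    import java.io.OutputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.SecretKeySpec;

    public class EncryptingWriter {
        // key must be a valid AES key length (e.g. 16 bytes)
        public static OutputStream wrap(OutputStream hdfsOut, byte[] key)
                throws Exception {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            return new CipherOutputStream(hdfsOut, cipher);
        }
    }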

---+ HOD ports to various campus work-queueing systems.

HOD currently supports Torque and has previously supported Condor.  We would like ports to whichever system(s) are used on major campuses (SGE, ...).

---+ Integration of Virtualization (such as Xen) with Hadoop tools

    * How does one integrate sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the Hadoop framework?  How does this interact with SGE, Torque, Condor? -- Eric14

    * As each individual machine has more and more cores/CPUs, it makes sense to partition each machine into multiple virtual machines.  That gives us a number of benefits:

       * By assigning a virtual machine to a datanode, we effectively isolate the datanode from the load on the machine caused by other processes, making the datanode more responsive and reliable.

       * With multiple virtual machines on each machine, we can lower the granularity of HOD scheduling units, making it possible to schedule multiple tasktrackers on the same machine, improving the overall utilization of the whole cluster.

       * With virtualization, we can easily snapshot a virtual cluster before releasing it, making it possible to re-activate the same cluster in the future and resume work from the snapshot.

       * -- Runping Qi

---+ Provisioning of long-running services via HOD

Work on a computation model for services on the grid.  The model would include:

    * Various tools for defining clients and servers of the service, and at the least a C++ and Java instantiation of the abstractions
    * Logical definitions of how to partition work onto a set of servers, i.e. a generalized shard implementation (see the interface sketch after this list)
    * A few useful abstractions like locks (exclusive and RW, fairness), leader election, transactions
    * Various communication models for groups of servers belonging to a service, such as broadcast, unicast, etc.
    * Tools for assuring QoS, reliability, managing pools of servers for a service with spares, etc.
    * Integration with HDFS for persistence, as well as access to local filesystems
    * Integration with ZooKeeper so that applications can use the namespace
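
For the shard item above, a hypothetical Java interface, just to suggest the shape of the abstraction (every name here is illustrative, not a proposal):

    // Hypothetical shard abstraction for grid services: a logical
    // keyspace mapped onto a set of servers.  All names are illustrative.
    import java.util.List;
    import java.util.Set;

    public interface ShardMap {
        interface ServerId {}  // opaque server identity
        interface Shard {}     // a slice of the service's keyspace

        // Which server currently owns the given key.
        ServerId ownerOf(byte[] key);

        // All shards assigned to a server, for startup and recovery.
        List<Shard> shardsOf(ServerId server);

        // Rebalance when membership changes (e.g. as reported by ZooKeeper).
        void onMembershipChange(Set<ServerId> liveServers);
    }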

---+ A Hadoop-compatible framework for discovering network topology and identifying and diagnosing hardware that is not functioning correctly.

---+ An improved framework for debugging and performance-optimizing Hadoop and streaming Hadoop jobs.

Some suggestions:

*   A distributed profiler for measuring distributed map-reduce applications.  This would be really helpful for grid users.  It should be able to provide standard profiler features, e.g. the number of times a method is executed, time of execution, number of times a method caused some kind of failure, etc.; maybe accumulated over all instances of the tasks that comprised that application (a counter-based sketch follows below).   -- Dhruba Borthakur
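
A crude approximation of this is already possible with the framework's counters, which get aggregated across all task instances of a job.  The sketch below is written from memory against the current mapred API, so exact signatures may need adjusting; a real distributed profiler would automate this instrumentation rather than ask users to hand-roll it:

    // Approximating per-method profiling with the framework's existing
    // counters.  The enum names and the "parse" method are placeholders.
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ProfiledMap extends MapReduceBase implements Mapper {

        enum Prof { PARSE_CALLS, PARSE_MILLIS, PARSE_FAILURES }

        public void map(WritableComparable key, Writable value,
                        OutputCollector out, Reporter reporter)
                throws IOException {
            long start = System.currentTimeMillis();
            try {
                // ... the user method being "profiled" runs here ...
                reporter.incrCounter(Prof.PARSE_CALLS, 1);
            } catch (RuntimeException e) {
                reporter.incrCounter(Prof.PARSE_FAILURES, 1);
            } finally {
                reporter.incrCounter(Prof.PARSE_MILLIS,
                                     System.currentTimeMillis() - start);
            }
        }
    }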

---+ Pig features.

Thoughts?

---+ Map-reduce performance enhancements.

How can we improve Hadoop's performance on the standard sort benchmarks?


---+ Sort and shuffle optimization in MR framework

Some example directions:

    * Memory-based shuffling in MR framework

    * Combining the results of several maps on a rack or node before the shuffle.  This can reduce seek work and intermediate storage.  (Today's per-map combiner, sketched below, is the mechanism this would generalize.)
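
For reference, the existing per-map mechanism that a rack- or node-level combine would generalize, written from memory against the current mapred API (exact signatures may need adjusting): a combiner that merges partial sums before anything hits the network.

    // A per-map combiner: partial sums are merged before they leave the
    // map task.  Wire it in with job.setCombinerClass(SumCombiner.class).
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumCombiner extends MapReduceBase implements Reducer {
        public void reduce(WritableComparable key, Iterator values,
                           OutputCollector out, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += ((LongWritable) values.next()).get();
            }
            out.collect(key, new LongWritable(sum)); // one record per key per map
        }
    }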


---+ Work load characterization from various Hadoop sites

A framework for capturing workload statistics and replaying them in simulation, to allow the assessment of framework improvements (a strawman trace record follows below).
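
A strawman for the kind of per-job record such a framework might capture and a simulator replay; the field choice is illustrative only, not a schema proposal:

    // Strawman per-job trace record.  Fields are illustrative.
    public class JobTraceRecord {
        public long submitTimeMs;  // when the job entered the system
        public int  numMaps;       // map-side parallelism
        public int  numReduces;    // reduce-side parallelism
        public long inputBytes;    // total input read
        public long shuffleBytes;  // map output moved in the shuffle
        public long outputBytes;   // total output written
    }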


---+ Other ideas on how to improve the framework's performance or stability


---+ Benchmark suite for Data Intensive Supercomputing:

Scientific computation research and software has benefited tremendously from the availability of benchmark suites such as the NAS Parallel Benchmarks.  This was a suite of 7 benchmarks, ranging from EP (embarrassingly parallel) to SP, BT, and LU (reflecting varying degrees of parallelism and communication patterns).

A suite of data-intensive supercomputing application benchmarks would present a target that Hadoop (and other map-reduce implementations) should be optimized for.

The cool thing about the NAS Parallel Benchmarks was that their specification was paper-and-pencil, rather than code (as with the SPEC benchmarks).  This led to the development of different codes in different languages and programming paradigms, all implementing the paper-and-pencil specification.

    * -- Milind Bhandarkar

---+ Performance evaluation of existing Locality Sensitive Hashing  
schemes

Research on new hashing schemes for filesystem namespace partitioning.

http://en.wikipedia.org/wiki/Locality_sensitive_hashing

    * -- Milind Bhandarkar
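
To make the notion concrete, here is a minimal locality-sensitive hash (bit sampling for Hamming distance): keys that differ in few bits collide with high probability.  The class name and parameters are illustrative; whether such schemes actually help partition a filesystem namespace is exactly the open question above.

    // Minimal LSH: sample a fixed set of bit positions; the hash is the
    // concatenation of the sampled bits, so similar keys tend to collide.
    import java.util.Random;

    public class BitSampleLsh {
        private final int[] positions;  // sampled bit positions (<= 31 of them)

        public BitSampleLsh(int keyBits, int samples, long seed) {
            Random rnd = new Random(seed);
            positions = new int[samples];
            for (int i = 0; i < samples; i++) {
                positions[i] = rnd.nextInt(keyBits);
            }
        }

        // Concatenate the sampled bits of the key into the hash value.
        public int hash(long key) {
            int h = 0;
            for (int p : positions) {
                h = (h << 1) | (int) ((key >>> p) & 1);
            }
            return h;
        }
    }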

---+ An alternate view of files as a collection of blocks

Propose an API and sample use cases for a file as a repository of blocks, where a user can add and delete blocks at arbitrary parts of a file.  This would allow holes in files and moving blocks from one file to another.  How does this reconcile with the sequence-of-bytes view of a file?

Such an approach may encourage new styles of applications.
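
A strawman of what such an API might look like, purely to make the discussion concrete (all names hypothetical):

    // Strawman API for the file-as-a-collection-of-blocks view.  The
    // point is the shape of the operations, including the holes a
    // sequence-of-bytes view cannot express.
    public interface BlockFile {
        long blockSize();

        // Insert a block at the given block index; indices in between
        // may remain holes.
        void addBlock(long blockIndex, byte[] data);

        // Delete a block, leaving a hole.
        void deleteBlock(long blockIndex);

        // Move a block into another file without copying its bytes.
        void moveBlockTo(long blockIndex, BlockFile dest, long destIndex);

        // Returns null for a hole -- the reconciliation problem with the
        // sequence-of-bytes view in a nutshell.
        byte[] readBlock(long blockIndex);
    }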

To push a bit more in a research direction: UNIX file systems are managed as a sequence of bytes but usually (and in Hadoop's case exclusively) used as a sequence of records.  If the filesystem participates in the record management (as mainframes do, for example) you can get some nice semantic and performance improvements.

    * -- Benjamin Reed