Posted to common-dev@hadoop.apache.org by Steve Loughran <st...@apache.org> on 2009/09/01 12:28:52 UTC

Re: Contributing to HDFS - Distributed Computing

Hrishikesh Mantri wrote:
> Hi All,
> 
> I am a Master's student in CS. We are a group of two looking to add some 
> additional features to HDFS as part of a Distributed Computing course project. 
> Could someone please give us some pointers as to which direction we should 
> take so that our work can benefit the Hadoop community?
> 
> Regards,
> Hrishi
> 

I have some thoughts here:
http://www.slideshare.net/steve_l/hadoop-and-universities

* I would recommend steering clear of the big HA problem because, while 
it is the big issue with HDFS, it's the one that someone may set an 
entire engineering team to solving, at which point your work is going 
to have a hard time surviving.

* It might also be interesting to find some potential in-university 
users of Hadoop, and work on their use cases.

* What's your timescale and location? It would be good if there were other 
Hadoop developers nearby, to give you a bit of in-Apache mentorship.

* Don't forget the tests. Apache code is very test-centric. One key 
benefit of working with an OSS project is that your code gets used, but 
it does mean you need to embrace the community's test/development 
process, which means JUnit tests for everything.



-Steve
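
To make the testing point above concrete: a contribution such as a new
placement policy would be expected to arrive with unit tests. A minimal
sketch in JUnit 4 style (which the Hadoop tree uses) follows; the
chooseTargets method here is a trivial invented stand-in for whatever
policy logic you would actually write, not a real HDFS method.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class TestPlacementSketch {

    /** Trivial invented stand-in for the policy under test. */
    static List<String> chooseTargets(List<String> liveNodes, int replicas) {
        return liveNodes.subList(0, Math.min(replicas, liveNodes.size()));
    }

    @Test
    public void choosesRequestedNumberOfReplicas() {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
        assertEquals(3, chooseTargets(nodes, 3).size());
    }

    @Test
    public void neverChoosesMoreNodesThanExist() {
        // A sane policy degrades gracefully when the cluster is small.
        List<String> nodes = Arrays.asList("dn1", "dn2");
        assertEquals(2, chooseTargets(nodes, 3).size());
    }
}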


Re: Contributing to HDFS - Distributed Computing

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi Brian,

That is a good idea. Other block placement algorithms to try (using
HDFS-385) would be: placing blocks using a heat-map topology of the data
center; using a dynamic network topology (based on measured network
performance instead of the static network topology that HDFS currently
uses); or simulating a new network topology that reduces the need for
expensive network switches in the data center; etc.

thanks,
dhruba
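
As a rough, self-contained illustration of the "dynamic network
topology" idea above (ranking candidate datanodes by recently measured
transfer performance instead of the static rack map), a sketch might
look like the following. None of this is real HDFS code: Node,
observedBandwidth, and chooseTargets are invented names, and a real
policy would plug into the HDFS-385 interface instead of being called
directly.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DynamicPlacementSketch {

    /** A candidate datanode with a rolling bandwidth estimate (MB/s). */
    static final class Node {
        final String name;
        double observedBandwidth;  // would be updated from recent block transfers
        Node(String name, double bw) { this.name = name; this.observedBandwidth = bw; }
    }

    /** Pick the numReplicas nodes with the best measured bandwidth to the writer. */
    static List<Node> chooseTargets(List<Node> candidates, int numReplicas) {
        List<Node> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Node n) -> n.observedBandwidth).reversed());
        return sorted.subList(0, Math.min(numReplicas, sorted.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<>();
        cluster.add(new Node("dn1", 95.0));
        cluster.add(new Node("dn2", 20.0));
        cluster.add(new Node("dn3", 80.0));
        cluster.add(new Node("dn4", 55.0));
        for (Node n : chooseTargets(cluster, 3)) {
            System.out.println(n.name + " @ " + n.observedBandwidth + " MB/s");
        }
    }
}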


On Tue, Sep 1, 2009 at 5:50 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hey all,
>
> One place which would be an exceptionally good research project is the new
> pluggable interface for replica placement.
>
> https://issues.apache.org/jira/browse/HDFS-385
>
> [...]
>
> Brian

Re: Contributing to HDFS - Distributed Computing

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey all,

One place which would be an exceptionally good research project is the
new pluggable interface for replica placement:

https://issues.apache.org/jira/browse/HDFS-385

It's something which taps into many lines of CS research (such as
scheduling) and is meant to be experimental for a release or two. I
think if you could come up with a few example placement policies, it
would help Dhruba refine the interface. Because it's only a plug-in,
the barrier to entry is much lower than for core FS features.

To get you started, one problem we've seen is the multi-datacenter
problem. How do you allocate blocks when there is a wildly
heterogeneous network topology (such as an HDFS instance spread
between two centers with only 10 Gbps between the two)? How do your
scheduling decisions affect the performance of MapReduce jobs? How do
you balance good performance with maximum resiliency (placing copies
of blocks in two separate buildings)?

Brian
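
To make the multi-datacenter question concrete, here is one toy,
self-contained sketch of a split-placement answer: with replication 3,
keep two replicas in the writer's site and push one across the thin
inter-site link for resiliency. This is not the actual HDFS-385 API
(the real abstract class lives inside the NameNode and its signatures
have shifted between releases); Datanode and choosePlacement are
invented for illustration.

import java.util.ArrayList;
import java.util.List;

public class MultiDatacenterPlacementSketch {

    static final class Datanode {
        final String name;
        final String site;
        Datanode(String name, String site) { this.name = name; this.site = site; }
    }

    /**
     * Replication factor 3: two replicas in the writer's datacenter
     * (fast, stays off the inter-site link) plus one in the other
     * datacenter (survives the loss of a whole building, at the cost
     * of exactly one transfer over the thin 10 Gbps link).
     */
    static List<Datanode> choosePlacement(List<Datanode> cluster, String writerSite) {
        List<Datanode> chosen = new ArrayList<>();
        for (Datanode dn : cluster) {                 // local replicas first
            if (dn.site.equals(writerSite) && chosen.size() < 2) {
                chosen.add(dn);
            }
        }
        for (Datanode dn : cluster) {                 // then one remote replica
            if (!dn.site.equals(writerSite)) {
                chosen.add(dn);
                break;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Datanode> cluster = new ArrayList<>();
        cluster.add(new Datanode("a1", "siteA"));
        cluster.add(new Datanode("a2", "siteA"));
        cluster.add(new Datanode("b1", "siteB"));
        cluster.add(new Datanode("b2", "siteB"));
        for (Datanode dn : choosePlacement(cluster, "siteA")) {
            System.out.println(dn.name + " (" + dn.site + ")");
        }
    }
}

A real policy built on HDFS-385 would instead subclass the pluggable
placement class and be selected through NameNode configuration (the
dfs.block.replicator.classname key, if memory serves) rather than
called directly.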

On Sep 1, 2009, at 5:28 AM, Steve Loughran wrote:

> [...]