Posted to hdfs-dev@hadoop.apache.org by Sujee Maniyam <su...@sujee.net> on 2012/09/11 01:29:39 UTC

data center aware hadoop?

Hi devs
now that HDFS HA is a reality, how about HDFS spanning multiple
data centers?  Are there any discussions / work going on in this area?

It could be a single cluster spanning multiple data centers or having
a 'standby cluster' in another data center.

curious, and thanks for your time!

regards
Sujee Maniyam
http://sujee.net

Re: data center aware hadoop?

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.
> Or do you plan to have different data across sites and then run MR jobs
> across them? This would be an interesting problem, but it's way above the FS.

MAPREDUCE-4502 relates to this problem. Please check it out if you
are interested.
https://issues.apache.org/jira/browse/MAPREDUCE-4502

- Tsuyoshi

On Sat, Sep 22, 2012 at 5:39 PM, Steve Loughran <st...@hortonworks.com> wrote:
> On 11 September 2012 00:29, Sujee Maniyam <su...@sujee.net> wrote:
>
>> Hi devs
>> now that HDFS HA is a reality, how about HDFS spanning multiple
>> data centers?  Are there any discussions / work going on in this area?
>>
>> It could be a single cluster spanning multiple data centers or having
>> a 'standby cluster' in another data center.
>>
>> curious, and thanks for your time!
>>
>> regards
>> Sujee Maniyam
>> http://sujee.net
>
>
> what are your goals here?
>
>    - store 1 of the 3 replicas off-site for (possible) recovery on a site
>    failure
>    - store 2+ replicas on each site for better recovery of site+block
>    failure
>    - be able to back up all of the data to a different site
>    - be able to back up some of the data to a different site
>    - stream the metadata/NN log to a remote site (you could get away with
>    that today)
>
> Or do you plan to have different data across sites and then run MR jobs
> across them? This would be an interesting problem, but it's way above the FS.
>
> There's still a lot of work that could be done for single-site failure
> tolerance, in particular
> -better failure-topology awareness: if you run the site on two external
> power supplies, as telcos do, then you want at least one copy on each power
> source
> -better partition-failure awareness: distinguish "loss of a rack" from
> "all the machines on a rack have stopped reporting in", which is how it is
> treated today
>
> -steve



-- 
OZAWA Tsuyoshi

Re: data center aware hadoop?

Posted by Steve Loughran <st...@hortonworks.com>.
On 11 September 2012 00:29, Sujee Maniyam <su...@sujee.net> wrote:

> Hi devs
> now that HDFS HA is a reality, how about HDFS spanning multiple
> data centers?  Are there any discussions / work going on in this area?
>
> It could be a single cluster spanning multiple data centers or having
> a 'standby cluster' in another data center.
>
> curious, and thanks for your time!
>
> regards
> Sujee Maniyam
> http://sujee.net


what are your goals here?

   - store 1 of the 3 replicas off-site for (possible) recovery on a site
   failure
   - store 2+ replicas on each site for better recovery of site+block
   failure
   - be able to back up all of the data to a different site
   - be able to back up some of the data to a different site
   - stream the metadata/NN log to a remote site (you could get away with
   that today; see the sketch after this list)
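
On that last point, one crude way to stream the metadata today: list a
second, remote-mounted directory in the namenode's storage configuration,
since the namenode writes its image and edit log to every directory named
there. A minimal sketch, assuming the Hadoop 2 key name and made-up paths
(in practice this lives in hdfs-site.xml rather than code):

import org.apache.hadoop.conf.Configuration;

public class RemoteNameDir {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // One local disk plus an NFS mount that lives in the remote site.
    // The namenode writes fsimage and edits to each listed directory.
    conf.set("dfs.namenode.name.dir",
        "/data/1/dfs/nn,/mnt/remote-site/dfs/nn");
    System.out.println(conf.get("dfs.namenode.name.dir"));
  }
}

Edits go to every directory synchronously, so a slow WAN mount slows every
namespace mutation. For the bulk-data backup options above, DistCp between
two clusters is the usual tool today.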

Or do you plan to have different data across sites and then run MR jobs
across them? This would be an interesting problem, but it's way above the FS.

There's still a lot of work that could be done for single-site failure
tolerance, in particular
-better failure-topology awareness: if you run the site on two external
power supplies, as telcos do, then you want at least one copy on each power
source (a topology-mapping sketch follows this message)
-better partition-failure awareness: distinguish "loss of a rack" from
"all the machines on a rack have stopped reporting in", which is how it is
treated today

-steve
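
A minimal sketch of the topology side: the namenode resolves each datanode
to a path such as /dc1/rack4 through a pluggable mapping, so the hierarchy
can already carry a data-center level above the rack. This assumes Hadoop
2's DNSToSwitchMapping interface and a made-up addressing scheme in which
10.1.x.x hosts sit in site 1 and 10.2.x.x hosts in site 2:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class TwoSiteTopology implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> paths = new ArrayList<String>(names.size());
    for (String name : names) {
      // A real deployment would consult DNS or an inventory database
      // and return a full /site/rack path per host.
      if (name.startsWith("10.1.")) {
        paths.add("/dc1/default-rack");
      } else if (name.startsWith("10.2.")) {
        paths.add("/dc2/default-rack");
      } else {
        paths.add("/default-rack");
      }
    }
    return paths;
  }

  @Override
  public void reloadCachedMappings() {
    // Nothing cached in this sketch; this method only exists on newer
    // versions of the interface, so drop it on older releases.
  }
}

Wire it in with net.topology.node.switch.mapping.impl (the class above is
made up; the key is Hadoop 2's). The catch is the one Junping raises below:
the stock placement policy only distinguishes the last (rack) level, so the
extra layer is recorded but not honored.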

Re: data center aware hadoop?

Posted by Joe Bounour <jb...@ddn.com>.
Hello

Interesting topic, but is it really a general use case? The average HDFS
cluster today is smaller than 100 nodes; still, that is a lot of stored
data. You would have to synchronize petabytes over high-latency networks,
and I would assume you are using HDFS as an archive (meaning you do not
replace the content often). The social network companies cycle logs
through HDFS because the most recent data is their focus.

HDFS site protection would have to be in async mode for sure (for
performance), and data consistency would have to be handled as well, which
is never simple; you could use a WAN accelerator, and of course it is all
doable.

HDFS already has protection (3 replicas); disaster recovery is relevant
for the namenode, and hopefully you would not lose HDFS content.

Enterprise Ops requirements would point to a SAN solution for the
datanodes, replicating the storage array or backing it up; if you cannot
use a SAN and are stuck with DAS, make more copies for a higher protection
level (be pragmatic and save $$; see the sketch after this message).

Maybe I am missing the point below; why is it really needed?

-J


On 9/21/12 5:09 PM, "Jun Ping Du" <jd...@vmware.com> wrote:

>Hi Sujee,
>   HDFS today doesn't take data-center-level reliability into account
>much (the topology is supposed to extend to a data center layer, but that
>is never honored in the replica placement/balancer/task scheduling
>policies), and performance is part of the concern across data centers
>(assuming cross-DC bandwidth is lower than within a data center). However,
>in the future I think we should deliver a solution that enables
>data-center-level disaster recovery even if performance is degraded. My
>several years of experience delivering enterprise software taught me that
>it is best to let the customer make the trade-off between performance and
>reliability; the engineering effort is to provide the options.
>BTW, HDFS HA protects key nodes from SPOF but does not handle a whole
>data center shutdown.
>
>Thanks,
>
>Junping
>
>----- Original Message -----
>From: "Sujee Maniyam" <su...@sujee.net>
>To: "hdfs-dev" <hd...@hadoop.apache.org>
>Sent: Tuesday, September 11, 2012 7:29:39 AM
>Subject: data center aware hadoop?
>
>Hi devs
>now that HDFS HA is a reality, how about HDFS spanning multiple
>data centers?  Are there any discussions / work going on in this area?
>
>It could be a single cluster spanning multiple data centers or having
>a 'standby cluster' in another data center.
>
>curious, and thanks for your time!
>
>regards
>Sujee Maniyam
>http://sujee.net
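
On Joe's point about making extra copies when you are stuck with DAS:
replication in HDFS is set per file, so critical data can carry a higher
factor than the cluster default. A minimal sketch using the standard
FileSystem API, with a made-up path and factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Raise one existing file from the default factor (usually 3) to 5.
    // setReplication applies per file; new files pick up dfs.replication
    // from the client configuration instead.
    boolean scheduled = fs.setReplication(
        new Path("/data/critical/events.log"), (short) 5);
    System.out.println("replication change scheduled: " + scheduled);
    fs.close();
  }
}

The namenode creates the extra replicas asynchronously, so a true return
only means the change was accepted, not that the copies exist yet.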


Re: data center aware hadoop?

Posted by Jun Ping Du <jd...@vmware.com>.
Hi Sujee,
   HDFS today doesn't take data-center-level reliability into account much (the topology is supposed to extend to a data center layer, but that is never honored in the replica placement/balancer/task scheduling policies), and performance is part of the concern across data centers (assuming cross-DC bandwidth is lower than within a data center). However, in the future I think we should deliver a solution that enables data-center-level disaster recovery even if performance is degraded. My several years of experience delivering enterprise software taught me that it is best to let the customer make the trade-off between performance and reliability; the engineering effort is to provide the options.
BTW, HDFS HA protects key nodes from SPOF but does not handle a whole data center shutdown.

Thanks,

Junping

----- Original Message -----
From: "Sujee Maniyam" <su...@sujee.net>
To: "hdfs-dev" <hd...@hadoop.apache.org>
Sent: Tuesday, September 11, 2012 7:29:39 AM
Subject: data center aware hadoop?

Hi devs
now that HDFS HA is a reality, how about HDFS spanning multiple
data centers?  Are there any discussions / work going on in this area?

It could be a single cluster spanning multiple data centers or having
a 'standby cluster' in another data center.

curious, and thanks for your time!

regards
Sujee Maniyam
http://sujee.net
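
Junping's placement gap is at least pluggable: HDFS lets a cluster swap in
its own block placement policy, which is where a site-aware rule (say,
always keep one replica in the second data center) would live. A sketch of
the wiring only, assuming the HDFS 2.x config key; SiteAwarePlacementPolicy
is hypothetical and would extend BlockPlacementPolicyDefault (the real
chooseTarget signatures are version-specific and non-trivial):

import org.apache.hadoop.conf.Configuration;

public class PlacementWiring {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // dfs.block.replicator.classname selects the block placement policy
    // in HDFS 2.x; the class named here is made up for illustration.
    conf.set("dfs.block.replicator.classname",
        "com.example.hdfs.SiteAwarePlacementPolicy");
    System.out.println(conf.get("dfs.block.replicator.classname"));
  }
}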