Posted to common-user@hadoop.apache.org by Abhishek Pratap Singh <ma...@gmail.com> on 2012/04/11 20:44:44 UTC

Multiple data centre in Hadoop

Hi All,

Just wanted to know whether Hadoop supports more than one data centre. This
is basically for DR purposes and High Availability, where if one centre goes
down the other can be brought up.


Regards,
Abhishek

Re: Multiple data centre in Hadoop

Posted by Edward Capriolo <ed...@gmail.com>.

Hive is beginning to implement Region support, where one metastore will
manage multiple filesystems and JobTrackers. When a query creates a
table, it will then be copied to one or more datacenters. In addition,
the query planner will intelligently attempt to run queries only in
regions where all the tables exist.

While waiting for these awesome features I am doing a fair amount of
distcp work from Groovy scripts.
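
A minimal sketch of scripting a cross-datacenter copy with distcp, written
in Python rather than the Groovy mentioned above. The NameNode host names,
port, and paths are hypothetical, and the helper names are made up for
illustration; only the `hadoop distcp` command and its `-update` option
come from Hadoop itself.

```python
import shlex
import subprocess


def build_distcp_command(src_uri, dst_uri, update=True):
    """Return the argv list for copying src_uri to dst_uri with distcp."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update asks distcp to copy only files that are missing or
        # differ at the target, instead of recopying everything.
        cmd.append("-update")
    cmd.extend([src_uri, dst_uri])
    return cmd


def run_distcp(src_uri, dst_uri, dry_run=True):
    """Run the copy, or with dry_run=True just return the command line."""
    cmd = build_distcp_command(src_uri, dst_uri)
    if dry_run:
        return " ".join(shlex.quote(part) for part in cmd)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return None


if __name__ == "__main__":
    # Hypothetical clusters: one NameNode per datacenter.
    print(run_distcp("hdfs://nn-dc1:8020/warehouse/events",
                     "hdfs://nn-dc2:8020/warehouse/events"))
```

Keeping the command construction separate from the execution makes it easy
to log or review what would be copied before touching either cluster.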

Edward

On Thu, Apr 19, 2012 at 5:33 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> If you want to start an open source project for this, I am sure that there are others with the same problem who might be very willing to help out. :)
>
> --Bobby Evans

Re: Multiple data centre in Hadoop

Posted by Robert Evans <ev...@yahoo-inc.com>.

If you want to start an open source project for this, I am sure that there are others with the same problem who might be very willing to help out. :)

--Bobby Evans

On 4/19/12 4:31 PM, "Michael Segel" <mi...@hotmail.com> wrote:

I don't know of any open source solution for doing this...
And yeah, it's something one can't talk about....  ;-)



Re: Multiple data centre in Hadoop

Posted by Michael Segel <mi...@hotmail.com>.

I don't know of any open source solution for doing this...
And yeah, it's something one can't talk about....  ;-)


On Apr 19, 2012, at 4:28 PM, Robert Evans wrote:

> Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details of it.  I can guess about what it would take, but that is all it would be at this point.
> 
> --Bobby

Re: Multiple data centre in Hadoop

Posted by Robert Evans <ev...@yahoo-inc.com>.

Where I work we have done some things like this, but none of them are open source, and I have not really been directly involved with the details of it.  I can guess about what it would take, but that is all it would be at this point.

--Bobby


On 4/17/12 5:46 PM, "Abhishek Pratap Singh" <ma...@gmail.com> wrote:

Thanks Bobby, I'm looking for something like this. Now the question is
what the best strategy is for doing hot/hot or hot/warm.
I need to consider the CPU and network bandwidth, and also need to decide
at which layer this replication should start.

Regards,
Abhishek


Re: Multiple data centre in Hadoop

Posted by Abhishek Pratap Singh <ma...@gmail.com>.

Thanks Bobby, I'm looking for something like this. Now the question is
what the best strategy is for doing hot/hot or hot/warm.
I need to consider the CPU and network bandwidth, and also need to decide
at which layer this replication should start.
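
As a rough aid for the network bandwidth question, here is a
back-of-the-envelope sketch in Python of how long it might take to ship one
batch of data between colos. The function name, the example data size, the
link speed, and the usable fraction are all assumed figures for
illustration, not measurements from any real cluster.

```python
def replication_hours(data_bytes, link_bits_per_second, usable_fraction=0.5):
    """Estimate hours needed to ship data_bytes over an inter-colo link.

    usable_fraction is the share of the link assumed to be available to
    replication traffic (hypothetical default of 50%).
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_bits_per_second = link_bits_per_second * usable_fraction
    seconds = (data_bytes * 8) / usable_bits_per_second
    return seconds / 3600.0


if __name__ == "__main__":
    # Assumed example: 1 TB of new data per checkpoint over a 1 Gbit/s link.
    hours = replication_hours(10**12, 10**9, usable_fraction=0.5)
    print(f"{hours:.2f} hours per checkpoint")
```

If the estimate is longer than the interval between checkpoints, replication
can never catch up, which is one way to decide between hot/hot and hot/warm
for a given link.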

Regards,
Abhishek

On Mon, Apr 16, 2012 at 7:08 AM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Hi Abhishek,
>
> Manu is correct about High Availability within a single colo.  I realize
> that in some cases you have to have failover between colos.  I am not
> aware of any turnkey solution for things like that, but generally what you
> want to do is to run two clusters, one in each colo, either hot/hot or
> hot/warm, and I have seen both depending on how quickly you need to fail
> over.  In hot/hot the input data is replicated to both clusters and the
> same software is run on both.  In this case though you have to be fairly
> sure that your processing is deterministic, or the results could be
> slightly different (i.e. no generating of random IDs).  In hot/warm the
> data is replicated from one colo to the other at defined checkpoints.  The
> data is only processed on one of the grids, but if that colo goes down the
> other one can take up the processing from wherever the last checkpoint
> was.
>
> I hope that helps.
>
> --Bobby

Re: Multiple data centre in Hadoop

Posted by Robert Evans <ev...@yahoo-inc.com>.

Hi Abhishek,

Manu is correct about High Availability within a single colo.  I realize that in some cases you have to have failover between colos.  I am not aware of any turnkey solution for things like that, but generally what you want to do is to run two clusters, one in each colo, either hot/hot or hot/warm, and I have seen both depending on how quickly you need to fail over.  In hot/hot the input data is replicated to both clusters and the same software is run on both.  In this case though you have to be fairly sure that your processing is deterministic, or the results could be slightly different (i.e. no generating of random IDs).  In hot/warm the data is replicated from one colo to the other at defined checkpoints.  The data is only processed on one of the grids, but if that colo goes down the other one can take up the processing from wherever the last checkpoint was.
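
A minimal sketch of the determinism point, in Python: one way to avoid
random IDs in a hot/hot setup is to derive each ID from the record's own
fields, so that both clusters compute the same value for the same input.
The function name, the use of SHA-256, and the 16-character truncation are
assumptions made for illustration, not something described in this thread.

```python
import hashlib


def deterministic_id(*fields):
    """Derive an ID from the record's own fields, so both clusters agree."""
    digest = hashlib.sha256()
    for field in fields:
        encoded = str(field).encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    # Truncated for readability; a shorter ID raises the collision risk.
    return digest.hexdigest()[:16]


if __name__ == "__main__":
    # The same input record yields the same ID on either cluster.
    print(deterministic_id("user-42", "2012-04-16", "click"))
```

By contrast, something like a random UUID would give each cluster a
different ID for the same record, and the two sets of results could no
longer be compared or swapped on failover.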

I hope that helps.

--Bobby

On 4/12/12 5:07 AM, "Manu S" <ma...@gmail.com> wrote:

Hi Abhishek,

1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc.
   * Recommendation: write to *two local directories on different
     physical volumes*, and to an *NFS-mounted* directory
     - Data will be preserved even in the event of a total failure of the
       NameNode machines
   * Recommendation: *soft-mount the NFS* directory
     - If the NFS mount goes offline, this will not cause the NameNode
       to fail

2. *Rack awareness*
https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf


Re: Multiple data centre in Hadoop

Posted by Manu S <ma...@gmail.com>.

Hi Abhishek,

1. Use multiple directories for *dfs.name.dir* & *dfs.data.dir* etc.
   * Recommendation: write to *two local directories on different
     physical volumes*, and to an *NFS-mounted* directory
     - Data will be preserved even in the event of a total failure of the
       NameNode machines
   * Recommendation: *soft-mount the NFS* directory
     - If the NFS mount goes offline, this will not cause the NameNode
       to fail

2. *Rack awareness*
https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf
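
To make point 1 concrete, here is a small Python sketch that renders a
dfs.name.dir property with a comma-separated list of directories, which is
how Hadoop expects multiple NameNode metadata directories to be given in
hdfs-site.xml. The three paths and the helper name are hypothetical; choose
paths that match your own local volumes and NFS mount.

```python
from xml.sax.saxutils import escape


def render_property(name, directories):
    """Render one Hadoop <property> element with a comma-separated value."""
    value = ",".join(directories)
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


# Hypothetical paths: two local physical volumes plus one NFS mount.
NAME_DIRS = ["/data/1/dfs/nn", "/data/2/dfs/nn", "/mnt/nfs/dfs/nn"]

if __name__ == "__main__":
    print(render_property("dfs.name.dir", NAME_DIRS))
```

The printed element can be pasted inside the <configuration> section of
hdfs-site.xml. The NameNode then writes its metadata to every listed
directory, which is what gives the redundancy described above.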

On Thu, Apr 12, 2012 at 2:18 AM, Abhishek Pratap Singh
<ma...@gmail.com> wrote:

> Thanks Robert.
> Is there a best practice or design that can address High Availability
> to a certain extent?
>
> ~Abhishek

-- 
Thanks & Regards
----
*Manu S*
SI Engineer - OpenSource & HPC
Wipro Infotech
Mob: +91 8861302855                Skype: manuspkd
www.opensourcetalk.co.in

Re: Multiple data centre in Hadoop

Posted by Abhishek Pratap Singh <ma...@gmail.com>.

Thanks Robert.
Is there a best practice or design that can address High Availability
to a certain extent?

~Abhishek

On Wed, Apr 11, 2012 at 12:32 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> No it does not. Sorry

Re: Multiple data centre in Hadoop

Posted by Robert Evans <ev...@yahoo-inc.com>.

No it does not. Sorry


On 4/11/12 1:44 PM, "Abhishek Pratap Singh" <ma...@gmail.com> wrote:

Hi All,

Just wanted to know whether Hadoop supports more than one data centre. This
is basically for DR purposes and High Availability, where if one centre goes
down the other can be brought up.


Regards,
Abhishek