Posted to common-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2012/08/03 20:50:05 UTC

migrate cluster to different datacenter

Hi Hadoopers,

We have a plan to migrate our Hadoop cluster to a different datacenter
where we can triple the size of the cluster.
Currently, our 0.20.2 cluster has around 1PB of data. We use only Java/Pig.

I would like to get some input on how to handle transferring
1PB of data to the new site, and also how to keep up with the
new files that are thrown into the cluster all the time.

Happy friday !!

P

Re: migrate cluster to different datacenter

Posted by Michael Segel <mi...@hotmail.com>.
The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem. 

On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani <bu...@gmail.com> wrote:

> Given the size of data, there can be several approaches here:
> 
> 1. Moving the boxes
> 
> Not possible, as I suppose the data must be needed for 24x7 analytics.
> 
> 2. Mirroring the data.
> 
> This is a good solution. However, if you have data being written/removed
> continuously (if a part of live system), there are chances of losing some
> of the data during mirroring happens, unless
> a) You block writes/updates during that time (if you do so, that would be
> as good as unplugging and moving the machine around), or,
> b) Keep a track of what was modified since you started the mirroring
> process.
> 
> I would recommend you to go with 2b) because it minimizes downtime. Here is
> how I think you can do it, by using some of the tools provided by Hadoop
> itself.
> 
> a) You can use some fast distributed copying tool to copy large chunks of
> data. Before you kick-off with this, you can create a utility that tracks
> the modification of data made to your live system while copying is going on
> in the background. The utility will log the modifications into an audit
> trail.
> b) Once you're done copying the files,  allow the new data store
> replication to catch up by reading the real-time modifications that were
> made, from your utility's log file. Once sync'ed up you can begin with the
> minimal downtime by switching off the JobTracker in live cluster so that
> new files are not created.
> c) As soon as you reach the last chunk of copying, change the DNS entries
> so that the hostnames referenced by the Hadoop jobs points to the new
> location.
> d) Turn on the JobTracker for the new cluster.
> e) Enjoy a drink with the money you saved by not using other paid third
> party solutions and pat your back! ;)
> 
> The key of the above solution is to make data copying of step a) as fast as
> possible. Lesser the time, lesser the contents in audit trail, lesser the
> overall downtime.
> 
> You can develop some in house solution for this, or use DistCp, provided by
> Hadoop that uses copies over the data using Map/Reduce.
> 
> 
> On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel <mi...@hotmail.com>wrote:
> 
>> Sorry at 1PB of disk... compression isn't going to really help a whole
>> heck of a lot. Your networking bandwidth will be your bottleneck.
>> 
>> So lets look at the problem.
>> 
>> How much down time can you afford?
>> What does your hardware look like?
>> How much space do you have in your current data center?
>> 
>> You have 1PB of data. OK, what does the access pattern look like?
>> 
>> There are a couple of ways to slice and dice this. How many trucks do you
>> have?
>> 
>> On Aug 3, 2012, at 4:24 PM, Harit Himanshu <ha...@gmail.com>
>> wrote:
>> 
>>> Moving 1 PB of data would take loads of time,
>>> - check if this new data center provides something similar to
>> http://aws.amazon.com/importexport/
>>> - Consider multi part uploading of data
>>> - consider compressing the data
>>> 
>>> 
>>> On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
>>> 
>>>> thanks for response.
>>>> Physical move is not a choice in this case. Purely looking for copying
>>>> data and how to catch up with the update of a file while it is being
>>>> migrated.
>>>> 
>>>> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
>>>>> sometimes, physically moving hard drives helps.   :)
>>>>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <
>> silvianhadoop@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi Hadoopers,
>>>>>> 
>>>>>> We have a plan to migrate Hadoop cluster to a different datacenter
>>>>>> where we can triple the size of the cluster.
>>>>>> Currently, our 0.20.2 cluster have around 1PB of data. We use only
>>>>>> Java/Pig.
>>>>>> 
>>>>>> I would like to get some input how we gonna handle with transferring
>>>>>> 1PB of data to a new site, and also keep up with
>>>>>> new files that thrown into cluster all the time.
>>>>>> 
>>>>>> Happy friday !!
>>>>>> 
>>>>>> P
>>>>>> 
>>> 
>> 
>> 


Re: migrate cluster to different datacenter

Posted by Nitin Kesarwani <bu...@gmail.com>.
Given the size of data, there can be several approaches here:

1. Moving the boxes

Probably not possible, as I assume the data is needed for 24x7 analytics.

2. Mirroring the data.

This is a good approach. However, if data is being written/removed
continuously (i.e., the cluster is part of a live system), there is a chance
of losing some of the data while the mirroring happens, unless
a) you block writes/updates during that time (if you do so, that would be
as good as unplugging and moving the machines around), or
b) you keep track of what was modified after you started the mirroring
process.

I would recommend going with 2b) because it minimizes downtime. Here is
how I think you can do it, using some of the tools provided by Hadoop
itself.

a) Use a fast distributed copying tool to copy large chunks of data.
Before you kick this off, create a utility that tracks the modifications
made to your live system while the copy runs in the background. The
utility logs these modifications into an audit trail.
b) Once you're done copying the files, let the new data store catch up by
replaying the real-time modifications recorded in your utility's log file.
Once it is synced up, you can begin the (minimal) downtime by switching off
the JobTracker in the live cluster so that no new files are created.
c) As soon as you finish the last chunk of copying, change the DNS entries
so that the hostnames referenced by the Hadoop jobs point to the new
location.
d) Turn on the JobTracker for the new cluster.
e) Enjoy a drink with the money you saved by not using paid third-party
solutions, and pat yourself on the back! ;)

The key to the above solution is to make the data copy in step a) as fast
as possible. The less time it takes, the less content ends up in the audit
trail, and the less overall downtime you need.

You can develop some in-house solution for this, or use DistCp, which
ships with Hadoop and copies the data using MapReduce (see the sketch below).
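
For the bulk copy itself, a minimal DistCp sketch might look like the
following (the NameNode hostnames, ports, paths, and map count are made up
for illustration; adjust them to your setup). The first pass does the big
copy while the cluster stays live; later -update passes pick up files that
appeared or changed since, which keeps the final catch-up window small:

  # Initial bulk copy (run from a client of the destination cluster):
  hadoop distcp -p -m 200 -log /tmp/distcp_logs \
      hdfs://old-nn.example.com:8020/data hdfs://new-nn.example.com:8020/data

  # Repeatable incremental passes; -update skips files already present
  # on the destination with the same size:
  hadoop distcp -update -p \
      hdfs://old-nn.example.com:8020/data hdfs://new-nn.example.com:8020/data

  # If the two clusters run different Hadoop versions, read the source
  # over HFTP instead of HDFS:
  hadoop distcp -update hftp://old-nn.example.com:50070/data \
      hdfs://new-nn.example.com:8020/data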


On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel <mi...@hotmail.com>wrote:

> Sorry at 1PB of disk... compression isn't going to really help a whole
> heck of a lot. Your networking bandwidth will be your bottleneck.
>
> So lets look at the problem.
>
> How much down time can you afford?
> What does your hardware look like?
> How much space do you have in your current data center?
>
> You have 1PB of data. OK, what does the access pattern look like?
>
> There are a couple of ways to slice and dice this. How many trucks do you
> have?
>
> On Aug 3, 2012, at 4:24 PM, Harit Himanshu <ha...@gmail.com>
> wrote:
>
> > Moving 1 PB of data would take loads of time,
> > - check if this new data center provides something similar to
> http://aws.amazon.com/importexport/
> > - Consider multi part uploading of data
> > - consider compressing the data
> >
> >
> > On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
> >
> >> thanks for response.
> >> Physical move is not a choice in this case. Purely looking for copying
> >> data and how to catch up with the update of a file while it is being
> >> migrated.
> >>
> >> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
> >>> sometimes, physically moving hard drives helps.   :)
> >>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <
> silvianhadoop@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Hadoopers,
> >>>>
> >>>> We have a plan to migrate Hadoop cluster to a different datacenter
> >>>> where we can triple the size of the cluster.
> >>>> Currently, our 0.20.2 cluster have around 1PB of data. We use only
> >>>> Java/Pig.
> >>>>
> >>>> I would like to get some input how we gonna handle with transferring
> >>>> 1PB of data to a new site, and also keep up with
> >>>> new files that thrown into cluster all the time.
> >>>>
> >>>> Happy friday !!
> >>>>
> >>>> P
> >>>>
> >
>
>

Re: migrate cluster to different datacenter

Posted by Michael Segel <mi...@hotmail.com>.
Sorry, but at 1PB of disk... compression isn't really going to help a whole heck of a lot. Your network bandwidth will be your bottleneck.

So let's look at the problem.

How much down time can you afford? 
What does your hardware look like? 
How much space do you have in your current data center? 

You have 1PB of data. OK, what does the access pattern look like? 

There are a couple of ways to slice and dice this. How many trucks do you have? 
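
To put a rough number on that (back-of-envelope only, assuming a dedicated,
fully utilized link): 1PB is about 8 x 10^15 bits, so over a 10Gb/s
inter-datacenter link the raw transfer alone takes on the order of
8 x 10^5 seconds, roughly nine to ten days; over a 1Gb/s link it is closer
to three months. Protocol overhead, retries, and competition from
production traffic only add to that.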

On Aug 3, 2012, at 4:24 PM, Harit Himanshu <ha...@gmail.com> wrote:

> Moving 1 PB of data would take loads of time, 
> - check if this new data center provides something similar to http://aws.amazon.com/importexport/
> - Consider multi part uploading of data
> - consider compressing the data
> 
> 
> On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
> 
>> thanks for response.
>> Physical move is not a choice in this case. Purely looking for copying
>> data and how to catch up with the update of a file while it is being
>> migrated.
>> 
>> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
>>> sometimes, physically moving hard drives helps.   :)
>>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <si...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Hadoopers,
>>>> 
>>>> We have a plan to migrate Hadoop cluster to a different datacenter
>>>> where we can triple the size of the cluster.
>>>> Currently, our 0.20.2 cluster have around 1PB of data. We use only
>>>> Java/Pig.
>>>> 
>>>> I would like to get some input how we gonna handle with transferring
>>>> 1PB of data to a new site, and also keep up with
>>>> new files that thrown into cluster all the time.
>>>> 
>>>> Happy friday !!
>>>> 
>>>> P
>>>> 
> 


Re: migrate cluster to different datacenter

Posted by Harit Himanshu <ha...@gmail.com>.
Moving 1 PB of data would take loads of time:
- Check if the new data center provides something similar to http://aws.amazon.com/importexport/
- Consider multi-part uploading of the data
- Consider compressing the data


On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:

> thanks for response.
> Physical move is not a choice in this case. Purely looking for copying
> data and how to catch up with the update of a file while it is being
> migrated.
> 
> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
>> sometimes, physically moving hard drives helps.   :)
>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <si...@gmail.com>
>> wrote:
>> 
>>> Hi Hadoopers,
>>> 
>>> We have a plan to migrate Hadoop cluster to a different datacenter
>>> where we can triple the size of the cluster.
>>> Currently, our 0.20.2 cluster have around 1PB of data. We use only
>>> Java/Pig.
>>> 
>>> I would like to get some input how we gonna handle with transferring
>>> 1PB of data to a new site, and also keep up with
>>> new files that thrown into cluster all the time.
>>> 
>>> Happy friday !!
>>> 
>>> P
>>> 


Re: migrate cluster to different datacenter

Posted by Patrick Angeles <pa...@gmail.com>.
It would help to know your data ingest and processing patterns (and any
applicable SLAs).

In most cases, you'd only need to move the raw ingested data, then you can
derive the rest in the other cluster. Assuming that you have some sort of
date-based partitioning on the ingest, then it's easy to define a cut-off
point.

Depending on your read SLAs, you could tee writes to both clusters for a
period of time, or just simply switch off to the new one once the majority
of data has been moved.

Finally, you would want to do a consistency check to make sure everything
made it to the other side... maybe run a checksum on derived data on both
clusters and compare. Something like that...
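
As a very rough version of that check (a sketch only; the NameNode
addresses and paths below are placeholders), comparing the directory count,
file count, and total bytes per ingest partition on both sides will catch
most gaps before you go down to per-file checksums:

  hadoop fs -count hdfs://old-nn.example.com:8020/data/2012-08-01
  hadoop fs -count hdfs://new-nn.example.com:8020/data/2012-08-01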

- P


On Fri, Aug 3, 2012 at 5:19 PM, Patai Sangbutsarakum <
silvianhadoop@gmail.com> wrote:

> thanks for response.
> Physical move is not a choice in this case. Purely looking for copying
> data and how to catch up with the update of a file while it is being
> migrated.
>
> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
> > sometimes, physically moving hard drives helps.   :)
> > On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <si...@gmail.com>
> > wrote:
> >
> >> Hi Hadoopers,
> >>
> >> We have a plan to migrate Hadoop cluster to a different datacenter
> >> where we can triple the size of the cluster.
> >> Currently, our 0.20.2 cluster have around 1PB of data. We use only
> >> Java/Pig.
> >>
> >> I would like to get some input how we gonna handle with transferring
> >> 1PB of data to a new site, and also keep up with
> >> new files that thrown into cluster all the time.
> >>
> >> Happy friday !!
> >>
> >> P
> >>
>

Re: migrate cluster to different datacenter

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks for the response.
A physical move is not an option in this case. I'm purely looking at how to
copy the data and how to catch up with updates to a file while it is being
migrated.

On Fri, Aug 3, 2012 at 12:40 PM, Chen He <ai...@gmail.com> wrote:
> sometimes, physically moving hard drives helps.   :)
> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <si...@gmail.com>
> wrote:
>
>> Hi Hadoopers,
>>
>> We have a plan to migrate Hadoop cluster to a different datacenter
>> where we can triple the size of the cluster.
>> Currently, our 0.20.2 cluster have around 1PB of data. We use only
>> Java/Pig.
>>
>> I would like to get some input how we gonna handle with transferring
>> 1PB of data to a new site, and also keep up with
>> new files that thrown into cluster all the time.
>>
>> Happy friday !!
>>
>> P
>>

Re: migrate cluster to different datacenter

Posted by Chen He <ai...@gmail.com>.
Sometimes, physically moving the hard drives helps. :)
On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <si...@gmail.com>
wrote:

> Hi Hadoopers,
>
> We have a plan to migrate Hadoop cluster to a different datacenter
> where we can triple the size of the cluster.
> Currently, our 0.20.2 cluster have around 1PB of data. We use only
> Java/Pig.
>
> I would like to get some input how we gonna handle with transferring
> 1PB of data to a new site, and also keep up with
> new files that thrown into cluster all the time.
>
> Happy friday !!
>
> P
>