Posted to mapreduce-user@hadoop.apache.org by raymond <rg...@163.com> on 2016/04/12 11:44:09 UTC

Best way to migrate PB scale data between live cluster?

Hi
 
We have a Hadoop cluster holding several PB of data, and we need to migrate it to a new cluster in another datacenter for larger capacity.
We estimate that the data copy itself might take close to a month to finish, so we are looking for a sound solution. The requirements are as follows:
1. We cannot bring down the old cluster for that long (of course); a downtime of a couple of hours is acceptable.
2. We need to mirror the data: not only copy the new data, but also propagate the deletions that happen during the migration period.
3. We don't have much space left on the old cluster, say 30% free.
 
Regarding distcp: although it might be the easiest way, it has several problems:
 
1. it does not handle deletions
2. it handles appended files by comparing file sizes and re-copying the whole file (which can waste a lot of bandwidth)
3. per-file error handling is minimal
4. load control is difficult (we still have a heavy workload on the old cluster); about all you can do is split the job manually into pieces small enough to meet the flow-control goal (see the sketch below).
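If I read the DistCp v2 options correctly, the -m and -bandwidth flags do give some coarse throttling, but only over the copy itself, not over how the work is scheduled. A minimal sketch of what I mean, with placeholder host names and paths:

    hadoop distcp -update -m 50 -bandwidth 10 \
        hdfs://old-nn:8020/data hdfs://new-nn:8020/data
    # -m caps the number of map tasks, -bandwidth caps each map at about 10 MB/s,
    # so the aggregate copy rate stays somewhere around 500 MB/s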
 
In short, for a long-running mirroring job, distcp won't do well by itself.
 
There is some additional work that probably needs to be done.
 
We can:
 
Wrap distcp to make it work better (e.g. error handling, checking results, extra code to sync deleted files, etc.).
Use the HDFS snapshot mechanism to better identify the files that need to be copied, deleted or renamed, e.g. with something like the commands sketched below.
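A minimal sketch of the snapshot idea, assuming the source directory is made snapshottable first (paths and snapshot names are only examples):

    hdfs dfsadmin -allowSnapshot /data       # enable snapshots on the source path
    hdfs dfs -createSnapshot /data s1        # baseline taken before the bulk copy
    hdfs dfs -createSnapshot /data s2        # taken later, when we want to sync changes
    hdfs snapshotDiff /data s1 s2            # lists files created/deleted/renamed/modified in between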
 
Or
 
Forget about distcp. Use the fsimage and edit log as a change-history source and write our own code to replay the operations, handling each file one by one (better per-file error handling could be achieved), but this might need a lot of development work.
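If we go that route, I assume the offline viewers would be the starting point for extracting the change history (the fsimage/edits file names below are placeholders for whatever sits in the NameNode's current/ directory):

    hdfs oiv -p XML -i fsimage_0000000000012345678 -o fsimage.xml    # dump the namespace
    hdfs oev -p xml -i edits_0000000000012345679-0000000000012345999 -o edits.xml
    # our replay tool would then walk the OP_ADD / OP_DELETE / OP_RENAME records
    # in edits.xml and apply the matching copies and deletes on the new cluster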
 
 
Btw, the closest thing I could find is Facebook migrating their 30PB Hive warehouse:
 
https://www.facebook.com/notes/facebook-engineering/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920/
 
They modified distcp to do the initial bulk load (to better handle very large and very small files, for load balancing I guess), and built a replication system (not much detail on this part) to mirror the changes.
 
But it is not clear how they handled the distcp shortcomings I mentioned above, or whether they used the snapshot mechanism.
 
So, does anyone have experience with this kind of work? What do you think might be the best approach for our case? Is there any ready-made work we can use? Has anything been done around the snapshot mechanism to ease data migration?

Re: Best way to migrate PB scale data between live cluster?

Posted by Namikaze Minato <ll...@gmail.com>.
The clean way to go is to start from the log and replay it... but I actually have no idea how to do that.
You might find this (old) work interesting:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

I would never have tried to transmit this much data across the network; I would have looked for a way to copy the hard disks and physically ship them to the new location...

Camusensei

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: Best way to migrate PB scale data between live cluster?

Posted by cs user <ac...@gmail.com>.
Hi there,

At some point in the near future we are also going to require exactly what
you describe. We had hoped to use distcp.

You mentioned:

1. it does not handle deletions

distcp has a -delete flag which says -

"Delete the files existing in the dst but not in src"

Does this not help with handling deleted data?

I believe there is an issue if data is removed during a distcp run: at the
start of the run distcp captures the full list of files it needs to sync, and
if some of those files are deleted while the run is in progress, it can lead
to errors. Is there a way to ignore these errors and have distcp retry on the
next run?
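For what it's worth, the incremental run we had been planning looks roughly
like this (paths are placeholders, and we have not yet tested how it behaves
when files disappear mid-run):

    hadoop distcp -update -delete -i \
        hdfs://source-nn:8020/data hdfs://target-nn:8020/data
    # -update copies only files that differ, -delete removes target files that
    # no longer exist on the source, -i tells distcp to ignore per-file
    # failures and keep going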

I'd be interested in how you eventually manage to accomplish the syncing
between the two clusters, because we also need to solve the very same
problem :-)

Perhaps others on the mailing list have experience with this?


Thanks!


Re: Best way to migrate PB scale data between live cluster?

Posted by raymond <rg...@163.com>.
Just coming back with our actual choice, FYI.

We finally chose to use distcp v2 for the migration work (the edit-log replay approach we developed ourselves is not verified, and we need to do this quickly, so…), to minimize possible issues during the migration period. We also use the snapshot mechanism (it solved a lot of the issues I mentioned before). There is a blog post by Cloudera from last year that describes the process in great detail:

https://blog.cloudera.com/blog/2015/12/distcp-performance-improvements-in-apache-hadoop/

There are several patches that are critical to this task, and we are unfortunately not running a version that includes all of them, so we are looking into backporting them. The main issue is that our metadata is huge, and without the patches the startup phase of each iteration takes a long time (in our case, it took almost 6-8 hours to collect the necessary metadata before the actual data transfer started).

We have transferred about 1PB across the clusters and have almost finished the first iteration, so the whole process is not yet fully verified (that is, the snapshot-diff mechanism is not yet fully verified on a large cluster).
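For reference, each incremental pass after the bulk copy looks roughly like this (paths and snapshot names are just examples):

    hdfs dfs -createSnapshot /data s2
    hadoop distcp -update -diff s1 s2 \
        hdfs://old-nn:8020/data hdfs://new-nn:8020/data
    # -diff makes distcp copy/delete/rename only what changed between the two
    # source snapshots; as far as I understand, the target also needs a matching
    # s1 snapshot and must not be modified outside of distcp between passes
    hdfs dfs -deleteSnapshot /data s1        # after a successful run, s2 becomes the new baseline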

—

Raymond




Re: Best way to migrate PB scale data between live cluster?

Posted by cs user <ac...@gmail.com>.
rsync is fairly low level; I guess it would be OK as a last resort to get
back files held within Hadoop, but it might be difficult to reconstruct a
Hadoop cluster from just the raw block files on disk. It wouldn't be very
quick in any case.

How are people doing disaster recovery with large Hadoop clusters, then?
Let's say you have two data centers and you want to replicate data from one
cluster to another, so that if you lost your primary DC you could switch to
the secondary one.

If you take a look here - http://hortonworks.com/partner/wandisco/

There is a paid-for solution from WANdisco which can perform this
replication for you. Are there no other alternatives?



Re: Best way to migrate PB scale data between live cluster?

Posted by Jonathan Aquilina <ja...@eagleeyet.net>.
Probably a stupid suggestion, but did you guys consider rsync? It is supposed
to be quick and can do deletes.
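What I had in mind is something along these lines at the plain filesystem
level (paths are only placeholders):

    rsync -a --delete /data/ newhost:/data/
    # -a preserves permissions and timestamps, --delete removes files on the
    # destination that no longer exist on the source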



Re: Best way to migrate PB scale data between live cluster?

Posted by cs user <ac...@gmail.com>.
Hi there,

At some point in the near future we are also going to require exactly what
you describe. We had hoped to use distcp.

You mentioned:

1. it does not handle deletions

distcp has a -delete flag, which is documented as:

"Delete the files existing in the dst but not in src"

Does this not help with handling deleted data?

I believe there is an issue if data is removed during a distcp run: at the
start of the run it captures the list of files it needs to sync, and if some
of those files are deleted while the run is in progress, it can lead to
errors. Is there a way to ignore these errors and have distcp retry on the
next run?
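
For what it's worth, the kind of incremental run we were hoping to get away with looks roughly like the sketch below. It is untested, the namenode addresses and paths are made up, and the exact flags depend on the Hadoop version (as far as I can tell, -delete needs -update or -overwrite, and -i just skips failed copies so a later run can pick them up):

    # incremental mirror: copy changed files, remove files on the target that
    # no longer exist on the source, ignore per-file failures, and cap the
    # number of maps and per-map bandwidth to limit load on the old cluster
    hadoop distcp -update -delete -i \
        -m 20 -bandwidth 10 \
        hdfs://old-nn:8020/data hdfs://new-nn:8020/data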

I'd be interested to hear how you eventually accomplish the syncing between
the two clusters, because we need to solve the very same problem :-)

Perhaps others on the mailing list have experience with this?


Thanks!


