Posted to user@cassandra.apache.org by Vegard Berget <po...@fantasista.no> on 2012/12/19 13:27:45 UTC

Moving data from one datacenter to another

Hi,
I know this has been a topic here before, but I need some input on
how to move data from one datacenter to another (Google just gives me
some old mails), while moving "production" writes the same way.
Adding the target cluster to the source cluster and replicating data
before moving the source nodes is not an option, so my plan is as
follows:

1)  Flush data on the source cluster and move all data/ files to the
destination cluster.  While this is going on, we are still writing to
the source cluster.
2)  When the data is copied, start Cassandra on the new cluster, then
move writing/reading to the new cluster.
3)  Now do a new flush on the source cluster.  As I understand it,
the sstable files are immutable, so only the _newly added_ data/
files need to be moved to the target cluster.
4)  After the new data is also copied into the target data/, do a
nodetool refresh to load the new sstables into the system (I know we
need to take care of filenames).
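To make the moving parts concrete, here is a rough shell sketch of the delta part of the plan, the only genuinely fiddly step. Everything below is hypothetical (paths, keyspace, and file names are made up), and the nodetool/rsync invocations are shown as comments since they need live clusters; the runnable part just demonstrates picking out the newly flushed sstables with sorted listings and comm, relying on sstables being immutable:

```shell
#!/bin/sh
# Sketch of step 3: after the first bulk copy and a second flush, the
# delta is exactly the files that were not there the first time
# (sstables are immutable, so existing files never change).
# Around this you would run, roughly:
#   nodetool -h source-node flush
#   rsync -a /var/lib/cassandra/data/MyKeyspace/ dest:.../MyKeyspace/
# ...serve traffic, cut over, flush again, then delta-copy and
#   nodetool -h dest-node refresh MyKeyspace MyColumnFamily

# delta_listing OLD NEW -> names present in NEW but not in OLD.
delta_listing() {
    comm -13 "$1" "$2"
}

# Demo on a scratch directory standing in for the data directory:
dir=$(mktemp -d)
touch "$dir/Users-hd-1-Data.db" "$dir/Users-hd-2-Data.db"
ls "$dir" | grep '\.db$' | sort > "$dir/first.list"   # at first copy
touch "$dir/Users-hd-3-Data.db"                       # flushed later
ls "$dir" | grep '\.db$' | sort > "$dir/second.list"  # at cutover
delta_listing "$dir/first.list" "$dir/second.list"    # -> Users-hd-3-Data.db
# Feed that list to: rsync -a --files-from=... source dest
```

The same technique works whether the listings come from the live data/ directory (flush-based plan) or from two snapshots.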

It's worth noting that none of the data is critical, but it would be
nice to get it correct.  I know there will be a short period between
steps 2 and 4 where reads could return old data (written while
copying, read after we have moved reads/writes).  This is OK in this
case.  Our second alternative is:

	1) Drain old cluster
2) Copy to new cluster
3) Start new cluster

This will cause the cluster to be unavailable for writes during the
copy period, and I wish to avoid that (even if that, too, is
survivable).

Both clusters are on 1.1.6, but it might be that we upgrade the
target to 1.1.7, as I can't see that this will cause any problems?

	Questions:

1)  Both clusters have the same number of nodes, but do the tokens
need to be the same as well?  (Wouldn't a repair correct that
later?)

2)  Can data files have any name?  Could we, to avoid a filename
clash, just substitute the numbers with, for example, XXX in the
data files?

	3)  Is this really a sane way to do things?  

	Suggestions are most welcome!

	Regards
Vegard Berget



Re: Moving data from one datacenter to another

Posted by aaron morton <aa...@thelastpickle.com>.
Sounds about right; I've done similar things before. 

Some notes…

* I would make sure repair has completed on the source cluster before making changes. I just like to know the data is distributed. I would also run repair again once all the moves are done.

* Rather than flush, take a snapshot and copy from that. Then you will have a stable set of files and it's easier to go back and see what you copied. (Snapshot does a flush first.) 
 
* Take a second snapshot after you stop writing to the original cluster and work out the delta between them. New files in the second snapshot are the ones to copy. 
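For anyone curious why a snapshot gives a stable set so cheaply: nodetool snapshot hard-links the live sstables into a snapshots directory, and since sstables are immutable the links stay valid even after compaction removes the originals. A tiny stand-in demonstration (the paths and file names here are made up):

```shell
#!/bin/sh
# Stand-in for what `nodetool snapshot` does: hard-link each sstable
# into a snapshots/<tag> directory. No data is copied, and the link
# keeps the bytes alive even if the original is later deleted.
dir=$(mktemp -d)
mkdir -p "$dir/snapshots/pre-move"
echo "immutable sstable bytes" > "$dir/Users-hd-1-Data.db"
ln "$dir/Users-hd-1-Data.db" "$dir/snapshots/pre-move/Users-hd-1-Data.db"
rm "$dir/Users-hd-1-Data.db"        # e.g. compacted away later
cat "$dir/snapshots/pre-move/Users-hd-1-Data.db"   # still readable
```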

>> Both nodes are 1.1.6, but it might be that we upgrade the target to 1.1.7,
>> as I can't see that this will cause any problems?
I would always do one thing at a time. Upgrade before or after the move, not in the middle of it. 

>> 1)  It's the same number of nodes on both clusters, but does the tokens need
>> to be the same aswell?  (Wouldn't a repair correct that later?)
I *think* you are moving from nodes in one cluster to nodes in a different cluster (i.e. not adding a "data centre" to an existing cluster). In which case it does not matter too much but I would keep them the same. 

>> 2)  Could data files have any name?  Could we, to avoid a filename crash,
>> just substitute the numbers with for example XXX in the data-files?
The names have to match the expected patterns. 

It may be easier to rename the files in your first copy, not the second delta copy. Bump the file numbers enough that all the files in the delta copy do not need to be renamed. 
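A sketch of that renaming, assuming 1.1-style sstable names of the form <CF>-<version>-<generation>-<Component>.db (e.g. Users-hd-5-Data.db; the names, offset, and scratch paths here are made up). It bumps the generation field by a fixed offset so the delta copy's numbers cannot collide:

```shell
#!/bin/sh
# Bump the generation number in 1.1-style sstable names by OFFSET so
# files from the first copy cannot clash with the later delta copy.
OFFSET=1000

# Users-hd-5-Data.db -> Users-hd-1005-Data.db (the generation is the
# second-to-last hyphen-separated field).
bump_name() {
    echo "$1" | awk -F- -v off="$OFFSET" \
        'BEGIN { OFS = "-" } { $(NF-1) += off; print }'
}

# Demo on scratch files standing in for the copied data directory:
dir=$(mktemp -d)
touch "$dir/Users-hd-5-Data.db" "$dir/Users-hd-5-Index.db"
for f in "$dir"/*.db; do
    mv "$f" "$dir/$(bump_name "$(basename "$f")")"
done
ls "$dir"   # Users-hd-1005-Data.db, Users-hd-1005-Index.db
```

All components of one sstable (Data, Index, Filter, ...) must get the same new generation, which the loop above preserves since it rewrites only the generation field.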

>> 3)  Is this really a sane way to do things?
If you are moving data from one set of nodes in a cassandra cluster to another set of nodes in another cluster this is reasonable. You could add the new nodes as a new DC and do the whole thing without down time but you mentioned that was not possible. 

It looks like you are going to have some down time, or can accept some down time, so here's a tweak. You should be able to get the delta copy part done pretty quickly. If that's the case you can:

1) do the main copy
2) stop the old system.
3) do the delta copy
4) start the new system

That way you will not have stale reads in the new system.
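That sequence, plus a repair pass once everything is moved, might look roughly like the following. It is a dry-run sketch: the hostnames, paths, and keyspace/column-family names are made up, and run() only echoes each command since this needs live clusters (drop the echo to execute for real):

```shell
#!/bin/sh
# Dry-run sketch of the tweaked cutover. run() echoes instead of
# executing; every name below is illustrative.
run() { echo "+ $*"; }

# 1) main copy, old cluster still serving traffic
run nodetool -h old-node flush
run rsync -a /var/lib/cassandra/data/ new-node:/var/lib/cassandra/data/
# 2) stop the old system; drain flushes and stops accepting writes
run nodetool -h old-node drain
# 3) delta copy: only the sstables flushed since step 1
run rsync -a --files-from=delta.list /var/lib/cassandra/data/ new-node:/var/lib/cassandra/data/
# 4) start the new system; refresh loads sstables copied in after startup
run nodetool -h new-node refresh MyKeyspace MyColumnFamily
# finally, repair once all the moves are done
run nodetool -h new-node repair
```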
 
Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com


Re: Moving data from one datacenter to another

Posted by "B. Todd Burruss" <bt...@gmail.com>.
To get it "correct", meaning consistent, it seems you will need to do
a repair no matter what, since the source cluster is taking writes
during this time and writing to its commit log.  So to avoid filename
issues, just do the first copy and then repair.  I am not sure if
they can have any filename.

To the question about whether the tokens must be the same, the answer
is that they can't be
(http://www.datastax.com/docs/datastax_enterprise2.0/multi_dc_install).
I believe that as long as your replication factor is > 1, running
repair would fix most any token assignment.
