Posted to user@cassandra.apache.org by David Laube <da...@stormpath.com> on 2013/11/14 21:37:01 UTC

Read inconsistency after backup and restore to different cluster

Hi All,

After running through our backup and restore process FROM our test production TO our staging environment, we are seeing inconsistent reads from the cluster we restored to. We have the same number of nodes in both clusters. For example, we will select data from a column family on the newly restored cluster but sometimes the expected data is returned and other times it is not. These selects are carried out one after another with very little delay. It is almost as if the data only exists on some of the nodes, or perhaps the token ranges are dramatically different --again, we are using vnodes so I am not exactly sure how this plays into the equation.

We are running Cassandra 2.0.2 with vnodes and deploying via Chef. The backup and restore process is currently orchestrated using bash scripts and Chef's distributed SSH. I have outlined the process below for review.


(I) Backup cluster-A (with existing prod data):
1. Run "nodetool flush" on each of the nodes in the 5-node ring.
2. Run "nodetool snapshot keyspace_name" on each of the nodes in the 5-node ring.
3. Archive the snapshot data from the snapshots directory on each node, creating a single archive of the snapshot.
4. Copy the snapshot data archive from each of the nodes to s3 (see the rough shell sketch below).
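
For reference, the rough shell shape of those per-node steps is something like the following; the data path, archive name, S3 bucket and s3cmd client here are placeholders rather than our actual scripts:

    # run on each node in the cluster-A ring
    nodetool flush
    nodetool snapshot keyspace_name

    # archive the snapshot directories created under each column family
    cd /var/lib/cassandra/data/keyspace_name
    tar czf /tmp/$(hostname)-keyspace_name-snapshot.tar.gz */snapshots/

    # ship the archive off to s3 (any S3 client would do)
    s3cmd put /tmp/$(hostname)-keyspace_name-snapshot.tar.gz s3://example-backup-bucket/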


(II) Restore backup FROM cluster-A  TO  cluster-B:
*NOTE: cluster-B is a freshly deployed ring with no data, but with a different cluster name, used for staging.

1. Deploy 5 nodes as part of the cluster-B ring. 
2. Create keyspace_name keyspace and column families on cluster-B.
3. Stop Cassandra on all 5 nodes in the cluster-B ring.
4. Clear commit logs on cluster-B with:  "rm -f /var/lib/cassandra/commitlog/*"
5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes in the new cluster-B ring.
6. Extract the archives to /var/lib/cassandra/data/keyspace_name, ensuring that the column family directories and associated .db files are in place under /var/lib/cassandra/data/keyspace_name/columnfamily1/ ...etc.
7. Start Cassandra on each of the nodes in cluster-B.
8. Run "nodetool repair" on each of the nodes in cluster-B. (See the rough shell sketch of steps 4-8 below.)
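
To give a concrete picture, the per-node restore (steps 4-8) looks roughly like this; the staging path, archive name and service command are placeholders, not our exact scripts:

    # on cluster-B node N, with Cassandra stopped
    rm -f /var/lib/cassandra/commitlog/*

    # unpack the snapshot archive taken from the matching cluster-A node
    mkdir -p /tmp/restore && tar xzf nodeN-keyspace_name-snapshot.tar.gz -C /tmp/restore

    # move each column family's sstables into place, e.g. for columnfamily1:
    cp /tmp/restore/columnfamily1/snapshots/*/* /var/lib/cassandra/data/keyspace_name/columnfamily1/

    # then bring the node back up and repair
    service cassandra start
    nodetool repair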


Please let me know if you see any major errors or deviations from best practices that could be contributing to our read inconsistencies. I'll be happy to answer any specific questions you may have regarding our configuration. Thank you in advance!


Best regards,
-David Laube

Re: Read inconsistency after backup and restore to different cluster

Posted by Julien Campan <ju...@gmail.com>.
Hi David,

I'm not running Cassandra 2.0.2, but I regularly move data from one
Cassandra cluster with vnodes to another.

I would do the same as you for backing up cluster A.

In order to restore cluster B, I do the following steps:


1. Deploy 5 nodes as part of the cluster-B ring.
2. Create keyspace_name keyspace and column families on cluster-B.
3. Copy the backup from each cluster-A node onto a single machine, under:
        /tmp/node1/keyspace_name/cf_name/
        /tmp/node2/keyspace_name/cf_name/
        /tmp/node3/keyspace_name/cf_name/
        /tmp/node4/keyspace_name/cf_name/
        /tmp/node5/keyspace_name/cf_name/
4. Use sstableloader to load the sstables from each directory. Sstableloader
guarantees the data is placed on the correct nodes.
5. Run a repair on each node.


Sstableloader is the right tool for this kind of operation.
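
For example, something like this per source directory (10.0.0.1 here is just a
placeholder for any live cluster-B node):

    # stream node1's sstables into the live cluster-B ring
    sstableloader -d 10.0.0.1 /tmp/node1/keyspace_name/cf_name/

    # and the same again for node2 through node5
    sstableloader -d 10.0.0.1 /tmp/node2/keyspace_name/cf_name/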


Good luck  :)


Julien Campan.



Re: Read inconsistency after backup and restore to different cluster

Posted by Aaron Morton <aa...@thelastpickle.com>.
> we then take the snapshot archive generated FROM cluster-A_node1 and copy/extract/restore TO cluster-B_node1,  then we 
sounds correct.

> Depending on what additional comments/recommendation you or another member of the list may have (if any) based on the clarification I've made above,

Also, if you back up the system data it will bring along the tokens. This can be a pain if you want to change the cluster name.

cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com



Re: Read inconsistency after backup and restore to different cluster

Posted by David Laube <da...@stormpath.com>.
Thank you for the detailed reply, Rob! I have replied to your comments in-line below:

On Nov 14, 2013, at 1:15 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Nov 14, 2013 at 12:37 PM, David Laube <da...@stormpath.com> wrote:
> It is almost as if the data only exists on some of the nodes, or perhaps the token ranges are dramatically different --again, we are using vnodes so I am not exactly sure how this plays into the equation.
> 
> The token ranges are dramatically different, due to vnode random token selection from not setting initial_token, and setting num_tokens.
> 
> You can verify this by listing the tokens per physical node in nodetool gossipinfo or (iirc) nodetool status.
>  
> 5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five nodes in the new cluster-B ring.
> 
> I don't understand this at all, do you mean that you are using one source node's data to load each of the target nodes? Or are you just saying there's a 1:1 relationship between source snapshots and target nodes to load into? Unless you have RF=N, using one source for 5 target nodes won't work.

We have configured RF=3 for the keyspace in question. Also, from a client perspective, we read with CL=1 and write with CL=QUORUM. Since we have 5 nodes total in cluster-A, we snapshot keyspace_name on each of the five nodes which results in a snapshot directory on each of the five nodes that we archive and ship off to s3. We then take the snapshot archive generated FROM cluster-A_node1 and copy/extract/restore TO cluster-B_node1,  then we take the snapshot archive FROM cluster-A_node2 and copy/extract/restore TO cluster-B_node2 and so on and so forth.

> 
> To do what I think you're attempting to do, you have basically two options.
> 
> 1) don't use vnodes and do a 1:1 copy of snapshots
> 2) use vnodes and
>    a) get a list of tokens per node from the source cluster
>    b) put a comma delimited list of these in initial_token in cassandra.yaml on target nodes
>    c) probably have to un-set num_tokens (this part is unclear to me, you will have to test..)
>    d) set auto_bootstrap:false in cassandra.yaml
>    e) start target nodes, they will not-bootstrap into the same ranges as the source cluster
>    f) load schema / copy data into datadir (being careful of https://issues.apache.org/jira/browse/CASSANDRA-6245)
>    g) restart node or use nodetool refresh (I'd probably restart the node to avoid the bulk rename that refresh does) to pick up sstables
>    h) remove auto_bootstrap:false from cassandra.yaml
>    
> I *believe* this *should* work, but have never tried it as I do not currently run with vnodes. It should work because it basically makes implicit vnode tokens explicit in the conf file. If it *does* work, I'd greatly appreciate you sharing details of your experience with the list. 

I'll start with parsing out the token ranges that our vnode config ends up assigning in cluster-A, and doing some creative config work on the target cluster-B we are trying to restore to as you have suggested. Depending on what additional comments/recommendation you or another member of the list may have (if any) based on the clarification I've made above, I will absolutely report back my findings here.


> 
> General reference on tasks of this nature (does not consider vnodes, but treat vnodes as "just a lot of physical nodes" and it is mostly relevant) : http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
> 
> =Rob


Re: Read inconsistency after backup and restore to different cluster

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Nov 14, 2013 at 12:37 PM, David Laube <da...@stormpath.com> wrote:

> It is almost as if the data only exists on some of the nodes, or perhaps
> the token ranges are dramatically different --again, we are using vnodes so
> I am not exactly sure how this plays into the equation.


The token ranges are dramatically different, due to vnode random token
selection from not setting initial_token, and setting num_tokens.

You can verify this by listing the tokens per physical node in nodetool
gossipinfo or (iirc) nodetool status.
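
For example, something like this on one node of each cluster (a rough sketch; I
believe the token is the last column of nodetool ring output in 2.0, but double
check on your version):

    nodetool ring | awk '/^[0-9]/ {print $NF}' | sort > /tmp/cluster-A-tokens.txt   # on cluster-A
    nodetool ring | awk '/^[0-9]/ {print $NF}' | sort > /tmp/cluster-B-tokens.txt   # on cluster-B
    diff /tmp/cluster-A-tokens.txt /tmp/cluster-B-tokens.txt   # expect these to differ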


> 5. Copy 1 of the 5 snapshot archives from cluster-A to each of the five
> nodes in the new cluster-B ring.
>

I don't understand this at all, do you mean that you are using one source
node's data to load each of the target nodes? Or are you just saying
there's a 1:1 relationship between source snapshots and target nodes to
load into? Unless you have RF=N, using one source for 5 target nodes won't
work.

To do what I think you're attempting to do, you have basically two options.

1) don't use vnodes and do a 1:1 copy of snapshots
2) use vnodes and
   a) get a list of tokens per node from the source cluster
   b) put a comma delimited list of these in initial_token in
cassandra.yaml on target nodes
   c) probably have to un-set num_tokens (this part is unclear to me, you
will have to test..)
   d) set auto_bootstrap:false in cassandra.yaml
   e) start target nodes, they will not-bootstrap into the same ranges as
the source cluster
   f) load schema / copy data into datadir (being careful of
https://issues.apache.org/jira/browse/CASSANDRA-6245)
   g) restart node or use nodetool refresh (I'd probably restart the node
to avoid the bulk rename that refresh does) to pick up sstables
   h) remove auto_bootstrap:false from cassandra.yaml

I *believe* this *should* work, but have never tried it as I do not
currently run with vnodes. It should work because it basically makes
implicit vnode tokens explicit in the conf file. If it *does* work, I'd
greatly appreciate you sharing details of your experience with the list.
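
As a very rough, untested sketch of a) and b) -- assuming the token is the last
column of nodetool ring output and using a made-up source node IP:

    # a) collect one source node's tokens as a comma-delimited list
    nodetool ring | awk -v ip=10.0.1.11 '$1 == ip {print $NF}' | paste -sd, - > node1-tokens.txt

    # b) the matching target node's cassandra.yaml would then contain something like:
    #      initial_token: <comma-delimited list from node1-tokens.txt>
    #      auto_bootstrap: false
    #      # and, per c), possibly no num_tokens -- you will have to test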

General reference on tasks of this nature (does not consider vnodes, but
treat vnodes as "just a lot of physical nodes" and it is mostly relevant) :
http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

=Rob