Posted to user@cassandra.apache.org by java8964 <ja...@hotmail.com> on 2014/03/06 15:14:31 UTC

Backup/Restore in Cassandra

Hi, 
I am currently looking into how backup/restore is done in Cassandra, based on the DataStax documentation:
http://www.datastax.com/docs/1.1/backup_restore
Here is one way to do it:
1) Take a full snapshot every week
2) Enable incremental backups every day
So with the last snapshot plus the incremental backups taken after it, you can restore the cluster to the state it was in before it was lost.
Here is my understanding of how Cassandra flushes memtables to SSTable files for snapshots and incremental backups:
1) A full snapshot forces Cassandra to flush all memtables to SSTable files, but an incremental backup does not.
2) An incremental backup just hard-links the SSTables created after the last snapshot.
Here is my question:
If my understanding above is correct, suppose a data change happens after the last snapshot, is recorded in the commit log and stored in a memtable, but has not yet been flushed to an SSTable, and at that point we lose the cluster. Will that change be lost if we restore from only the last snapshot plus the incremental backups? Or do we also need the commit logs, in addition to the last snapshot and the incremental backups, for a restore?
Thanks
Yong 		 	   		  
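
The hard-link point in the question can be illustrated with a minimal sketch. This is plain Python against a temporary directory, not Cassandra code; the file name `example-Data.db` and the helper `hard_link_backup` are made up for illustration. It shows that a hard link is a second directory entry for the same on-disk file, so the backup survives removal of the live file, and it also hints at why a memtable that was never flushed cannot appear in such a backup: there is no file to link to.

```python
import os
import tempfile


def hard_link_backup(sstable_path, backup_dir):
    """Hard-link one file into backup_dir and return the path of the link."""
    os.makedirs(backup_dir, exist_ok=True)
    link_path = os.path.join(backup_dir, os.path.basename(sstable_path))
    os.link(sstable_path, link_path)  # new directory entry, same inode, no copy
    return link_path


def demo():
    with tempfile.TemporaryDirectory() as table_dir:
        # Hypothetical stand-in for a flushed SSTable data file.
        sstable = os.path.join(table_dir, "example-Data.db")
        with open(sstable, "wb") as f:
            f.write(b"flushed rows")

        link = hard_link_backup(sstable, os.path.join(table_dir, "backups"))
        same_inode = os.stat(sstable).st_ino == os.stat(link).st_ino

        os.remove(sstable)  # e.g. the live file goes away later
        with open(link, "rb") as f:
            survived = f.read() == b"flushed rows"
        return same_inode, survived
```

Calling `demo()` returns a pair of booleans: whether both names pointed at the same inode, and whether the data was still readable through the backup link after the original name was removed.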

RE: Backup/Restore in Cassandra

Posted by java8964 <ja...@hotmail.com>.
Hi, Jonathan:
Thanks for your answer. The original goal of my question was not really backup/restore itself, but to see whether we can skip the full snapshot in our ETL, which transfers data from Cassandra's SSTable files into a separate Hadoop cluster.
Right now, our production cluster generates a full snapshot once a week and takes incremental backups every day. Our ETL implementation parses the SSTable files, captures the changes (the delta), and loads them into Hadoop.
Of course, the full snapshot gets bigger every time, so I was wondering whether we could avoid processing it and process only the incremental backups instead, since they are much smaller than the snapshot.
But given how Cassandra currently handles incremental backups and snapshots, I think we have to process the weekly full snapshot to make sure we catch all the changed data.
Thanks
Yong
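
As a rough sketch of the two inputs such an ETL pass has to consider, the snippet below lists the SSTable data files of one snapshot and of the incremental backups for a single table. It assumes the usual on-disk layout, with `snapshots/<tag>/` and `backups/` under the table's data directory and data files ending in `-Data.db`; the tag `weekly`, the file names, and the helper names are hypothetical, and the layout should be checked against the Cassandra version in use.

```python
import os
import tempfile


def data_files(directory):
    """Return sorted paths of SSTable data files directly inside directory."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith("-Data.db")
    )


def etl_inputs(table_dir, snapshot_tag):
    """Files to parse for one table: the named snapshot plus all incrementals."""
    return {
        "snapshot": data_files(os.path.join(table_dir, "snapshots", snapshot_tag)),
        "incrementals": data_files(os.path.join(table_dir, "backups")),
    }


def demo():
    """Build a throwaway directory tree and list what the ETL would read."""
    with tempfile.TemporaryDirectory() as table_dir:
        layout = {
            os.path.join("snapshots", "weekly"): ["t-1-Data.db", "t-1-Index.db"],
            "backups": ["t-2-Data.db", "t-3-Data.db"],
        }
        for sub, names in layout.items():
            os.makedirs(os.path.join(table_dir, sub))
            for name in names:
                open(os.path.join(table_dir, sub, name), "wb").close()
        found = etl_inputs(table_dir, "weekly")
        return {k: [os.path.basename(p) for p in v] for k, v in found.items()}
```

Keeping the two lists separate makes it easy to process the small incremental set daily and the large snapshot set weekly.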

Date: Fri, 7 Mar 2014 10:33:47 -0500
Subject: Re: Backup/Restore in Cassandra
From: jlacefield@datastax.com
To: user@cassandra.apache.org


Re: Backup/Restore in Cassandra

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello,

    Full snapshot forces a flush, yes.
    Incremental hard-links to SSTables, yes.

   This question really depends on how your cluster was "lost".

   Node Loss: You would be able to restore a node by restoring backups
plus the commit log, or just by using repair.
   Cluster Loss (all nodes down, with recoverable machines/disks): You
would be able to restore to the point where you captured your last
incremental backup. If you had a commit log, those operations would be
replayed during node startup. You would also have to restore SSTables
that were written to disk but not captured in an incremental backup.
   Cluster Loss (all nodes down, with unrecoverable machines/disks): You
would only be able to restore to the last incremental backup point. This
assumes you save backups off the cluster.
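
One way to read the three scenarios above is as a small decision function answering the original question: can a write that was never flushed be recovered? This is only a sketch of that reading. The enum and function names are invented for illustration, and the node-loss case assumes other replicas still hold the data so that repair can bring it back.

```python
from enum import Enum


class Loss(Enum):
    NODE = "single node lost"
    CLUSTER_RECOVERABLE = "all nodes down, machines/disks recoverable"
    CLUSTER_UNRECOVERABLE = "all nodes down, machines/disks unrecoverable"


def unflushed_writes_recoverable(loss, commit_log_available):
    """Return True if writes that never reached an SSTable can be recovered."""
    if loss is Loss.NODE:
        # Commit log replay, or repair from replicas that still have the data.
        return True
    if loss is Loss.CLUSTER_RECOVERABLE:
        # Only the commit log on the recovered disks holds those writes.
        return commit_log_available
    # Disks are gone: only backups saved off the cluster survive.
    return False
```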

   The commit log's goal is to provide durability in case of node failure
prior to a flush operation. The commit log is replayed when a node starts
up and repopulates the memtables. There is also a commit log archive and
restore feature:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configLogArchive_t.html
   I have not personally used this feature, so I cannot comment on its
performance/stability.
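
The interaction between the commit log, flushes, snapshots, and incremental backups can be sketched with a toy model. This is a deliberately simplified illustration, not Cassandra's implementation: the class `ToyNode` and the function `restore` are invented here, SSTables are plain dicts, and the commit log is simply cleared on flush, whereas a real node manages commit log segments.

```python
class ToyNode:
    """Toy model of one node's write path (illustrative only)."""

    def __init__(self):
        self.commit_log = []    # every write is appended here first
        self.memtable = {}      # in-memory writes not yet flushed
        self.sstables = []      # flushed, immutable tables (list of dicts)
        self.snapshot = []      # SSTables captured by the last full snapshot
        self.incrementals = []  # SSTables linked since that snapshot

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value

    def flush(self):
        if not self.memtable:
            return
        table = dict(self.memtable)
        self.sstables.append(table)
        self.incrementals.append(table)  # incremental backup links the new SSTable
        self.memtable.clear()
        self.commit_log.clear()          # flushed writes no longer need replay

    def take_snapshot(self):
        self.flush()                     # a full snapshot forces a flush
        self.snapshot = list(self.sstables)
        self.incrementals = []


def restore(snapshot, incrementals, commit_log=()):
    """Rebuild data from backups, optionally replaying a commit log."""
    data = {}
    for table in list(snapshot) + list(incrementals):
        data.update(table)
    for key, value in commit_log:        # replay repopulates unflushed writes
        data[key] = value
    return data


node = ToyNode()
node.write("a", 1)
node.take_snapshot()   # "a" is flushed and captured by the snapshot
node.write("b", 2)
node.flush()           # "b" lands in an incremental backup
node.write("c", 3)     # "c" exists only in the commit log and memtable

without_log = restore(node.snapshot, node.incrementals)
with_log = restore(node.snapshot, node.incrementals, node.commit_log)
```

In this model `without_log` contains `a` and `b` but not `c`, while `with_log` contains all three, which matches the point that an unflushed write is recoverable only if the commit log is available.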

  Does this help?

  BTW: Here's the 1.2 documentation for backup and restore -
http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_backup_restore_c.html
  Here's the 2.0 documentation for backup and restore -
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html

Thanks,

Jonathan



Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
