Posted to user@cassandra.apache.org by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br> on 2013/12/06 14:13:42 UTC

cassandra backup

Hello everyone,

    I am trying to create backups of my data on AWS. My goal is to store
the backups on S3 or Glacier, since storing this kind of data there is
cheap. So, if I have a cluster with N nodes, I would like to copy the data
from all N nodes to S3 and be able to restore it later. I know Priam does
that (we were using it), but I am using the latest Cassandra version and we
plan to move to DSE at some point, so I am not sure Priam fits this case.
    I took a look at the docs:
http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html

    I am trying to understand whether a snapshot is really needed to
create my backup. Suppose I do a flush and copy the SSTables from each
node to S3, one node at a time rather than all at once.
    When I restore my backup, the data from node 1 will be older than the
data from node 2. Will this cause problems? AFAIK, if I am using a
replication factor of 2, for instance, and Cassandra sees data from node X
only, it will automatically copy it to the other nodes, right? Is there any
chance of Cassandra nodes becoming corrupt somehow if I do my backups this
way?

Best regards,
Marcelo Valle.

Re: cassandra backup

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Dec 6, 2013 at 5:13 AM, Marcelo Elias Del Valle <
marcelo@s1mbi0se.com.br> wrote:

>     I am trying to create backups of my data on AWS. My goal is to store
> the backups on S3 or Glacier, since storing this kind of data there is
> cheap. So, if I have a cluster with N nodes, I would like to copy the data
> from all N nodes to S3 and be able to restore it later.
>

https://github.com/synack/tablesnap

Automated backup, restore, purging, intended for use with Cassandra.

=Rob

Re: cassandra backup

Posted by Jonathan Haddad <jo...@jonhaddad.com>.

I believe SSTables are written to a temporary file then moved.  If I
remember correctly, tools like tablesnap listen for the inotify event
IN_MOVED_TO.  This should handle the "try to back up sstable while in
mid-write" issue.
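
A minimal Python sketch of the write-to-a-temporary-file-then-move pattern
mentioned above. It is only an illustration: the ".tmp" suffix and the file
names are hypothetical, not Cassandra's actual naming scheme, and this is not
how tablesnap is implemented (per the note above, it reacts to inotify events
rather than listing directories).

```python
import os
import tempfile


def write_atomically(directory, final_name, payload):
    """Write payload to a temporary file, then rename it into place."""
    tmp_path = os.path.join(directory, final_name + ".tmp")
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    final_path = os.path.join(directory, final_name)
    # On POSIX, a rename within one filesystem is atomic, so the final name
    # never refers to a partially written file.
    os.replace(tmp_path, final_path)
    return final_path


def files_safe_to_copy(directory):
    """List files that do not carry the (hypothetical) in-progress suffix."""
    return sorted(
        name for name in os.listdir(directory) if not name.endswith(".tmp")
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        # Simulate a half-written file that a backup job must not pick up.
        open(os.path.join(directory, "ks-tbl-2-Data.db.tmp"), "wb").close()
        write_atomically(directory, "ks-tbl-1-Data.db", b"rows")
        print(files_safe_to_copy(directory))
```

A watcher that only acts once the final name appears (which is what an
IN_MOVED_TO event signals) gets the same guarantee without polling.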


On Fri, Dec 6, 2013 at 5:39 AM, Michael Theroux <mt...@yahoo.com> wrote:

> Hi Marcelo,
>
> Cassandra provides an eventually consistent model for backups.  You can
> do staggered backups of data, with the idea that if you restore a node and
> then run a repair, your data will once again be consistent.  Cassandra will
> not automatically copy the data to other nodes (other than via hinted
> handoff).  You should manually run repair after restoring a node.
>
> You should take snapshots when doing a backup, as a snapshot keeps the
> data you are backing up tied to a single point in time.  Otherwise
> compaction could add/delete files on you mid-backup or, worse, I imagine
> you could attempt to access an SSTable mid-write.  Snapshots work by using
> hard links, and don't take additional storage to create.  In our process we
> create the snapshot, perform the backup, and then clear the snapshot.
>
> One thing to keep in mind in your S3 cost analysis is that, even though
> storage is cheap, reads/writes to S3 are not (especially writes), because
> each request is billed.  If you are using LeveledCompaction, or otherwise
> have a ton of SSTables, some people have encountered increased costs
> moving the data to S3.
>
> We maintain backup EBS volumes that we regularly snapshot/rsync data to.
> Thus far this has worked very well for us.
>
> -Mike


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade

Re: cassandra backup

Posted by Rahul Menon <ra...@apigee.com>.

You should look at https://github.com/amorton/cassback. I don't believe
it's set up to work with 1.2.10 and above, but I believe it needs just a
few small tweaks to get it running.

Thanks
Rahul



Re: cassandra backup

Posted by Michael Theroux <mt...@yahoo.com>.

Hi Marcelo,

Cassandra provides an eventually consistent model for backups. You can do staggered backups of data, with the idea that if you restore a node and then run a repair, your data will once again be consistent. Cassandra will not automatically copy the data to other nodes (other than via hinted handoff). You should manually run repair after restoring a node.

You should take snapshots when doing a backup, as a snapshot keeps the data you are backing up tied to a single point in time. Otherwise compaction could add/delete files on you mid-backup or, worse, I imagine you could attempt to access an SSTable mid-write. Snapshots work by using hard links, and don't take additional storage to create. In our process we create the snapshot, perform the backup, and then clear the snapshot.
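
To illustrate why a link-based snapshot costs no extra storage when it is created, here is a small Python sketch that hard-links files into a snapshot directory. It is a stand-in for the idea only, not Cassandra's own snapshot code; the directory layout and file names are made up for the demo.

```python
import os
import tempfile


def snapshot_with_hard_links(data_dir, snapshot_dir):
    """Hard-link every regular file in data_dir into snapshot_dir."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        source = os.path.join(data_dir, name)
        if os.path.isfile(source):  # skips sub-directories such as snapshots/
            os.link(source, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        live = os.path.join(data_dir, "ks-tbl-1-Data.db")
        with open(live, "wb") as handle:
            handle.write(b"rows")
        snap_dir = os.path.join(data_dir, "snapshots", "backup-1")
        snapshot_with_hard_links(data_dir, snap_dir)
        copy = os.path.join(snap_dir, "ks-tbl-1-Data.db")
        # Both names point at the same inode, so no data was copied.
        print(os.stat(live).st_ino == os.stat(copy).st_ino)
        os.remove(live)  # stand-in for compaction deleting the live file
        # The snapshot entry still holds the data.
        print(open(copy, "rb").read())
```

The last step is also why clearing the snapshot matters: once compaction removes the live file, the snapshot's link is the only thing keeping those blocks on disk.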

One thing to keep in mind in your S3 cost analysis is that, even though storage is cheap, reads/writes to S3 are not (especially writes), because each request is billed. If you are using LeveledCompaction, or otherwise have a ton of SSTables, some people have encountered increased costs moving the data to S3.
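
A rough way to put numbers on the request side of that cost, sketched in Python. Every input here is an assumption you would replace with your own figures: the price is a placeholder rather than a quoted AWS rate, and the calculation is an upper bound that assumes each backup re-uploads every file with one PUT per file.

```python
def monthly_put_requests(nodes, sstables_per_node, files_per_sstable,
                         backups_per_month):
    """Upper bound on PUT requests if every backup uploads every file."""
    return nodes * sstables_per_node * files_per_sstable * backups_per_month


def request_cost(requests, price_per_1000_requests):
    """Request charge for the given number of requests."""
    return requests / 1000 * price_per_1000_requests


if __name__ == "__main__":
    # Hypothetical cluster: 6 nodes, 5000 SSTables each, 8 component files
    # per SSTable, one full backup per day.
    puts = monthly_put_requests(6, 5000, 8, 30)
    # 0.01 per 1000 requests is a placeholder; check current S3 pricing.
    print(puts, request_cost(puts, 0.01))
```

Uploading only files that are new since the last backup, or bundling many small files into one archive per node, reduces the request count.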

We maintain backup EBS volumes that we regularly snapshot/rsync data to. Thus far this has worked very well for us.

-Mike
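
The create-snapshot, back-up, clear-snapshot cycle described in this message could be scripted roughly as below. This is a sketch under assumptions, not a tested backup tool: it assumes `nodetool` is on the PATH, that snapshots land under `<data_dir>/<keyspace>/<table>/snapshots/<tag>/`, and that `upload` is a callable you supply (hypothetical here) that sends one file to S3, an EBS volume, or wherever the backup lives.

```python
import glob
import os
import subprocess


def run_nodetool(*args):
    """Run nodetool with the given arguments; raises if it exits non-zero."""
    subprocess.run(["nodetool", *args], check=True)


def backup_node(data_dir, tag, upload, nodetool=run_nodetool):
    """Snapshot, pass each snapshot file to upload(), then clear the snapshot."""
    nodetool("snapshot", "-t", tag)
    try:
        pattern = os.path.join(data_dir, "*", "*", "snapshots", tag, "*")
        uploaded = []
        for path in sorted(glob.glob(pattern)):
            upload(path)
            uploaded.append(path)
        return uploaded
    finally:
        # Clear the snapshot even if an upload fails, so its hard links do
        # not keep compacted files on disk.
        nodetool("clearsnapshot", "-t", tag)
```

Running this on one node at a time gives the staggered backups discussed above; after restoring a node from such a backup, run `nodetool repair` on it.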
