Posted to user@cassandra.apache.org by Amalrik Maia <am...@s1mbi0se.com.br> on 2013/12/06 15:41:38 UTC

help on backup multi-node cluster

hey guys, I'm trying to take backups of a multi-node Cassandra cluster and save them on S3.
My idea is simply to ssh to each server and use nodetool to create the snapshots, then push them to S3.

So is this approach recommended? My concern is about the inconsistencies this approach can lead to, since the snapshots are taken one by one and not in parallel.
Should I worry about it, or does Cassandra find a way to deal with inconsistencies when doing a restore?

PS: I'm aware that DataStax recommends using pssh to take snapshots in parallel, but I couldn't use pssh because nodetool requires you to specify the hostname:
nodetool -h 10.10.10.1 snapshot thissnapshotname
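
For example, the loop I have in mind looks roughly like this (the host list is made up, and it assumes passwordless ssh to each node):

    import subprocess

    hosts = ["10.10.10.1", "10.10.10.2", "10.10.10.3"]  # assumed node addresses
    snapshot_name = "thissnapshotname"

    for host in hosts:
        # run nodetool remotely on each node, pointed at that node's own JMX
        subprocess.check_call(
            ["ssh", host, "nodetool -h localhost snapshot " + snapshot_name])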

Any help would be appreciated.
[]'s
   

Re: help on backup multi-node cluster

Posted by Andre Sprenger <an...@getanet.de>.
If you lose the RF nodes that hold all replicas of some piece of data, that
data is gone, so it's a good idea to have a recent backup then. Another
situation is when you deploy a bug in the software and start writing corrupt
data to Cassandra. Replication does not help there, and depending on the
situation you may need to restore from the backup.


2013/12/7 Jason Wee <pe...@gmail.com>

> Hmm... Cassandra's fundamental features include fault tolerance, durability,
> and replication. Just out of curiosity, why would you want to do a backup?
>
> /Jason
>
>
> On Sat, Dec 7, 2013 at 3:31 AM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <am...@s1mbi0se.com.br> wrote:
>>
>>> hey guys, I'm trying to take backups of a multi-node Cassandra cluster and
>>> save them on S3.
>>> My idea is simply to ssh to each server and use nodetool to create
>>> the snapshots, then push them to S3.
>>>
>>
>> https://github.com/synack/tablesnap
>>
>>> So is this approach recommended? My concern is about the inconsistencies
>>> that this approach can lead to, since the snapshots are taken one by one
>>> and not in parallel.
>>> Should I worry about it, or does Cassandra find a way to deal with
>>> inconsistencies when doing a restore?
>>>
>>
>> The backup is as consistent as your cluster is at any given moment, which
>> is "not necessarily". Manual repair brings you closer to consistency, but
>> only on data present when the repair started.
>>
>> =Rob
>>
>
>

Re: help on backup multi-node cluster

Posted by Hannu Kröger <hk...@gmail.com>.
One typical reason is to protect against human error. 

> On 7.12.2013, at 11.09, Jason Wee <pe...@gmail.com> wrote:
> 
> Hmm... Cassandra's fundamental features include fault tolerance, durability, and replication. Just out of curiosity, why would you want to do a backup?
> 
> /Jason
> 
> 
>> On Sat, Dec 7, 2013 at 3:31 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>> On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <am...@s1mbi0se.com.br> wrote:
>>> hey guys, I'm trying to take backups of a multi-node Cassandra cluster and save them on S3.
>>> My idea is simply to ssh to each server and use nodetool to create the snapshots, then push them to S3.
>> 
>> https://github.com/synack/tablesnap
>> 
>>> So is this approach recommended? My concern is about the inconsistencies that this approach can lead to, since the snapshots are taken one by one and not in parallel.
>>> Should I worry about it, or does Cassandra find a way to deal with inconsistencies when doing a restore?
>> 
>> The backup is as consistent as your cluster is at any given moment, which is "not necessarily". Manual repair brings you closer to consistency, but only on data present when the repair started.
>> 
>> =Rob 
> 

Re: help on backup multi-node cluster

Posted by Ray Sutton <ra...@gmail.com>.
I have not used tablesnap, but it appears that it does not necessarily depend
upon taking a Cassandra snapshot. The example given in its documentation
shows the source folder as /var/lib/cassandra/data/GiantKeyspace, which is
the root of the "GiantKeyspace" keyspace. But snapshots operate at the
column-family level and are stored in a subdirectory structure for each
column family. For example, if we have 2 column families in GiantKeyspace,
called cf1 and cf2, the snapshots would be located in
/var/lib/cassandra/data/GiantKeyspace/cf1/snapshots/snapshot_id/ and
/var/lib/cassandra/data/GiantKeyspace/cf2/snapshots/snapshot_id/, where
snapshot_id is some unique identifier for that snapshot. Unless tablesnap
detects changes in subfolders, I don't know how you would tell tablesnap
the name of the actual snapshot folder before the snapshot is taken. I
think tablesnap's premise is that since a snapshot is simply a hard link
to an existing sstable file and sstables are immutable, it can simply
operate on the original sstables, with no need to take a snapshot.
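
To make that concrete, here is a rough sketch of how a script could find
those per-column-family snapshot directories and push their contents to S3
(the paths assume the default data layout above; boto3 and the bucket name
are purely illustrative assumptions, not what anyone here actually runs):

    import glob
    import os
    import boto3  # assumes boto3 is installed and AWS credentials configured

    def upload_snapshot(keyspace, snapshot_id, bucket):
        s3 = boto3.client("s3")
        # one snapshots/<snapshot_id> directory per column family
        pattern = "/var/lib/cassandra/data/%s/*/snapshots/%s" % (keyspace,
                                                                 snapshot_id)
        for snap_dir in glob.glob(pattern):
            for fname in os.listdir(snap_dir):
                path = os.path.join(snap_dir, fname)
                if os.path.isfile(path):
                    # mirror the local path as the S3 key
                    s3.upload_file(path, bucket, path.lstrip("/"))

    upload_snapshot("GiantKeyspace", "some_snapshot_id", "my-backup-bucket")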

However, Cassandra also performs compactions on sstables, combining
sstables into new sstables for the purpose of "de-fragging" row data to
optimize lookups. The pre-compaction sstables are marked for deletion
and removed during the next GC. What this means to me is that you should
use snapshots to preserve a point-in-time state of the data. So there seems
to be a small problem to overcome if using snapshots with tablesnap.

Ideally, to create a completely consistent point-in-time backup you would
stop client access to the cluster (nodetool disablethrift), execute a
flush to write memtables to disk, then execute the snapshot. In reality, if
you can execute the snapshot on all servers within a "short period of
time", for some value of 'short', your data will be relatively consistent.
If you ever needed to perform a restore from these snapshots, Cassandra's
internal read repair feature would fix up any inconsistencies.
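
A minimal sketch of that sequence against a single node (disablethrift,
flush, snapshot, and enablethrift are real nodetool subcommands; the ssh
wrapper, host, and snapshot name are just illustrative):

    import subprocess

    def quiesce_flush_snapshot(host, snapshot_name):
        # stop client (thrift) access, flush memtables to disk, then snapshot
        for cmd in ("disablethrift", "flush", "snapshot " + snapshot_name):
            subprocess.check_call(["ssh", host, "nodetool " + cmd])
        # re-enable client access once the snapshot hard links exist
        subprocess.check_call(["ssh", host, "nodetool enablethrift"])

    quiesce_flush_snapshot("10.10.10.1", "pit_backup")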

I use DataStax OpsCenter to take snapshots and then a homebrew Python
script to upload to S3. OpsCenter sends the snapshot command to all servers
nearly simultaneously, so the snapshots are executed almost in parallel.
This feature might only be available in the Enterprise version. You could
use a simple bash script to execute the nodetool snapshot command via ssh
to each server sequentially, or use a multi-window ssh client (csshX for
OS X, https://code.google.com/p/csshx/ ) to execute in true parallel fashion.
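
Lacking OpsCenter, a small thread pool gets you close to the same
near-simultaneous behavior (host list and snapshot name are made up):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    hosts = ["10.10.10.1", "10.10.10.2", "10.10.10.3"]  # assumed addresses

    def snap(host):
        # each ssh runs concurrently, so snapshots start within moments
        # of each other across the cluster
        subprocess.check_call(["ssh", host, "nodetool snapshot nightly"])

    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        list(pool.map(snap, hosts))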



--
Ray  //o-o\\



On Sat, Dec 7, 2013 at 4:09 AM, Jason Wee <pe...@gmail.com> wrote:

> Hmm... Cassandra's fundamental features include fault tolerance, durability,
> and replication. Just out of curiosity, why would you want to do a backup?
>
> /Jason
>
>
> On Sat, Dec 7, 2013 at 3:31 AM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <am...@s1mbi0se.com.br> wrote:
>>
>>> hey guys, I'm trying to take backups of a multi-node Cassandra cluster and
>>> save them on S3.
>>> My idea is simply to ssh to each server and use nodetool to create
>>> the snapshots, then push them to S3.
>>>
>>
>> https://github.com/synack/tablesnap
>>
>>> So is this approach recommended? My concern is about the inconsistencies
>>> that this approach can lead to, since the snapshots are taken one by one
>>> and not in parallel.
>>> Should I worry about it, or does Cassandra find a way to deal with
>>> inconsistencies when doing a restore?
>>>
>>
>> The backup is as consistent as your cluster is at any given moment, which
>> is "not necessarily". Manual repair brings you closer to consistency, but
>> only on data present when the repair started.
>>
>> =Rob
>>
>
>

Re: help on backup multi-node cluster

Posted by Jason Wee <pe...@gmail.com>.
Hmm... Cassandra's fundamental features include fault tolerance, durability,
and replication. Just out of curiosity, why would you want to do a backup?

/Jason


On Sat, Dec 7, 2013 at 3:31 AM, Robert Coli <rc...@eventbrite.com> wrote:

> On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <am...@s1mbi0se.com.br> wrote:
>
>> hey guys, I'm trying to take backups of a multi-node Cassandra cluster and
>> save them on S3.
>> My idea is simply to ssh to each server and use nodetool to create the
>> snapshots, then push them to S3.
>>
>
> https://github.com/synack/tablesnap
>
>> So is this approach recommended? My concern is about the inconsistencies
>> that this approach can lead to, since the snapshots are taken one by one
>> and not in parallel.
>> Should I worry about it, or does Cassandra find a way to deal with
>> inconsistencies when doing a restore?
>>
>
> The backup is as consistent as your cluster is at any given moment, which
> is "not necessarily". Manual repair brings you closer to consistency, but
> only on data present when the repair started.
>
> =Rob
>

Re: help on backup multi-node cluster

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <am...@s1mbi0se.com.br> wrote:

> hey guys, I'm trying to take backups of a multi-node Cassandra cluster and
> save them on S3.
> My idea is simply to ssh to each server and use nodetool to create the
> snapshots, then push them to S3.
>

https://github.com/synack/tablesnap

> So is this approach recommended? My concern is about the inconsistencies
> that this approach can lead to, since the snapshots are taken one by one
> and not in parallel.
> Should I worry about it, or does Cassandra find a way to deal with
> inconsistencies when doing a restore?
>

The backup is as consistent as your cluster is at any given moment, which
is "not necessarily". Manual repair brings you closer to consistency, but
only on data present when the repair started.

=Rob