Posted to user@cassandra.apache.org by William Oberman <ob...@civicscience.com> on 2011/04/28 21:53:46 UTC

best way to backup

Even with N nodes for redundancy, I still want to have backups.  I'm an
Amazon person, so naturally I'm thinking S3.  Reading over the docs, and
messing with nodetool, it looks like each new snapshot contains the previous
snapshot as a subset (and I've read how Cassandra uses hard links to avoid
excessive disk use).  When does that pattern break down?

I'm basically debating whether I can do an rsync-like backup, or whether I
should do a compressed tar backup.  And I obviously want multiple points in
time.  S3 does allow file versioning when a file name is reused over time
(which only matters in the rsync case).  My only concerns with compressed
tars are that I'll need free space to create the archive, and that I get no
"delta" space savings on the backup (the former is solved by not letting
disk space get too low and/or by adding more nodes to bring down per-node
usage; the latter is solved by S3 being really cheap anyway).

-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: best way to backup

Posted by Jeremy Hanna <je...@gmail.com>.
Good point - we plan to run regular restore tests on the cluster.  We might also spin up a copy of the cluster from a snapshot for testing.

Also, I wonder how much time compression will save when it comes to restores.  I'll have to run some tests on that.  Thanks for posting.

Jeremy

On Apr 28, 2011, at 4:15 PM, Adrian Cockcroft wrote:

> [snip]


Re: best way to backup

Posted by Adrian Cockcroft <ad...@gmail.com>.
Netflix has also gone down this path: we run a regular full backup to
S3 of a compressed tar, and we have scripts that restore everything
into the right place on a different cluster (it needs the same node
count). We also pick up the SSTables as they are created and drop
them in S3.

Whatever you do, make sure you have a regular process to restore the
data and verify that it contains what you think it should...
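
For example, a first-pass automated check might look like this (Python +
boto; bucket name, prefix, and keyspace list are placeholders, and a real
check should go further and actually load data on a scratch cluster):

    import boto
    import tarfile

    BUCKET = 'my-cassandra-backups'        # placeholder
    PREFIX = 'node1/'                      # placeholder
    EXPECTED = ['MyKeyspace', 'system']    # keyspaces that must be present

    bucket = boto.connect_s3().get_bucket(BUCKET)

    # Pick the newest backup tarball under the prefix (ISO timestamps
    # sort lexicographically, so max() on last_modified works).
    keys = [k for k in bucket.list(PREFIX) if k.name.endswith('.tar.gz')]
    latest = max(keys, key=lambda k: k.last_modified)
    latest.get_contents_to_filename('/tmp/restore-check.tar.gz')

    # Cheap sanity check: every keyspace we care about contributes at
    # least one file to the archive.
    names = tarfile.open('/tmp/restore-check.tar.gz').getnames()
    for ks in EXPECTED:
        assert any(ks in n.split('/') for n in names), 'missing ' + ks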

Adrian

On Thu, Apr 28, 2011 at 1:35 PM, Jeremy Hanna
<je...@gmail.com> wrote:
> [snip]

Re: best way to backup

Posted by William Oberman <ob...@civicscience.com>.
My newbie mistake (always good to test things): my script wasn't
storing/restoring the system keyspace, only my own keyspace.  So if you
want to be able to restore from backup, make sure you save both your
keyspace(s) and system!
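
If your backup script has a hard-coded keyspace list like mine did, a
quick check against what's actually on disk would have caught this
(Python; the data dir is the stock default, adjust to your cassandra.yaml):

    import os

    DATA_DIR = '/var/lib/cassandra/data'  # data_file_directories in cassandra.yaml
    BACKED_UP = ['MyKeyspace']            # what my script was saving

    on_disk = [d for d in os.listdir(DATA_DIR)
               if os.path.isdir(os.path.join(DATA_DIR, d))]
    missing = sorted(set(on_disk) - set(BACKED_UP))
    if missing:
        print('NOT backed up: %s' % ', '.join(missing))  # would have shown 'system'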

will

On Thu, Apr 28, 2011 at 4:35 PM, Jeremy Hanna <je...@gmail.com> wrote:

> [snip]


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: best way to backup

Posted by Jeremy Hanna <je...@gmail.com>.
One thing we're looking at doing is watching the cassandra data directory and backing up the sstables to s3 as they are created.  Some guys at simplegeo started tablesnap, which does this:
https://github.com/simplegeo/tablesnap

For every sstable it pushes to s3, it also uploads a json file listing the current files in the directory, so you know what to restore (as far as I understand).
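
The pattern itself is small enough to sketch (Python + boto; this is NOT
tablesnap's actual code, just my reading of the idea, and the key names
are made up):

    import json
    import os
    import boto

    def push_with_manifest(bucket_name, path):
        # Upload one new sstable file, plus a JSON listing of its
        # directory at this moment, so a restore knows which files
        # were live together.
        bucket = boto.connect_s3().get_bucket(bucket_name)
        bucket.new_key(path).set_contents_from_filename(path)

        listing = sorted(os.listdir(os.path.dirname(path)))
        manifest = bucket.new_key(path + '-listdir.json')  # illustrative name
        manifest.set_contents_from_string(json.dumps({path: listing}))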

On Apr 28, 2011, at 2:53 PM, William Oberman wrote:

> [snip]


Re: best way to backup

Posted by William Oberman <ob...@civicscience.com>.
Thanks, I think I'm getting some of the file layout/data structures now, so
that helps with the backup strategy.  I might still start simple, as it's
usually harder to screw up simple, but at least I'll know where I can go
with something more clever.

will

On Sat, Apr 30, 2011 at 9:15 AM, Jeremiah Jordan <JEREMIAH.JORDAN@morningstar.com> wrote:

> [snip]


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

RE: best way to backup

Posted by Jeremiah Jordan <JE...@morningstar.com>.
The files inside the keyspace folders are the SSTables.
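
Concretely, each sstable is a handful of files sharing a name stem -- a
-Data.db file plus matching -Index.db and -Filter.db files (exact naming
varies by version).  Something like this shows what a per-sstable backup
would pick up (Python; default data dir assumed):

    import glob
    import os

    DATA_DIR = '/var/lib/cassandra/data'  # check cassandra.yaml

    for ks in os.listdir(DATA_DIR):
        for data_file in glob.glob(os.path.join(DATA_DIR, ks, '*-Data.db')):
            stem = data_file[:-len('-Data.db')]
            parts = glob.glob(stem + '-*.db')  # Data, Index, Filter, ...
            print('%s: %s (%d component files)'
                  % (ks, os.path.basename(stem), len(parts)))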

> [snip]


Re: best way to backup

Posted by aaron morton <aa...@thelastpickle.com>.
William,
	Some info on the sstables from me: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

	If you want to know more, check out the BigTable and original Facebook papers, linked from the wiki: http://wiki.apache.org/cassandra/ArchitectureOverview

Aaron

On 29 Apr 2011, at 23:43, William Oberman wrote:

> [snip]


Re: best way to backup

Posted by William Oberman <ob...@civicscience.com>.
Dumb question, but referenced twice now: which files are the SSTables and
why is backing them up incrementally a win?

Or should I not bother to understand internals, and instead just roll with
the "backup my keyspace(s) and system in a compressed tar" strategy, as
while it may be excessive, it's guaranteed to work and work easily (which I
like, a great deal).

will

On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday <da...@gmx.net> wrote:

> [snip]


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: best way to backup

Posted by Daniel Doubleday <da...@gmx.net>.
What we are about to set up is a time-machine-like backup. This is more like an add-on to the s3 backup.

Our boxes have an additional larger drive for local backups. Every x hours we create a new backup snapshot that hardlinks the files in the previous snapshot (a bit like cassandra's incremental_backups feature), and then we sync that snapshot dir with the cassandra data dir. We can do archiving / backup to an external system from there without impacting the main data RAID.

But the main reason to do this is to have an 'omg we screwed up big time and deleted / corrupted data' recovery path.
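
For anyone curious, the rotation step is only a few lines -- roughly this
(Python shelling out to cp/rsync; paths invented, no error handling):

    import os
    import subprocess
    import time

    DATA_DIR = '/var/lib/cassandra/data'
    BACKUP_ROOT = '/backup/timemachine'  # the extra local drive

    def rotate():
        snaps = sorted(os.listdir(BACKUP_ROOT))
        new = os.path.join(BACKUP_ROOT, time.strftime('%Y%m%d%H%M'))
        if snaps:
            # Hardlink everything from the newest snapshot: this costs
            # directory entries, not disk space.
            prev = os.path.join(BACKUP_ROOT, snaps[-1])
            subprocess.check_call(['cp', '-al', prev, new])
        else:
            os.makedirs(new)
        # rsync writes changed files to a temp name and renames, so it
        # replaces our hardlinks instead of modifying old snapshots.
        subprocess.check_call(['rsync', '-a', '--delete',
                               DATA_DIR + '/', new + '/'])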

On Apr 28, 2011, at 9:53 PM, William Oberman wrote:

> [snip]


Re: best way to backup

Posted by William Oberman <ob...@civicscience.com>.
Interesting.  Both approaches seem easy to code:
Compress to S3 = cassandra snapshot, tar, s3 put
EBS = cassandra snapshot, rsync snapshot dir -> ebs, ebs snapshot

I think the former is cheaper, as my gut says keeping an EBS volume
around costs more than the lack of deltas in S3 does.  But EBS would
allow extremely fine-grained snapshotting without storage penalties
(EBS snaps are supposed to be compressed deltas behind the scenes).  Of
course, I don't know how well cassandra tolerates frequent snapshotting,
and given the amount of node redundancy/eventual consistency it seems
pointless anyway.
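
For the record, the first recipe comes out to something like this (Python
+ boto, shelling out to nodetool; the bucket name is a placeholder, and
the snapshot dir layout differs a bit across cassandra versions, hence
the glob):

    import glob
    import os
    import subprocess
    import tarfile
    import time
    import boto

    DATA_DIR = '/var/lib/cassandra/data'
    BUCKET = 'my-cassandra-backups'  # placeholder
    TAG = time.strftime('backup-%Y%m%d%H%M')

    # 1. cassandra snapshot (hard links, so fast and nearly free)
    subprocess.check_call(['nodetool', '-h', 'localhost', 'snapshot', TAG])

    # 2. tar -- this is the step that needs the extra free disk space
    archive = '/tmp/%s.tar.gz' % TAG
    tar = tarfile.open(archive, 'w:gz')
    for ks in os.listdir(DATA_DIR):
        # on some versions the snapshot dir is named <timestamp>-<tag>
        for snap in glob.glob(os.path.join(DATA_DIR, ks, 'snapshots', '*' + TAG)):
            tar.add(snap, arcname=ks)
    tar.close()

    # 3. s3 put
    key = boto.connect_s3().get_bucket(BUCKET).new_key('%s.tar.gz' % TAG)
    key.set_contents_from_filename(archive)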

will

On Thu, Apr 28, 2011 at 3:57 PM, Sasha Dolgy <sd...@gmail.com> wrote:

> [snip]



-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Re: best way to backup

Posted by Sasha Dolgy <sd...@gmail.com>.
You could take a snapshot to an EBS volume, then take a snapshot of that
via AWS.  Of course, this is only ok when they -aren't- having outages and
issues ...
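
The AWS half of that is a one-liner with boto (volume id and description
are placeholders; you'd want the cassandra snapshot + sync to finish
first so the volume is in a sane state):

    import boto

    ec2 = boto.connect_ec2()
    # Snapshot the EBS volume holding the cassandra snapshot dir.
    # EBS snapshots are incremental behind the scenes, which is where
    # the "compressed deltas" savings come from.
    ec2.create_snapshot('vol-12345678',  # placeholder volume id
                        description='cassandra backup node1')
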
On Apr 28, 2011 9:54 PM, "William Oberman" <ob...@civicscience.com> wrote:
> [snip]