Posted to user@cassandra.apache.org by KZ Win <kz...@pelotoncycle.com> on 2014/11/05 21:08:47 UTC

different disk foot print of cassandra data folder on copying

I have Cassandra nodes with long uptime.  The disk footprint of the
Cassandra data folder changes when I copy it to a different folder.
Why is that?  I have used both rsync and cp.  This is very confusing
when doing maintenance tasks such as a hardware upgrade on
EC2 or backing up a snapshot.

I am talking about a difference of as much as 100% for 25-40GB of data:
on copying, the data grows to double its original size.  The server's
folder is on an EC2 magnetic instance store and I copied it to various
EBS volumes.  I do not think that it's something weird about EC2; when
I copied the EBS data back to the magnetic instance store, the size
remained the same.  So I am guessing there is some kind of Cassandra
magical compression that is fooling operating system tools like du
and df.

The commitlog folder shows the same issue, but its total size is not
as big and the percentage difference in size is low.

Thanks for any insight you can share

k.z.

Re: different disk foot print of cassandra data folder on copying

Posted by KZ Win <kz...@pelotoncycle.com>.
Duh.  I totally forgot that I take a snapshot just before the daily rsync backup.

k.z.

On Wed, Nov 5, 2014 at 3:13 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Wed, Nov 5, 2014 at 12:08 PM, KZ Win <kz...@pelotoncycle.com> wrote:
>>
>> I have Cassandra nodes with long uptime.  The disk footprint of the
>> Cassandra data folder changes when I copy it to a different folder.
>
>
>>
>> I am talking about a difference of as much as 100% for 25-40GB of data:
>> on copying, the data grows to double its original size.
>
>
> 1) Cassandra automatically "snapshots" SSTables when one does certain
> operations.
> 2) One can also manually create snapshots.
> 3) Snapshots are hard links to files.
> 4) Hard links to files generally become duplicate files when copied to
> another partition, unless rsync or cp is configured to maintain the hard
> link relationship.
> 5) Snapshots are kept in a subdirectory of the data directory for the
> columnfamily.
> 6) This all has the pathological-seeming outcome that snapshots become
> effectively larger as time passes (because the hard links they contain
> become the only copy of the file when the "original" is deleted from the
> data directory via compaction) and might grow significantly when copied.
>
> tl;dr : modify your rsync to include --exclude=snapshots/
>
> =Rob
>

Re: different disk foot print of cassandra data folder on copying

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Nov 5, 2014 at 12:08 PM, KZ Win <kz...@pelotoncycle.com> wrote:

> I have Cassandra nodes with long uptime.  The disk footprint of the
> Cassandra data folder changes when I copy it to a different folder.
>


> I am talking about a difference of as much as 100% for 25-40GB of data:
> on copying, the data grows to double its original size.


1) Cassandra automatically "snapshots" SSTables when one does certain
operations.
2) One can also manually create snapshots.
3) Snapshots are hard links to files.
4) Hard links to files generally become duplicate files when copied to
another partition, unless rsync or cp is configured to maintain the hard
link relationship.
5) Snapshots are kept in a subdirectory of the data directory for the
columnfamily.
6) This all has the pathological-seeming outcome that snapshots become
effectively larger as time passes (because the hard links they contain
become the only copy of the file when the "original" is deleted from the
data directory via compaction) and might grow significantly when copied.
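
To see how much of a data directory is hard-link duplication, one can
compare the size counted per directory entry with the size counted per
inode. The following is a rough Python sketch, not anything shipped with
Cassandra; the function name tree_sizes is made up for illustration. The
first number approximates what a copy that does not preserve hard links
would write, the second is closer to what du reports on the source.

```python
import os


def tree_sizes(root):
    """Return (apparent_bytes, unique_bytes) for non-directory entries under root.

    apparent_bytes counts every directory entry, so a file with two hard
    links is counted twice. unique_bytes counts each (device, inode) pair
    once, so hard-linked files are counted a single time.
    """
    apparent = 0
    unique = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique
```

If the two numbers differ by roughly a factor of two, snapshots that still
share all their files with the live SSTables are a likely explanation.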

tl;dr : modify your rsync to include --exclude=snapshots/
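
For copies made from a script rather than with rsync, the same idea can be
sketched in Python with shutil.copytree and an ignore pattern. This is an
illustration under assumptions: copy_without_snapshots is a hypothetical
helper name, and unlike rsync's --exclude=snapshots/ (which matches only
directories) the pattern below skips any entry named "snapshots", file or
directory.

```python
import shutil


def copy_without_snapshots(src, dst):
    """Copy the tree at src to dst, skipping every entry named 'snapshots'.

    dst must not already exist, which is the default requirement of
    shutil.copytree.
    """
    return shutil.copytree(src, dst, ignore=shutil.ignore_patterns("snapshots"))
```

Alternatively, rsync -H (--hard-links) keeps the snapshots but preserves
the hard link relationship, so the copy should not grow.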

=Rob