Posted to user@cassandra.apache.org by Kevin Burton <bu...@spinn3r.com> on 2014/05/13 01:53:11 UTC

Efficient bulk range deletions without compactions by dropping SSTables.

We have a log only data structure… everything is appended and nothing is
ever updated.

We should be totally fine with having lots of SSTables sitting on disk
because even if we did a major compaction the data would still look the
same.

By 'lots' I mean maybe 1000 max.  Maybe 1GB each.

However, I would like a way to delete older data.

One way to solve this could be to just drop an entire SSTable if all the
records inside have tombstones.

Is this possible, to just drop a specific SSTable?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+
profile<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.

Re: Efficient bulk range deletions without compactions by dropping SSTables.

Posted by Paulo Ricardo Motta Gomes <pa...@chaordicsystems.com>.
Hello Kevin,

In 2.0.x an SSTable is automatically dropped if it contains only
tombstones: https://issues.apache.org/jira/browse/CASSANDRA-5228. However,
this will most likely only happen if you use LCS. STCS creates larger
sstables that will probably mix expired and unexpired data. This could be
solved by single-sstable tombstone compaction, which unfortunately is not
working well (https://issues.apache.org/jira/browse/CASSANDRA-6563).

I don't know of a way to manually drop specific sstables safely. You could
try implementing a script that compares sstable timestamps to check whether
an sstable is safely droppable, as is done in CASSANDRA-5228. There are
proposals for a compaction strategy optimized for log-only data that simply
deletes old sstables, but it's not ready yet AFAIK.

Cheers,

Paulo

On Mon, May 12, 2014 at 8:53 PM, Kevin Burton <bu...@spinn3r.com> wrote:



-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200

Re: Efficient bulk range deletions without compactions by dropping SSTables.

Posted by graham sanderson <gr...@vast.com>.
Just a few data points from our experience

One of our use cases involves storing a periodic full base state for millions of records, then fairly frequent delta updates to subsets of the records in between. C* is great for this because we can read the whole row (or up to the clustering key/column marking “now” as perceived by the client) and munge the base + deltas together in the client.

To keep rows small (and for recovery), we start over in a new CF whenever we start a new base state.

The upshot is that we have pretty much the same scenario as Jeremy is describing.

For this use case we are also using Astyanax (but C* 2.0.5).

We have not come across many of the schema problems you mention (likely attributable to changes in the 2.0.x line); however, one thing to note is that Astyanax itself seems to be very picky about unresolved schema changes. We found that we had to do schema changes via a CQL "create table" (we can still use Astyanax for that) rather than creating the CF via old-style thrift.


On May 13, 2014, at 9:42 AM, Jeremy Powell <je...@gmail.com> wrote:

> Hi Kevin,
> 
> C* version: 1.2.xx
> Astyanax: 1.56.xx
> 
> We basically do this same thing in one of our production clusters, but rather than dropping SSTables, we drop Column Families. We time-bucket our CFs, and when a CF has passed some time threshold (metadata or embedded in CF name), it is dropped. This means there is a home-grown system that is doing the bookkeeping/maintenance rather than relying on C*'s inner workings. It is unfortunate that we have to maintain a system which maintains CFs, but we've been in a pretty good state for the last 12 months using this method.
> 
> Some caveats:
> 
> By default, C* makes snapshots of your data when a table is dropped. You can leave that and have something else clear up the snapshots, or if you're less paranoid, set auto_snapshot: false in the cassandra.yaml file.
> 
> Cassandra does not handle 'quick' schema changes very well, and we found that only one node should be used for these changes. When adding or removing column families, we have a single, property-defined C* node that is designated as the schema node. After making a schema change, we had to throw in an artificial delay to ensure that the schema change propagated through the cluster before making the next schema change. And of course, relying on a single node being up for schema changes is less than ideal, so handling failover to a new node is important.
> 
> The final, and hardest problem, is that C* can't really handle schema changes while a node is being bootstrapped (new nodes, replacing a dead node). If a column family is dropped, but the new node has not yet received that data from its replica, the node will fail to bootstrap when it finally begins to receive that data - there is no column family for the data to be written to, so that node will be stuck in the joining state, and its system keyspace needs to be wiped and re-synced to attempt to get back to a happy state. This unfortunately means we have to stop schema changes when a node needs to be replaced, but we have this flow down pretty well.
> 
> Hope this helps,
> Jeremy Powell
> 
> 
> On Mon, May 12, 2014 at 5:53 PM, Kevin Burton <bu...@spinn3r.com> wrote:


Re: Efficient bulk range deletions without compactions by dropping SSTables.

Posted by Kevin Burton <bu...@spinn3r.com>.
>
>
> We basically do this same thing in one of our production clusters, but
> rather than dropping SSTables, we drop Column Families. We time-bucket our
> CFs, and when a CF has passed some time threshold (metadata or embedded in
> CF name), it is dropped. This means there is a home-grown system that is
> doing the bookkeeping/maintenance rather than relying on C*'s inner
> workings. It is unfortunate that we have to maintain a system which
> maintains CFs, but we've been in a pretty good state for the last 12 months
> using this method.
>
>
Yup.  This is exactly what we do for MySQL, but it's kind of a shame to
have to do it with Cassandra.  The SSTable system can do it directly.  I
had been working on a bigtable implementation (which I put on hold for now)
that supported this feature.  Since Cassandra can do it directly, it seems
a shame that it doesn't.

Also, this means you generally have to build a duplicate query layer on top
of CQL.  For example, if a range query spans a time boundary between
tables, you have to scan both.

And if you're doing lookups by key, you also have to scan all the temporal
column families, which is exactly what Cassandra already does internally
with SSTables.
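As a rough illustration of that fan-out (the weekly bucketing and the `events_N` naming scheme are made up for this sketch):

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 7 * 24 * 3600  # one column family per week (hypothetical scheme)

def bucket_name(ts):
    """Map a timestamp to the CF holding it, e.g. 'events_2283'."""
    return "events_%d" % (int(ts.timestamp()) // BUCKET_SECONDS)

def tables_for_range(start, end):
    """A range query spanning buckets must scan every overlapping CF
    and merge the results client-side."""
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    return ["events_%d" % b for b in range(first, last + 1)]

start = datetime(2014, 5, 1, tzinfo=timezone.utc)
end = datetime(2014, 5, 20, tzinfo=timezone.utc)
print(tables_for_range(start, end))  # three weekly buckets to scan
```

A key lookup is the worst case: without extra metadata it has to probe every live bucket, which is essentially what Cassandra already does per-SSTable with bloom filters.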


> Some caveats:
>
> By default, C* makes snapshots of your data when a table is dropped. You
> can leave that and have something else clear up the snapshots, or if you're
> less paranoid, set auto_snapshot: false in the cassandra.yaml file.
>
> Cassandra does not handle 'quick' schema changes very well, and we found
> that only one node should be used for these changes. When adding or
> removing column families, we have a single, property-defined C* node that
> is designated as the schema node. After making a schema change, we had to
> throw in an artificial delay to ensure that the schema change propagated
> through the cluster before making the next schema change. And of course,
> relying on a single node being up for schema changes is less than ideal, so
> handling failover to a new node is important.
>
> The final, and hardest problem, is that C* can't really handle schema
> changes while a node is being bootstrapped (new nodes, replacing a dead
> node). If a column family is dropped, but the new node has not yet received
> that data from its replica, the node will fail to bootstrap when it finally
> begins to receive that data - there is no column family for the data to be
> written to, so that node will be stuck in the joining state, and its
> system keyspace needs to be wiped and re-synced to attempt to get back to a
> happy state. This unfortunately means we have to stop schema changes when a
> node needs to be replaced, but we have this flow down pretty well.
>
>
Nice.  This was excellent feedback.

Thanks!

Re: Efficient bulk range deletions without compactions by dropping SSTables.

Posted by Jeremy Powell <je...@gmail.com>.
Hi Kevin,

C* version: 1.2.xx
Astyanax: 1.56.xx

We basically do this same thing in one of our production clusters, but
rather than dropping SSTables, we drop Column Families. We time-bucket our
CFs, and when a CF has passed some time threshold (metadata or embedded in
CF name), it is dropped. This means there is a home-grown system that is
doing the bookkeeping/maintenance rather than relying on C*'s inner
workings. It is unfortunate that we have to maintain a system which
maintains CFs, but we've been in a pretty good state for the last 12 months
using this method.

Some caveats:

By default, C* makes snapshots of your data when a table is dropped. You
can leave that and have something else clear up the snapshots, or if you're
less paranoid, set auto_snapshot: false in the cassandra.yaml file.
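For reference, the knob mentioned above lives in cassandra.yaml, and existing snapshots can be removed with `nodetool clearsnapshot`:

```
# cassandra.yaml: skip the safety snapshot taken on DROP/TRUNCATE
auto_snapshot: false
```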

Cassandra does not handle 'quick' schema changes very well, and we found
that only one node should be used for these changes. When adding or
removing column families, we have a single, property-defined C* node that
is designated as the schema node. After making a schema change, we had to
throw in an artificial delay to ensure that the schema change propagated
through the cluster before making the next schema change. And of course,
relying on a single node being up for schema changes is less than ideal, so
handling failover to a new node is important.
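That serialize-and-wait flow might be sketched like this; `schema_agreement_reached` is a hypothetical stand-in for however you verify propagation (e.g. comparing schema versions across nodes):

```python
import time

def apply_schema_changes(changes, execute, schema_agreement_reached,
                         settle_seconds=10.0, timeout=120.0):
    """Run DDL statements one at a time from a single designated node,
    waiting for the cluster to agree on the schema between each change."""
    for ddl in changes:
        execute(ddl)
        deadline = time.monotonic() + timeout
        while not schema_agreement_reached():
            if time.monotonic() > deadline:
                raise RuntimeError("schema did not propagate: %s" % ddl)
            time.sleep(1.0)
        time.sleep(settle_seconds)  # extra artificial delay, as described above

# Demo with stand-ins: record each DDL instead of executing it.
applied = []
apply_schema_changes(["CREATE TABLE log_20140601 (...)",
                      "DROP TABLE log_20140401"],
                     execute=applied.append,
                     schema_agreement_reached=lambda: True,
                     settle_seconds=0)
print(applied)
```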

The final, and hardest problem, is that C* can't really handle schema
changes while a node is being bootstrapped (new nodes, replacing a dead
node). If a column family is dropped, but the new node has not yet received
that data from its replica, the node will fail to bootstrap when it finally
begins to receive that data - there is no column family for the data to be
written to, so that node will be stuck in the joining state, and its
system keyspace needs to be wiped and re-synced to attempt to get back to a
happy state. This unfortunately means we have to stop schema changes when a
node needs to be replaced, but we have this flow down pretty well.

Hope this helps,
Jeremy Powell


On Mon, May 12, 2014 at 5:53 PM, Kevin Burton <bu...@spinn3r.com> wrote:
