Posted to user@cassandra.apache.org by Ben Boule <Be...@rapid7.com> on 2013/06/17 17:37:26 UTC

State of Cassandra-Shuffle (1.2.x)

A bit of background:

We are in Beta and have a very small (2-node) cluster that we created with 1.2.1.  Being new to this we did not enable vnodes, and we got bit hard by the default token generation in production after setting up lots of development & QA clusters without running into the problem.  We ended up with about 97.5% of the token range belonging to one of the two nodes.  The good thing is that even one Cassandra node is doing OK right now with our load.  The bad thing, of course, is that we would still rather it be balanced.  There is only about 120 GB of data.

We would like to upgrade this cluster to vnodes.  We first tried doing this on 1.2.1; it did not work due to the bug where the shuffle job inserted a corrupted row into the system.range_xfers column family.  Last week I talked to several people at the summit, and it was recommended we try this with 1.2.5.
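
For anyone else who hits this: the bad row showed up in the system keyspace, and it should be visible with something like the following from cqlsh (from memory, so the exact columns may differ):

cqlsh> SELECT * FROM system.range_xfers;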

I have a test cluster I am trying to run this procedure on.  I set it up with 1 token per node, then upgraded it to vnodes, then upgraded it to 1.2.5 with no problems on Friday, and let the shuffle run over the weekend.  All appeared to be well when I left: there were something like 500 total relocations generated, it had chugged through ~100 of them after an hour or so, and it looked like it was heading towards being balanced.
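
For reference, the procedure I followed was roughly the standard one (paraphrasing from memory, so the exact sub-commands and paths may be slightly off):

# 1. On each node, switch cassandra.yaml over to vnodes and do a rolling
#    restart; each node splits its existing range into 256 contiguous
#    vnodes, so ownership does not change at this point:
#        num_tokens: 256
#        # initial_token: (left unset)

# 2. Schedule the relocations once, from any node, then kick them off:
bin/cassandra-shuffle create
bin/cassandra-shuffle enable

# 3. List what is still pending:
bin/cassandra-shuffle ls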

----@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  10.10.1.161  1.02 GB    254     66.8%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  10.10.1.160  1.1 GB     258     33.2%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

When I came in to work today (Monday), there were 189 relocations to go, and this is what the status looked like:

----@ip-10-10-1-160:/var/lib/cassandra/data/Keyspace1/Standard1/snapshots# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  10.10.1.161  48.11 GB   231     38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  10.10.1.160  34.5 GB    281     61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

An hour later, it looks like this:

-----@ip-10-10-1-160:/tmp# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  10.10.1.161  11.61 GB   231     38.7%  6d500bc6-95fb-47a3-afb5-c283c4f3de03  rack1
UN  10.10.1.160  931.45 MB  281     61.3%  186d99b8-9fde-4e50-959a-6fba6098fba6  rack1

I did notice that it had fallen behind on compaction while this was running.

-----@ip-10-10-1-161:~$ nodetool compactionstats
pending tasks: 6
          compaction type        keyspace   column family       completed           total      unit  progress
               Compaction       Keyspace1       Standard1      1838641124      5133428315     bytes    35.82%
               Compaction       Keyspace1       Standard1      2255463423      5110283630     bytes    44.14%
Active compaction remaining time :   0h06m06s

The reduction in disk space did seem to correspond to about half of the compaction jobs finishing.  Disk usage seems to bounce up and down as the shuffle runs, consuming huge amounts of space and then freeing it up.

My question is: what can we expect out of this job?  Should it really be working?  Do we need to expect it to consume 70-100x the data size in extra disk space while it runs?  Are there compaction options we can set ahead of time to minimize the penalty here?  How much extra space should we expect while it runs, and how much once it is done?  Note that in my test cluster I used a keyspace created by cassandra-stress, which uses the default compaction settings (SizeTiered with the default thresholds).  In our real cluster, we did configure compaction.
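
For context, by "compaction options" I mean things like tuning the size-tiered thresholds or switching to leveled compaction before starting the shuffle; something along these lines (keyspace and table names are illustrative, not our real schema):

cqlsh> ALTER TABLE "MyKeyspace"."MyTable"
   ... WITH compaction = {'class': 'SizeTieredCompactionStrategy',
   ...                    'min_threshold': 4, 'max_threshold': 32};

cqlsh> ALTER TABLE "MyKeyspace"."MyTable"
   ... WITH compaction = {'class': 'LeveledCompactionStrategy',
   ...                    'sstable_size_in_mb': 160};

Leveled compaction is supposed to need much less free-space headroom per compaction than size-tiered, at the cost of more I/O, which is part of why I am asking whether it is worth switching ahead of time.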

Our original plan, when the job didn't work against 1.2.1, was to bring up a new cluster alongside the old one, pre-configured for vnodes, and then migrate our data out of the old cluster into the new one.  Obviously this requires us to write our own software to do the migration.  We are going to size up the new cluster as well and update the schema, so it's not a total waste, but we would have liked to be able to balance the load on the original cluster in the meantime.

Any advice?  We are planning to migrate to 2.0 later this summer but probably don't want to build it from the beta source ourselves right now.

Thank you,
Ben Boule

Re: State of Cassandra-Shuffle (1.2.x)

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Jun 17, 2013 at 8:37 AM, Ben Boule <Be...@rapid7.com> wrote:
> We are in Beta and have a very small (2-node) cluster that we created
> with 1.2.1.

https://issues.apache.org/jira/browse/CASSANDRA-5525

May be relevant?

What RF is this cluster? Given that you're in beta and the cluster and
data size are this small, I would probably just dump and reload instead
of trying to make shuffle work.

http://palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra
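
For a cluster this small, that can be as simple as snapshotting the old
nodes and streaming the data into the new cluster with sstableloader,
roughly like this (keyspace/table names and paths illustrative):

# on each old node
nodetool snapshot MyKeyspace -t premigration

# from a machine that can reach the new cluster; the last two path
# components of the directory must be <keyspace>/<column_family>
sstableloader -d new-node-1,new-node-2 /path/to/MyKeyspace/MyTable/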

> Being new to this we did not enable vnodes, and we got bit hard by
> the default token generation in production after setting up lots of
> development & QA clusters without running into the problem.

It will perhaps be some consolation to hear that this insane
misfeature of automatic token assignment by range bisection is finally
going away in Cassandra 2.0.

> Any advice?  We are planning to migrate to 2.0 later this summer but
> probably don't want to build it from the beta source ourselves right now.

Eric Evans gave a talk at the summit during which he attempted to
communicate that people probably shouldn't use shuffle. Given that he
is the one who wrote the shuffle patch, this seems like a meaningful
data point... :)

=Rob