Posted to commits@cassandra.apache.org by "Dan Kinder (JIRA)" <ji...@apache.org> on 2015/03/12 23:22:38 UTC

[jira] [Commented] (CASSANDRA-8961) Data rewrite case causes almost non-functional compaction

    [ https://issues.apache.org/jira/browse/CASSANDRA-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359540#comment-14359540 ] 

Dan Kinder commented on CASSANDRA-8961:
---------------------------------------

I see. Is there some way to make this DELETE query not use RangeTombstones? Would it work to specify the full set of columns (e.g. DELETE pk, data FROM ...)?
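
For illustration, here is a minimal sketch of what I mean, assuming a hypothetical trial.tbl2 that adds a regular (non-key) payload column; as far as I know CQL only lets a DELETE name non-primary-key columns, so a literal "DELETE pk, data FROM ..." would be rejected. This also assumes the trial keyspace from the script below already exists:

{code}
from cassandra.cluster import Cluster

cluster = Cluster(['localhost'])
db = cluster.connect()

# Hypothetical variant of the table below, with a regular column to delete.
db.execute("""CREATE TABLE IF NOT EXISTS trial.tbl2 (
                pk text,
                ck text,
                payload text,
                PRIMARY KEY (pk, ck)
              )""")

# Whole-row delete: with a clustering column in the key, my understanding is
# this gets stored as a range tombstone covering the row (pre-3.0 storage).
db.execute("DELETE FROM trial.tbl2 WHERE pk = 'thepk' AND ck = '42'")

# Column-level delete: only the payload cell is tombstoned, so presumably no
# range tombstone is written, but it cannot remove the key columns themselves.
db.execute("DELETE payload FROM trial.tbl2 WHERE pk = 'thepk' AND ck = '42'")
{code}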

Also CASSANDRA-6446 seems related.

> Data rewrite case causes almost non-functional compaction
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8961
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8961
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Centos 6.6, Cassandra 2.0.12 (Also seen in Cassandra 2.1)
>            Reporter: Dan Kinder
>            Priority: Minor
>
> There seems to be a bug of some kind where compaction grinds to a halt in this use case: from time to time we have a set of rows we need to "migrate", changing their primary key by deleting each row and inserting a new row with the same partition key but a different clustering key. The python script below demonstrates this; it takes a bit of time to run (I didn't try to optimize it), but when it's done Cassandra will be trying to compact a few hundred megabytes of data for a long time... on the order of days, or it will never finish.
> Not verified by this sandboxed experiment, but compression settings do not seem to matter, and this appears to happen with STCS as well, not just LCS. I am still testing whether other patterns cause this terrible compaction performance, such as deleting all rows and then inserting, or vice versa (a rough sketch of the former follows the script below).
> Even if it isn't a "bug" per se, is there a way to fix or work around this behavior?
> {code}
> import string
> import random
> from cassandra.cluster import Cluster
> cluster = Cluster(['localhost'])
> db = cluster.connect()
> db.execute("DROP KEYSPACE IF EXISTS trial")
> db.execute("""CREATE KEYSPACE trial
>               WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 }""")
> db.execute("""CREATE TABLE trial.tbl (
>                 pk text,
>                 data text,
>                 PRIMARY KEY(pk, data)
>               ) WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
>                 AND compression = {'sstable_compression': ''}""")
> # Number of rows to insert and "move"
> n = 200000
>
> # Insert n rows with the same partition key, 1KB of unique data in cluster key
> for i in range(n):
>     db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)",
>         [str(i).zfill(1024)])
> # Update those n rows, deleting each and replacing with a very similar row
> for i in range(n):
>     val = str(i).zfill(1024)
>     db.execute("DELETE FROM trial.tbl WHERE pk = 'thepk' AND data = %s", [val])
>     db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)", ["1" + val])
> {code}
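> For reference, the "deleting all rows then inserting" variant I'm still testing would look roughly like this (same schema, session, and n as above; not yet verified to show the same behavior):
> {code}
> # Variant: delete all n rows first, then insert the n replacement rows,
> # instead of interleaving one delete and one insert per row as above.
> for i in range(n):
>     db.execute("DELETE FROM trial.tbl WHERE pk = 'thepk' AND data = %s",
>         [str(i).zfill(1024)])
> for i in range(n):
>     db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)",
>         ["1" + str(i).zfill(1024)])
> {code}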


