
Posted to commits@cassandra.apache.org by "Fabien Rousseau (JIRA)" <ji...@apache.org> on 2016/03/13 12:49:33 UTC

[jira] [Created] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

Fabien Rousseau created CASSANDRA-11349:
-------------------------------------------

             Summary: MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
                 Key: CASSANDRA-11349
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
             Project: Cassandra
          Issue Type: Bug
            Reporter: Fabien Rousseau


We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.

After investigation, it appears that if two range tombstones exist for a partition over the same range/interval, both are included in the merkle tree computation.
However, if on another node the two range tombstones have already been compacted into a single range tombstone, the merkle trees will differ.
This is clearly a problem, because MerkleTree differences then depend on compaction state. If a partition is deleted and re-created multiple times, the only way to ensure that repair works correctly and does not overstream data is to run a major compaction before each repair, which is not really feasible.
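
The compaction dependence described above can be illustrated with a small, self-contained sketch. This is not Cassandra's actual MerkleTree or digest code: the tuple model of a range tombstone, the `digest` function and the `merge_same_interval` helper are all hypothetical names introduced for illustration only. It only shows why hashing every tombstone as stored gives a different result from hashing the compacted form, and that normalizing before hashing (one possible idea, not necessarily the fix adopted for this ticket) removes the difference.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical simplified model: (interval_start, interval_end, deletion_timestamp).
RangeTombstone = Tuple[str, str, int]


def digest(tombstones: List[RangeTombstone]) -> str:
    """Hash every tombstone in order, as a stand-in for a partition digest."""
    h = hashlib.sha256()
    for start, end, ts in tombstones:
        h.update(f"{start}|{end}|{ts};".encode("utf-8"))
    return h.hexdigest()


def merge_same_interval(tombstones: List[RangeTombstone]) -> List[RangeTombstone]:
    """Keep only the newest tombstone per identical interval (assumed compaction result)."""
    newest: Dict[Tuple[str, str], int] = {}
    for start, end, ts in tombstones:
        key = (start, end)
        if key not in newest or ts > newest[key]:
            newest[key] = ts
    return sorted((s, e, ts) for (s, e), ts in newest.items())


# One node still holds both tombstones; the other has already compacted them.
node_uncompacted = [("b", "b", 100), ("b", "b", 200)]
node_compacted = merge_same_interval(node_uncompacted)

# Hashing the raw contents disagrees even though both nodes hold the same logical data.
naive_mismatch = digest(node_uncompacted) != digest(node_compacted)

# Normalizing before hashing makes the digests agree.
normalized_match = digest(merge_same_interval(node_uncompacted)) == digest(node_compacted)

print("naive digests differ:", naive_mismatch)
print("normalized digests match:", normalized_match)
```

In this toy model the two nodes are logically identical, yet the naive digests differ purely because of how the tombstones happen to be stored.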

The following steps make it easy to reproduce this case:

ccm create test -v 2.1.13 -n 2 -s
ccm node1 cqlsh
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 float,
    c4 float,
    PRIMARY KEY ((c1), c2)
);
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
# now flush only one of the two nodes
ccm node1 flush 
ccm node1 cqlsh
USE test_rt;
INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
ctrl ^d
ccm node1 repair
# now grep the log and observe that some inconsistencies were detected between nodes (although none should have been)
ccm node1 showlog | grep "out of sync"

The consequences are a costly repair and an accumulation of many small SSTables (up to thousands over a rather short period when using VNodes, until compaction absorbs those small files), as well as increased size on disk.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)