Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2015/01/09 10:15:34 UTC

[jira] [Created] (CASSANDRA-8589) Reconciliation in presence of tombstone might yield stale data

Sylvain Lebresne created CASSANDRA-8589:
-------------------------------------------

             Summary: Reconciliation in presence of tombstone might yield stale data
                 Key: CASSANDRA-8589
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8589
             Project: Cassandra
          Issue Type: Bug
            Reporter: Sylvain Lebresne


Consider 3 replicas A, B, C (so RF=3) and the following sequence of operations, all performed at {{QUORUM}}. For each operation I indicate the replicas that acknowledged it (and let's assume that a replica that doesn't ack is a replica that doesn't get the update):
{noformat}
CREATE TABLE test (k text, t int, v int, PRIMARY KEY (k, t));

INSERT INTO test(k, t, v) VALUES ('k', 0, 0); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 1, 1); // acked by A, B and C
INSERT INTO test(k, t, v) VALUES ('k', 2, 2); // acked by A, B and C

DELETE FROM test WHERE k='k' AND t=1;         // acked by A and C

UPDATE test SET v = 3 WHERE k='k' AND t=2;    // acked by B and C

SELECT * FROM test WHERE k='k' LIMIT 2;       // answered by A and B
{noformat}
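Every operation has achieved quorum, but on the last read A will respond {{0->0, tombstone 1, 2->2}} and B will respond {{0->0, 1->1}} (B never received the delete, so it reaches the limit of 2 before getting to {{t=2}}). During reconciliation the tombstone from A suppresses {{1->1}}, so we'll answer {{0->0, 2->2}}, which is incorrect: we should respond {{0->0, 2->3}}.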
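
To make the scenario concrete, here is a minimal Python sketch of it. This is a toy model and not Cassandra code: the helpers {{replica_read}} and {{reconcile}}, the dict-based replica state and the numeric write timestamps are all assumptions made for illustration.

```python
# Toy model: each replica maps clustering key t -> (value, write_timestamp).
# A value of None stands for a tombstone. All names here are hypothetical.
TOMBSTONE = None

def replica_read(data, limit):
    """Scan in clustering order until `limit` live rows are collected.

    Tombstones met along the way are returned too, but do not count
    towards the limit.
    """
    out, live = {}, 0
    for t in sorted(data):
        if live == limit:
            break
        out[t] = data[t]
        if data[t][0] is not TOMBSTONE:
            live += 1
    return out

def reconcile(responses, limit):
    """Merge responses cell by cell (highest timestamp wins), drop
    tombstoned rows, then apply the limit."""
    merged = {}
    for resp in responses:
        for t, cell in resp.items():
            if t not in merged or cell[1] > merged[t][1]:
                merged[t] = cell
    live = [(t, v) for t, (v, _) in sorted(merged.items()) if v is not TOMBSTONE]
    return live[:limit]

# Timestamps: the three INSERTs at 1, the DELETE at 2, the UPDATE at 3.
A = {0: (0, 1), 1: (TOMBSTONE, 2), 2: (2, 1)}  # A missed the UPDATE
B = {0: (0, 1), 1: (1, 1), 2: (3, 3)}          # B missed the DELETE

resp_a = replica_read(A, limit=2)  # 0->0, tombstone 1, 2->2
resp_b = replica_read(B, limit=2)  # 0->0, 1->1 (stops before t=2)

stale = reconcile([resp_a, resp_b], limit=2)
expected = reconcile([A, B], limit=2)  # what we would get without per-replica limits
print(stale)     # [(0, 0), (2, 2)]
print(expected)  # [(0, 0), (2, 3)]
```

In this model the only replica contributing a value for {{t=2}} is A, which is exactly the replica that missed the update.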

Put another way, if we have a limit, every replica honors that limit, but since tombstones can "suppress" results from other nodes, we may end up with some cells for which we don't actually get a quorum of responses (even though globally we do have a quorum of replica responses).

In practice, this probably occurs rather rarely, so the "simpler" fix is probably to do something similar to the "short reads protection": detect when this could have happened (based on how replica responses are reconciled) and do an additional request in that case. That detection will potentially have false positives, but I suspect we can be precise enough that those false positives will be very rare (we should nonetheless track how often this code gets triggered, and if we see that it's more often than we think, we could proactively bump user limits internally to reduce those occurrences).
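
As a rough illustration of what such a detection could look like, here is a Python sketch of one possible heuristic. It is an assumption of mine and not the actual fix: a replica that returned {{limit}} live rows only "covers" keys up to the last one it sent, so if the merged result contains a row beyond that point, that row was not vouched for by this replica and an additional request would be needed. The function name {{needs_extra_read}} and the response representation (clustering key mapped to a (value, timestamp) pair, with {{None}} as tombstone) are hypothetical.

```python
TOMBSTONE = None  # hypothetical marker for a deleted row

def needs_extra_read(responses, limit):
    """Return True if some row of the merged result lies beyond the range
    that a limit-bound replica actually covered."""
    # Merge cell by cell, highest timestamp wins.
    merged = {}
    for resp in responses:
        for t, cell in resp.items():
            if t not in merged or cell[1] > merged[t][1]:
                merged[t] = cell
    result_keys = [t for t in sorted(merged)
                   if merged[t][0] is not TOMBSTONE][:limit]

    for resp in responses:
        live_count = sum(1 for v, _ in resp.values() if v is not TOMBSTONE)
        if live_count < limit or not resp:
            # The replica ran out of data before hitting the limit,
            # so it has effectively answered for every key.
            continue
        last_covered = max(resp)
        if any(t > last_covered for t in result_keys):
            return True  # this replica never answered for that row
    return False

# Responses from the example above (limit 2).
resp_a = {0: (0, 1), 1: (TOMBSTONE, 2), 2: (2, 1)}
resp_b = {0: (0, 1), 1: (1, 1)}
print(needs_extra_read([resp_a, resp_b], limit=2))  # True

# Without any tombstone, both replicas cover the same range.
print(needs_extra_read([resp_b, resp_b], limit=2))  # False
```

This check is conservative: it can return True even when the uncovered replica holds nothing newer, which is the kind of false positive mentioned above.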



