You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Aleksey Yeschenko (Jira)" <ji...@apache.org> on 2020/04/03 18:00:00 UTC
[jira] [Commented] (CASSANDRA-15690) Single partition queries can mistakenly omit partition deletions and resurrect data

    [ https://issues.apache.org/jira/browse/CASSANDRA-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074765#comment-17074765 ] 

Aleksey Yeschenko commented on CASSANDRA-15690:
-----------------------------------------------

A simple in-JVM repro illustration provided by [~samt]:

{code}
    @Test
    public void skippedSSTableWithPartitionDeletionShadowingDataOnAnotherNode() throws Throwable
    {
        try (Cluster cluster = init(Cluster.create(2)))
        {
            cluster.schemaChange("CREATE TABLE " + KEYSPACE + ".tbl (pk int, ck int, v int, PRIMARY KEY(pk, ck))");
            // insert a partition tombstone on node 1, the deletion timestamp should end up being the sstable's minTimestamp
            cluster.get(1).executeInternal("DELETE FROM " + KEYSPACE + ".tbl USING TIMESTAMP 1 WHERE pk = 0");
            // and a row from a different partition, to provide the sstable's min/max clustering
            cluster.get(1).executeInternal("INSERT INTO " + KEYSPACE + ".tbl (pk, ck, v) VALUES (1, 1, 1) USING TIMESTAMP 1");
            cluster.get(1).flush(KEYSPACE);
            // sstable 1 has minTimestamp == maxTimestamp == 1 and is skipped due to its min/max clusterings. Now we
            // insert a row which is not shadowed by the partition delete and flush to a second sstable. Importantly,
            // this sstable's minTimestamp is > than the maxTimestamp of the first sstable. This would cause the first
            // sstable not to be re-included in the merge input, but we can't really make that decision as we don't
            // know what data and/or tombstones are present on other nodes
            cluster.get(1).executeInternal("INSERT INTO " + KEYSPACE + ".tbl (pk, ck, v) VALUES (0, 6, 6) USING TIMESTAMP 2");
            cluster.get(1).flush(KEYSPACE);

            // on node 2, add a row for the deleted partition with an older timestamp than the deletion so it should be shadowed
            cluster.get(2).executeInternal("INSERT INTO " + KEYSPACE + ".tbl (pk, ck, v) VALUES (0, 10, 10) USING TIMESTAMP 0");

            Object[][] rows = cluster.coordinator(1)
                                     .execute("SELECT * FROM " + KEYSPACE + ".tbl WHERE pk=0 AND ck > 5",
                                              ConsistencyLevel.ALL);
            // we expect that the row from node 2 (0, 10, 10) was shadowed by the partition delete, but the row from
            // node 1 (0, 6, 6) was not.
            assertRows(rows, new Object[] {0, 6 ,6});
        }
    }
{code}

> Single partition queries can mistakenly omit partition deletions and resurrect data
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15690
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15690
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Aleksey Yeschenko
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>
> We have logic that allows us to exclude sstables with partition deletions that are older than the minimum collected timestamp in a local request. However, it’s possible that another node could have rows that aren’t known to the local node that are in turn older than the excluded partition deletion. In such a scenario, those will be mistakenly resurrected, which is a correctness issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org