Posted to commits@cassandra.apache.org by "Jaydeepkumar Chovatia (Jira)" <ji...@apache.org> on 2022/10/25 23:48:00 UTC

[jira] [Updated] (CASSANDRA-17991) Possible data inconsistency during bootstrap/decommission

     [ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaydeepkumar Chovatia updated CASSANDRA-17991:
----------------------------------------------
    Description: 
I am facing a corner case in which deleted data resurrects.

tl;dr: When we stream all the SSTables for a given token range to the new owner, they are not sent atomically. The new owner can therefore compact the partially received SSTables and, if gc_grace_seconds has already elapsed, purge the tombstones before the matching data arrives.
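For context, compaction decides whether it may drop a tombstone roughly like this (a minimal Python sketch of the rule, not Cassandra's actual Java code; the overlaps_other_sstables parameter is my simplification of its real overlap check):
{code:python}
GC_GRACE_SECONDS = 864_000  # 10 days, matching the table definition below

def tombstone_purgeable(local_delete_time: int, now: int,
                        overlaps_other_sstables: bool) -> bool:
    # A tombstone may be dropped once gc_grace_seconds have elapsed,
    # provided no SSTable outside this compaction still covers the
    # partition. The bug described below exists because an SSTable
    # that has not been streamed yet can never show up in that check.
    return (local_delete_time < now - GC_GRACE_SECONDS
            and not overlaps_other_sstables)
{code}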
 
Here are the reproducible steps:

+*Prerequisite*+
 # A three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
 # The following schema:
{code:java}
CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};

CREATE TABLE KS1.T1 (
    key int,
    c1 int,
    c2 int,
    c3 int,
    PRIMARY KEY (key)
) WITH compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
 AND gc_grace_seconds = 864000;
{code}
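One detail worth keeping in mind for the timeline below: gc_grace_seconds = 864000 works out to exactly 10 days, so a tombstone written on Day4 becomes purgeable from Day14 onward:
{code:python}
gc_grace_seconds = 864_000
print(gc_grace_seconds / 86_400)  # 10.0 -> the Day4 tombstone is
                                  # past gc_grace well before Day20
{code}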

 

*Reproducible Steps*
 * Day1: Insert a new record
{code:java}
INSERT INTO KS1.T1 (key, c1, c2, c3) VALUES (1, 10, 20, 30);{code}

 * Day2: Other records are inserted into this table, and it goes through multiple compactions
 * Day3: Here is the data layout on SSTables on n1, n2, and n3 

{code:java}
SSTable1:
{
    "partition" : {
      "key" : [ "1" ],
      "position" : 900
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 10,
        "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
        "cells" : [
          { "name" : "c1", "value" : 10 },
          { "name" : "c2", "value" : 20 },
          { "name" : "c3", "value" : 30 }
        ]
      }
    ]
}
.....
SSTable2:
{
    "partition" : {
      "key" : [ "1" ],
      "position" : 900
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 10,
        "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
        "cells" : [
          { "name" : "c1", "value" : 10 },
          { "name" : "c2", "value" : 20 },
          { "name" : "c3", "value" : 30 }
        ]
      }
    ]
}
{code}
 * Day4: Delete the record
{code:java}
CONSISTENCY ALL; DELETE FROM KS1.T1 WHERE key = 1; {code}

 * Day5: Here is the data layout on SSTables on n1, n2, and n3 

{code:java}
SSTable1:
{
    "partition" : {
      "key" : [ "1" ],
      "position" : 900
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 10,
        "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
        "cells" : [
          { "name" : "c1", "value" : 10 },
          { "name" : "c2", "value" : 20 },
          { "name" : "c3", "value" : 30 }
        ]
      }
    ]
}
.....
SSTable2:
{
    "partition" : {
      "key" : [ "1" ],
      "position" : 900
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 10,
        "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
        "cells" : [
          { "name" : "c1", "value" : 10 },
          { "name" : "c2", "value" : 20 },
          { "name" : "c3", "value" : 30 }
        ]
      }
    ]
}
.....
SSTable3 (Tombstone):
{
    "partition" : {
      "key" : [ "1" ],
      "position" : 900
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 10,
        "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z", "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
        "cells" : [ ]
      }
    ]
}
{code}
 * Day20: Nothing has happened for more than 10 days, so gc_grace_seconds (864000 seconds = 10 days) has elapsed since the Day4 deletion and the tombstone is now purgeable. Assume the SSTable layout on n1, n2, and n3 is still the same as on Day5
 * Day20: A new node (n4) joins the ring and becomes responsible for key "1". Say it streams the data from _n3_. _n3_ is supposed to stream out SSTable1, SSTable2, and SSTable3, but per the streaming algorithm this does not happen atomically. Consider a scenario in which _n4_ has received SSTable1 and SSTable3 but not yet SSTable2, and _n4_ compacts SSTable1 and SSTable3. Because gc_grace_seconds has elapsed and the not-yet-received SSTable2 cannot take part in any overlap check, _n4_ purges key "1" in that compaction, leaving no trace of it on _n4_. When SSTable2 is streamed afterwards, it brings key "1" back as a live row (see the sketch after this list).
 * Day20: _n4_ becomes normal 
{code:java}
Query on n4:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// 1 | 10 | 20 | 30 <-- A record is returned

Query on n1:
$> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
// <empty> (no output){code}
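The race is easy to replay outside Cassandra. Below is a small, self-contained Python sketch (my own illustration, not Cassandra code) that models each SSTable as a list of (key, write_time, is_tombstone) records and replays the timeline above: compacting SSTable1 + SSTable3 on _n4_ after gc_grace purges key "1", and the late arrival of SSTable2 then resurrects it.
{code:python}
import datetime as dt

GC_GRACE = dt.timedelta(seconds=864_000)  # 10 days, from the table definition

# Model an SSTable as a list of (key, write_time, is_tombstone) records.
DAY1 = dt.datetime(2022, 10, 15)   # row written
DAY4 = dt.datetime(2022, 10, 19)   # row deleted
DAY20 = dt.datetime(2022, 11, 3)   # n4 bootstraps

sstable1 = [(1, DAY1, False)]  # live row, streamed to n4 first
sstable2 = [(1, DAY1, False)]  # live row, streamed to n4 last
sstable3 = [(1, DAY4, True)]   # tombstone, streamed to n4 first

def compact(sstables, now):
    # Keep only the newest record per key, then drop tombstones older
    # than gc_grace. Deliberately no overlap check: SSTables that have
    # not been streamed yet cannot take part in one anyway.
    newest = {}
    for table in sstables:
        for key, ts, dead in table:
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, dead)
    return [(k, ts, dead) for k, (ts, dead) in newest.items()
            if not (dead and ts < now - GC_GRACE)]

# Day20 on n4: SSTable1 and SSTable3 have arrived, SSTable2 has not.
local = compact([sstable1, sstable3], now=DAY20)
print(local)  # [] -- the tombstone shadowed the row, then got purged

# SSTable2 arrives later; nothing shadows the old row any more.
local = compact([local, sstable2], now=DAY20)
print(local)  # [(1, DAY1, False)] -- the deleted row is back
{code}
Had SSTable2 and SSTable3 landed in the same compaction, the tombstone would have shadowed the row before being purged; the inconsistency comes purely from the non-atomic arrival order.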

 

Does this make sense?

*Possible Solution*
 * One possible solution is to not purge tombstones while there are token range movements (pending ranges) in the ring; a rough sketch follows
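As a rough illustration of that idea (a hypothetical sketch under the assumption that a node can cheaply ask whether pending ranges exist, not a proposed patch): the purge horizon could be pushed back to "never" whenever token ranges are moving, so compactions keep tombstones until the ring is stable.
{code:python}
import datetime as dt

GC_GRACE = dt.timedelta(seconds=864_000)

def purge_horizon(now: dt.datetime, has_pending_ranges: bool) -> dt.datetime:
    # Tombstones older than the returned horizon may be purged.
    # While token ranges are moving (bootstrap/decommission), return
    # the beginning of time so that nothing qualifies for purging.
    if has_pending_ranges:
        return dt.datetime.min
    return now - GC_GRACE
{code}
Plugging such a horizon into the compact() sketch above (purge only when the tombstone timestamp is older than purge_horizon(...)) would make the Day20 compaction on _n4_ keep the tombstone, so the late-arriving SSTable2 stays shadowed.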


> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-17991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
