Posted to commits@cassandra.apache.org by "Jaydeepkumar Chovatia (Jira)" <ji...@apache.org> on 2022/10/31 18:57:00 UTC
[jira] [Comment Edited] (CASSANDRA-17991) Possible data inconsistency during bootstrap/decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626767#comment-17626767 ]
Jaydeepkumar Chovatia edited comment on CASSANDRA-17991 at 10/31/22 6:56 PM:
-----------------------------------------------------------------------------
[{color:#4a6ee0}Jeff Jirsa{color}|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jjirsa]{color:#0e101a} As I understand the Cassandra streaming code, SSTables are not transferred atomically, because that is not feasible. When Cassandra prepares a streaming plan and selects candidate SSTables, it does not check whether those SSTables contain overlapping partition keys, because that is not feasible either. As a result, the SSTables of a given table can arrive at the destination node at different times and in any order. The scenario I described above can then occur: the destination node may trigger compaction naturally before it has a complete view of the data for a given partition key (i.e. before it has received all the source SSTables in which that key is present).{color}
{color:#0e101a}The issue I see in production is that when a node is added, it misses a tombstone even though every replica held the tombstone before the new node joined. When I inspected the replicas before the node joined, the non-tombstone data was scattered across multiple SSTables, and one SSTable also held a tombstone marker with a later timestamp. So the theory in the description of this ticket could explain it.{color}
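To illustrate the point about the streaming plan, here is a simplified sketch (not Cassandra's actual code; the function and field names are hypothetical): candidate SSTables are chosen purely by token-range intersection, and nothing groups together SSTables that happen to share a partition key, so files holding the same key are transferred, and become live on the receiver, independently of one another.

```python
# Hypothetical sketch: SSTables are selected for streaming by token-range
# intersection only; no check groups SSTables that share partition keys.
def build_streaming_plan(sstables, requested_range):
    lo, hi = requested_range
    return [s for s in sstables
            if s["min_token"] <= hi and s["max_token"] >= lo]

sstables = [
    {"name": "SSTable1", "min_token": 0,  "max_token": 50},  # holds key 1
    {"name": "SSTable2", "min_token": 10, "max_token": 60},  # also holds key 1
    {"name": "SSTable3", "min_token": 5,  "max_token": 40},  # tombstone for key 1
]

plan = build_streaming_plan(sstables, (0, 100))
print([s["name"] for s in plan])  # all three are candidates, but each file
                                  # is transferred and applied independently
```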
> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
> Key: CASSANDRA-17991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jaydeepkumar Chovatia
> Priority: Normal
>
> I am facing one corner case in which the deleted data resurrects.
> tl;dr: This could be because when we stream all the SSTables for a given token range to the new owner, then they are not sent atomically, so the new owner could do compaction on the partially received SSTables, which might remove the tombstones.
>
> Here are the reproducible steps:
> +*Prerequisite*+
> # A three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
> # The following schema:
> {code:java}
> CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
> CREATE TABLE KS1.T1 (
>     key int,
>     c1 int,
>     c2 int,
>     c3 int,
>     PRIMARY KEY (key)
> ) WITH compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>     AND gc_grace_seconds = 864000;
> {code}
>
> *Reproducible Steps*
> * Day1: Insert a new record, followed by _nodetool flush_ on n1, n2, and n3. A new SSTable ({_}SSTable1{_}) is created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
> * Day2: Insert the same record again, followed by _nodetool flush_ on n1, n2, and n3. A new SSTable ({_}SSTable2{_}) is created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
> * Day3: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> {code}
> * Day4: Delete the record followed by _nodetool flush_ on n1, n2, and n3
> {code:java}
> CONSISTENCY ALL; DELETE FROM KS1.T1 WHERE key = 1; {code}
> * Day5: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable3 (Tombstone):
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z", "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
>       "cells" : [ ]
>     }
>   ]
> }
> {code}
> * Day6–Day19: Nothing happens for more than 10 days (i.e. longer than _gc_grace_seconds_). The data layout on SSTables on n1, n2, and n3 remains the same as on Day5.
> * Day20: A new node (_n4_) joins the ring and becomes responsible for key "1". Say it streams the data from _n3_. Node _n3_ is supposed to stream out SSTable1, SSTable2, and SSTable3, but the streaming algorithm does not transfer them atomically. Consider a scenario in which _n4_ has received SSTable1 and SSTable3, but not yet SSTable2, and compacts SSTable1 and SSTable3. Because the tombstone is by now older than _gc_grace_seconds_, _n4_ purges key "1", leaving no trace of it on _n4_. When SSTable2 is streamed later, it reintroduces key "1" as live data.
> * Day20: _n4_ becomes normal
> {code:java}
> Query on n4:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // 1 | 10 | 20 | 30 <-- A record is returned
> Query on n1:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // <empty> //no output{code}
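The resurrection in the steps above can be reproduced with a toy model (a sketch with deliberately simplified semantics: one entry per partition key per SSTable, last-write-wins merge, and a purge of tombstones older than gc_grace_seconds; `compact` is a hypothetical stand-in, not Cassandra's compaction code):

```python
GC_GRACE_SECONDS = 864000  # 10 days, from the table definition above

def compact(sstables, now):
    """Merge SSTables per key (last write wins), then drop tombstones
    (value None) whose deletion timestamp is older than gc_grace_seconds."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

DAY = 86400
# Source node n3's layout on Day5 (timestamps in seconds since Day0):
sstable1 = {1: (1 * DAY, (10, 20, 30))}   # Day1 insert
sstable2 = {1: (2 * DAY, (10, 20, 30))}   # Day2 re-insert
sstable3 = {1: (4 * DAY, None)}           # Day4 tombstone

now = 20 * DAY  # Day20 bootstrap: 16 days after the delete, > gc_grace

# n4 has received only SSTable1 and SSTable3 so far, and compacts them:
partial = compact([sstable1, sstable3], now)
assert 1 not in partial  # tombstone shadowed the Day1 data, then was purged

# SSTable2 arrives afterwards; nothing shadows it any more:
final = compact([partial, sstable2], now)
print(final)  # {1: (172800, (10, 20, 30))} -- key 1 resurrects
```

Had SSTable2 been present in the first compaction, the tombstone would have shadowed both live copies before being purged, and the key would stay deleted.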
>
> Does this make sense?
> *Possible Solution*
> * One possible solution: do not purge tombstones while there are pending token range movements in the ring
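A minimal sketch of that idea (the `may_purge_tombstone` helper and its `pending_range_movements` flag are hypothetical; a real fix would have to consult the node's view of pending ranges, e.g. via token metadata):

```python
# Hypothetical guard: keep tombstones while any token range movement
# (bootstrap/decommission) is pending, regardless of gc_grace_seconds.
def may_purge_tombstone(age_seconds, gc_grace_seconds, pending_range_movements):
    if pending_range_movements:   # e.g. a node is joining or leaving the ring
        return False              # defer purging until the ring is stable
    return age_seconds > gc_grace_seconds

# A 16-day-old tombstone (as in the Day20 scenario above):
print(may_purge_tombstone(16 * 86400, 864000, pending_range_movements=False))  # True
print(may_purge_tombstone(16 * 86400, 864000, pending_range_movements=True))   # False
```

The trade-off is that tombstones (and the data they shadow) are retained longer during topology changes, in exchange for ruling out this resurrection window.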
--
This message was sent by Atlassian Jira
(v8.20.10#820010)