Posted to commits@cassandra.apache.org by "Jaydeepkumar Chovatia (Jira)" <ji...@apache.org> on 2022/10/31 18:57:00 UTC
[jira] [Comment Edited] (CASSANDRA-17991) Possible data inconsistency during bootstrap/decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626767#comment-17626767 ]
Jaydeepkumar Chovatia edited comment on CASSANDRA-17991 at 10/31/22 6:56 PM:
-----------------------------------------------------------------------------
[{color:#4a6ee0}Jeff Jirsa{color}|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jjirsa]{color:#0e101a} As I understand the Cassandra streaming code, SSTables are not transferred atomically, because that is not feasible. When Cassandra prepares a streaming plan and selects candidate SSTables, it does not check whether those SSTables contain overlapping partition keys, because that is not feasible either. As a result, the SSTables of a given table can arrive at the destination node at different times and in any order. The scenario I described above can then occur: the destination node may trigger compaction naturally before it has a complete view of the data for a given partition key (i.e. before it has received all the source SSTables in which that key is present).{color}
{color:#0e101a}The issue I see in production is that when a node is added, it misses a tombstone even though every replica held the tombstone before the new node joined. When I inspected the replicas before the node joined, the non-tombstone data was scattered across multiple SSTables, and one SSTable also held a tombstone marker with a later timestamp. So the theory in the description of this ticket could explain it.{color}
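To illustrate the point about the streaming plan, here is a simplified sketch (not Cassandra's actual code; the function and field names are hypothetical): candidate SSTables are chosen purely by token-range intersection, and nothing groups together SSTables that happen to share a partition key, so files holding the same key are transferred, and become live on the receiver, independently of one another.

```python
# Hypothetical sketch: SSTables are selected for streaming by token-range
# intersection only; no check groups SSTables that share partition keys.
def build_streaming_plan(sstables, requested_range):
    lo, hi = requested_range
    return [s for s in sstables
            if s["min_token"] <= hi and s["max_token"] >= lo]

sstables = [
    {"name": "SSTable1", "min_token": 0,  "max_token": 50},  # holds key 1
    {"name": "SSTable2", "min_token": 10, "max_token": 60},  # also holds key 1
    {"name": "SSTable3", "min_token": 5,  "max_token": 40},  # tombstone for key 1
]

plan = build_streaming_plan(sstables, (0, 100))
print([s["name"] for s in plan])  # all three are candidates, but each file
                                  # is transferred and applied independently
```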
> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
> Key: CASSANDRA-17991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Bootstrap and Decommission
> Reporter: Jaydeepkumar Chovatia
> Priority: Normal
>
> I am facing one corner case in which the deleted data resurrects.
> tl;dr: This could be because when we stream all the SSTables for a given token range to the new owner, then they are not sent atomically, so the new owner could do compaction on the partially received SSTables, which might remove the tombstones.
>
> Here are the reproducible steps:
> +*Prerequisite*+
> # A three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
> # The following schema:
> {code:java}
> CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
> CREATE TABLE KS1.T1 (
>     key int,
>     c1 int,
>     c2 int,
>     c3 int,
>     PRIMARY KEY (key)
> ) WITH compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>     AND gc_grace_seconds = 864000;
> {code}
>
> *Reproducible Steps*
> * Day1: Insert a new record, followed by _nodetool flush_ on n1, n2, and n3. A new SSTable ({_}SSTable1{_}) is created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
> * Day2: Insert the same record again, followed by _nodetool flush_ on n1, n2, and n3. A new SSTable ({_}SSTable2{_}) is created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
> * Day3: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> {code}
> * Day4: Delete the record followed by _nodetool flush_ on n1, n2, and n3
> {code:java}
> CONSISTENCY ALL; DELETE FROM KS1.T1 WHERE key = 1; {code}
> * Day5: Here is the data layout on SSTables on n1, n2, and n3
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-16T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable3 (Tombstone):
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z", "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
>       "cells" : [ ]
>     }
>   ]
> }
> {code}
> * Day6–Day19: Nothing happens for more than 10 days (i.e. longer than _gc_grace_seconds_). The data layout on SSTables on n1, n2, and n3 remains the same as on Day5.
> * Day20: A new node (_n4_) joins the ring and becomes responsible for key "1". Say it streams the data from _n3_. Node _n3_ is supposed to stream out SSTable1, SSTable2, and SSTable3, but the streaming algorithm does not transfer them atomically. Consider a scenario in which _n4_ has received SSTable1 and SSTable3, but not yet SSTable2, and compacts SSTable1 and SSTable3. Because the tombstone is by now older than _gc_grace_seconds_, _n4_ purges key "1", leaving no trace of it on _n4_. When SSTable2 is streamed later, it reintroduces key "1" as live data.
> * Day20: _n4_ becomes normal
> {code:java}
> Query on n4:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // 1 | 10 | 20 | 30 <-- A record is returned
> Query on n1:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // <empty> //no output{code}
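The resurrection in the steps above can be reproduced with a toy model (a sketch with deliberately simplified semantics: one entry per partition key per SSTable, last-write-wins merge, and a purge of tombstones older than gc_grace_seconds; `compact` is a hypothetical stand-in, not Cassandra's compaction code):

```python
GC_GRACE_SECONDS = 864000  # 10 days, from the table definition above

def compact(sstables, now):
    """Merge SSTables per key (last write wins), then drop tombstones
    (value None) whose deletion timestamp is older than gc_grace_seconds."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: (ts, v) for k, (ts, v) in merged.items()
            if not (v is None and now - ts > GC_GRACE_SECONDS)}

DAY = 86400
# Source node n3's layout on Day5 (timestamps in seconds since Day0):
sstable1 = {1: (1 * DAY, (10, 20, 30))}   # Day1 insert
sstable2 = {1: (2 * DAY, (10, 20, 30))}   # Day2 re-insert
sstable3 = {1: (4 * DAY, None)}           # Day4 tombstone

now = 20 * DAY  # Day20 bootstrap: 16 days after the delete, > gc_grace

# n4 has received only SSTable1 and SSTable3 so far, and compacts them:
partial = compact([sstable1, sstable3], now)
assert 1 not in partial  # tombstone shadowed the Day1 data, then was purged

# SSTable2 arrives afterwards; nothing shadows it any more:
final = compact([partial, sstable2], now)
print(final)  # {1: (172800, (10, 20, 30))} -- key 1 resurrects
```

Had SSTable2 been present in the first compaction, the tombstone would have shadowed both live copies before being purged, and the key would stay deleted.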
>
> Does this make sense?
> *Possible Solution*
> * One possible solution: do not purge tombstones while there are pending token range movements in the ring
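A minimal sketch of that idea (the `may_purge_tombstone` helper and its `pending_range_movements` flag are hypothetical; a real fix would have to consult the node's view of pending ranges, e.g. via token metadata):

```python
# Hypothetical guard: keep tombstones while any token range movement
# (bootstrap/decommission) is pending, regardless of gc_grace_seconds.
def may_purge_tombstone(age_seconds, gc_grace_seconds, pending_range_movements):
    if pending_range_movements:   # e.g. a node is joining or leaving the ring
        return False              # defer purging until the ring is stable
    return age_seconds > gc_grace_seconds

# A 16-day-old tombstone (as in the Day20 scenario above):
print(may_purge_tombstone(16 * 86400, 864000, pending_range_movements=False))  # True
print(may_purge_tombstone(16 * 86400, 864000, pending_range_movements=True))   # False
```

The trade-off is that tombstones (and the data they shadow) are retained longer during topology changes, in exchange for ruling out this resurrection window.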
--
This message was sent by Atlassian Jira
(v8.20.10#820010)