Posted to commits@cassandra.apache.org by "Leon Zaruvinsky (Jira)" <ji...@apache.org> on 2019/09/16 22:55:00 UTC

[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

    [ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930940#comment-16930940 ] 

Leon Zaruvinsky commented on CASSANDRA-15327:
---------------------------------------------

h2. Reproduction (with Bootstrap)
h3. Cluster setup

Apache Cassandra 2.2.14, 3 nodes with SimpleSeedProvider.

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  101.68 KB  512     66.7%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  101.64 KB  512     67.6%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  101.69 KB  512     65.7%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
h3. Seed Data

We create a table {{datadump.ddt}} using {{LeveledCompactionStrategy}}, with {{RF = 3}} and {{gc_grace_seconds = 1}}. We then write 20,000 rows at {{QUORUM}} in the format (row, value), where row is a partition key in [0, 20000) and value is a one-megabyte blob.
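
The write tooling itself isn't captured in this comment; a minimal sketch of the statements involved, assuming cqlsh-style inserts with {{0x00}} standing in for the 1 MB blob:
{code:java}
cqlsh> CONSISTENCY QUORUM;
cqlsh> -- one INSERT per partition key in [0, 20000)
cqlsh> INSERT INTO datadump.ddt (row, value) VALUES (0, 0x00);
{code}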

 
{code:java}
cqlsh> describe datadump;

CREATE KEYSPACE datadump WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE datadump.ddt (
 row bigint,
 col bigint,
 value blob static,
 PRIMARY KEY (row, col)
) WITH CLUSTERING ORDER BY (col ASC)
 AND bloom_filter_fp_chance = 0.1
 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
 AND comment = ''
 AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
 AND compression = {}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 1
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT COUNT(row) from datadump.ddt;

system.count(row)

20000
 
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
h3. Bootstrap + Delete

We bootstrap a fourth node into the cluster, and as soon as it begins to stream data we delete, at {{ALL}}, the 20,000 rows that we inserted earlier. Because writes are also sent to pending ranges, the deletes reach the bootstrapping node immediately. As soon as the delete queries complete, we run a flush and major compaction on the bootstrapping node to purge the tombstones (with {{gc_grace_seconds = 1}}, they are already past their grace period).
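
The delete statements aren't captured here either; roughly, and again assuming per-partition deletes run from cqlsh:
{code:java}
cqlsh> CONSISTENCY ALL;
cqlsh> -- one DELETE per partition key written above
cqlsh> DELETE FROM datadump.ddt WHERE row = 0;
{code}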

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UJ  x.x.x.129  14.43 KB  512     ?                 8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
// First, we trigger deletes on one of the original three nodes.
// Then, we run the below on the bootstrapping node
$ nodetool flush && nodetool compact
$ nodetool compactionhistory
{code}
 


Compaction history on the bootstrapping node. Note that the {{ddt}} compaction wrote 0 bytes out: the tombstones it had received from the deletes were already past {{gc_grace_seconds}} and were purged:
{code:java}
id                                    keyspace_name  columnfamily_name  compacted_at   bytes_in  bytes_out  rows_merged
b63e7e30-d00a-11e9-a7b9-f18b3a65a899  datadump       ddt                1567708003731  366768    0          {}
7e2253a0-d00a-11e9-a7b9-f18b3a65a899  system         local              1567707909594  10705     10500      {4:1}
{code}
h3. Final State

Once the bootstrap has completed, the cluster looks like this:
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.54 GB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.54 GB  512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  19.54 GB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
We run a flush and major compaction on every node. The original three nodes drop everything, but the bootstrapped node holds onto nearly 75% of the data.
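A rough sketch of that step; we ran the commands per node, so the ssh loop and the explicit keyspace argument below are only illustrative:
{code:java}
# flush memtables and run a major compaction of the datadump keyspace on all four nodes
for host in x.x.x.129 x.x.x.180 x.x.x.217 x.x.x.142; do
  ssh "$host" "nodetool flush datadump && nodetool compact datadump"
done
{code}
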
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB   512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  143.39 KB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  128.6 KB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  149.97 KB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}
If we run the count query again, we see the data get read-repaired back into the other nodes.

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  9.5 GB    512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  9.58 GB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  9.5 GB    512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
cqlsh> SELECT COUNT(row) from datadump.ddt;

system.count(row)

15282

(1 rows)
{code}
 

 

> Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Leon Zaruvinsky
>            Priority: Normal
>         Attachments: CASSANDRA-15327-2.1.txt
>
>
> Hey,
> We've come across a scenario in production (noticed on Cassandra 2.2.14) where data that is deleted from Cassandra at consistency {{ALL}} can be resurrected.  I've added a reproduction in a comment.
> If a {{delete}} is issued during a range movement (i.e. bootstrap, decommission, move), and {{gc_grace_seconds}} is surpassed before the stream is finished, then the tombstones from the {{delete}} can be purged from the recipient node before the data is streamed. Once the move is complete, the data now exists on the recipient node without a tombstone.
>  
> We noticed this because our bootstrapping time occasionally exceeds our configured gc_grace_seconds, so we lose the consistency guarantee.  As an operator, it would be great to not have to worry about this edge case.
> I've attached a patch that we have tested and successfully used in production, and haven't noticed any ill effects.  


