Posted to commits@cassandra.apache.org by "Aleksey Yeschenko (Jira)" <ji...@apache.org> on 2020/05/05 16:53:00 UTC

[jira] [Comment Edited] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

    [ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100070#comment-17100070 ] 

Aleksey Yeschenko edited comment on CASSANDRA-15789 at 5/5/20, 4:52 PM:
------------------------------------------------------------------------

{{nodetool scrub}} can fix the sstables by collapsing rows with the same clustering into one, via the logic added in CASSANDRA-12144 to address a similar corruption.
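For example, run scrub on each node that holds affected sstables (a sketch, assuming the keyspace and table from the repro below and default scrub options):
{code}
nodetool scrub distributed_test_keyspace tbl
{code}
Conceptually, the collapse amounts to merging adjacent rows that share a clustering, with the newest cell winning per column. The toy sketch below (illustrative only, not the actual CASSANDRA-12144 code) applies that idea to the split ck=2 row from the repro:
{code}
import java.util.*;

// Illustrative only: a toy model of collapsing rows with duplicate clusterings.
public class DuplicateRowCollapseSketch
{
    record Cell(long timestamp, String value) {}
    record Row(int clustering, Map<String, Cell> cells) {}

    // Merge adjacent rows with equal clustering; on a column clash the higher timestamp wins.
    static List<Row> collapse(List<Row> sortedRows)
    {
        List<Row> out = new ArrayList<>();
        for (Row row : sortedRows)
        {
            Row last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.clustering() == row.clustering())
                row.cells().forEach((column, cell) ->
                    last.cells().merge(column, cell, (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
            else
                out.add(new Row(row.clustering(), new HashMap<>(row.cells())));
        }
        return out;
    }

    public static void main(String[] args)
    {
        List<Row> rows = List.of(
            new Row(1, Map.of("v['e']", new Cell(3, "f"))),
            new Row(2, Map.of("v['g']", new Cell(3, "h"))),  // first half of the split ck=2 row
            new Row(2, Map.of("v['k']", new Cell(5, "l"))),  // second half of the split ck=2 row
            new Row(3, Map.of("v['i']", new Cell(3, "j"))));
        // Prints a single ck=2 row carrying both map entries.
        collapse(rows).forEach(r -> System.out.println(r.clustering() + " -> " + r.cells()));
    }
}
{code}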


was (Author: iamaleksey):
{{nodetool scrub}} fixes the issue by collapsing rows with the same clustering into one, via the logic added in CASSANDRA-12144 to address a similar corruption.

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, collection overwrites, paging, and read repair can cause 3.X nodes to split individual rows into several rows with identical clustering. This happens due to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major-version cluster with the following table:
> {code}
> CREATE TABLE distributed_test_keyspace.tbl (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repair by querying the table with page size = 2 (for example via cqlsh's {{PAGING}} command, sketched after the query below):
>  
> {code}
> SELECT * FROM tbl;
> {code}
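> For example, from cqlsh the page size can be set before issuing the query (a sketch; any client that lets you set a fetch size of 2 works just as well):
> {code}
> PAGING 2;
> SELECT * FROM tbl;
> {code}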
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tbl ;
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>              ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>                     [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
>                     [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>              [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and finish {{collectionName}} values don't match as a wrapping fragment of a legacy row deletion that has been interrupted by a collection deletion - see [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889]. Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
>  // of a row (which is still correct).
> {code}
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such a tombstone while already in the middle of a row - having seen {{v[c]=d}} first - and mistakenly starts a new row with the same clustering (see [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).


