You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Leon Zaruvinsky (Jira)" <ji...@apache.org> on 2020/11/05 02:45:00 UTC
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade

    [ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226467#comment-17226467 ] 

Leon Zaruvinsky commented on CASSANDRA-15789:
---------------------------------------------

Hey all - just wanted to call out that an errant import made it into the Cassandra 3.11 patch of this PR: https://github.com/apache/cassandra/blob/4d42c189fa82b32fd93ae42a164b91e4db62992e/src/java/org/apache/cassandra/service/StorageProxyMBean.java#L24

Not a big deal, but noticed this because we publish the MBeans/jmx api as their own custom-built package and hit a compilation error because the DatabaseDescriptor wasn't included.

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>             Fix For: 3.0.21, 3.11.7, 4.0, 4.0-beta1
>
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, collection overwrites, paging, and read repair can cause 3.X nodes to split individual rows into several rows with identical clustering. This happens due to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using timestamp 1
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes
> 4. Using the 2.2 node as the coordinator, force read repar by querying the table with page size = 2:
>  
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 5;}}
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>              ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>                     [2:v:!-2:!,   deletedAt=1, localDeletion=1588588821]
>                     [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>              [2:v:63:false:1@3,])
> {code}
> Which on 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
>   [distributed_test_keyspace.tbl] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and finish {{collectionName}} values don't match as a wrapping fragment of a legacy row deletion that's being interrupted by a collection deletion (correctly) - see [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889]. Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
>  // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
>  // of a row (which is still correct).
> {code}
> {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, and mistakenly starts a new row, while in the middle of an existing one: (see [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org