Posted to commits@cassandra.apache.org by "Yifan Cai (Jira)" <ji...@apache.org> on 2021/04/06 21:55:00 UTC

[jira] [Updated] (CASSANDRA-16545) Cluster topology change may produce false unavailable for queries

     [ https://issues.apache.org/jira/browse/CASSANDRA-16545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yifan Cai updated CASSANDRA-16545:
----------------------------------
    Test and Documentation Plan: unit test; ci
                         Status: Patch Available  (was: Open)

PR: https://github.com/apache/cassandra/pull/954
CI: https://app.circleci.com/pipelines/github/yifan-c/cassandra?branch=CASSANDRA-16545%2Ftrunk

The patch is largely a refactor that passes the same {{ReplicationStrategy}} object to the code that builds the replicaLayout and the replicaPlan and to the CL liveness validation. 
A test is added (in the [first commit|https://github.com/apache/cassandra/pull/954/commits/8d921c5d311c6e97d1f757af64a2e65a84b419ef]) to prove that the false unavailable can be thrown when creating the replicaPlan.
The second commit ("Use the same RS object during ReplicaPlan creation") makes sure the same RS object is used for peer selection and the CL liveness check, to avoid the race. 
However, the {{blockFor}} calculation can still use a different RS object, so the coordinator may block for a different condition than the one it originally calculated. The remaining 2 commits address this problem. 

The highlights of the patch:
* ReplicaLayout and ReplicaPlan now keep a reference to the replication strategy snapshot. The snapshot is used for peer selection, liveness validation and the blockFor calculation. 
* The use of Keyspace to validate CL liveness is fully eliminated to avoid a potential race. The validation uses the replication strategy instead. 
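
To illustrate the race and the snapshot idea, here is a minimal, self-contained sketch. It is a toy model, not the code in the patch: the classes {{Strategy}}, {{Keyspace}} and {{UnavailableException}} and the methods {{planRacy}} and {{planWithSnapshot}} are hypothetical stand-ins, and the {{betweenReads}} hook only simulates a topology change landing between the two reads.

```java
import java.util.List;

public class ReplicationSnapshotSketch {

    /** Toy, immutable view of replication state (hypothetical; not Cassandra's class). */
    static final class Strategy {
        final List<String> liveReplicas;
        final int blockFor;

        Strategy(List<String> liveReplicas, int blockFor) {
            this.liveReplicas = List.copyOf(liveReplicas);
            this.blockFor = blockFor;
        }
    }

    /** Toy keyspace whose strategy reference is swapped when the topology changes. */
    static final class Keyspace {
        private volatile Strategy strategy;

        Keyspace(Strategy initial) { this.strategy = initial; }
        Strategy getStrategy() { return strategy; }
        void setStrategy(Strategy updated) { this.strategy = updated; }
    }

    static final class UnavailableException extends RuntimeException {
        UnavailableException(int required, int alive) {
            super("required " + required + " but only " + alive + " alive");
        }
    }

    /** Racy variant: reads the volatile field twice, so the two steps may see different objects. */
    static List<String> planRacy(Keyspace ks, Runnable betweenReads) {
        List<String> contacts = ks.getStrategy().liveReplicas; // first read: peer selection
        betweenReads.run();                                    // simulated topology change
        int required = ks.getStrategy().blockFor;              // second read: liveness check
        if (contacts.size() < required)
            throw new UnavailableException(required, contacts.size());
        return contacts;
    }

    /** Snapshot variant: reads the field once and uses that object for both steps. */
    static List<String> planWithSnapshot(Keyspace ks, Runnable betweenReads) {
        Strategy snapshot = ks.getStrategy();                  // single read
        List<String> contacts = snapshot.liveReplicas;
        betweenReads.run();                                    // change does not affect the snapshot
        int required = snapshot.blockFor;
        if (contacts.size() < required)
            throw new UnavailableException(required, contacts.size());
        return contacts;
    }

    public static void main(String[] args) {
        // Each view on its own can satisfy the requirement; only mixing them fails.
        Strategy before = new Strategy(List.of("n1", "n2"), 2);
        Strategy after = new Strategy(List.of("n1", "n2", "n3"), 3);

        Keyspace racyKs = new Keyspace(before);
        try {
            planRacy(racyKs, () -> racyKs.setStrategy(after));
            System.out.println("racy: ok");
        } catch (UnavailableException e) {
            System.out.println("racy: false unavailable (" + e.getMessage() + ")");
        }

        Keyspace snapKs = new Keyspace(before);
        System.out.println("snapshot: " + planWithSnapshot(snapKs, () -> snapKs.setStrategy(after)));
    }
}
```

In this toy model the racy variant mixes the contacts chosen from the old view with the requirement from the new view, so it reports unavailable even though either view alone would pass. The snapshot variant evaluates the query entirely against the view it started with.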

cc: [~aleksey][~cnlwsu]

> Cluster topology change may produce false unavailable for queries
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-16545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16545
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Yifan Cai
>            Assignee: Yifan Cai
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the coordinator processes a query, it first gets the {{ReplicationStrategy}} (RS) from the keyspace to decide which peers to contact. It then gets the RS again to perform the liveness check for the requested CL. 
> The RS is a volatile field in Keyspace, and those 2 getter calls can return different RS values in the presence of cluster topology changes, e.g. when a node is added. 
> In such a scenario, the check at the second step can throw an unexpected unavailable, even though, from the perspective of the query, the cluster can satisfy the CL. 
> We should use a consistent view of the RS during peer selection and the CL liveness check. In other words, both steps should reference the same RS object. This is also clearer and easier for clients to reason about: such queries are treated as made before the topology change. 


