Posted to commits@cassandra.apache.org by "Alex Petrov (JIRA)" <ji...@apache.org> on 2016/05/12 19:33:12 UTC

[jira] [Comment Edited] (CASSANDRA-7592) Ownership changes can violate consistency

    [ https://issues.apache.org/jira/browse/CASSANDRA-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280680#comment-15280680 ] 

Alex Petrov edited comment on CASSANDRA-7592 at 5/12/16 7:32 PM:
-----------------------------------------------------------------

[EDITED] 

My previous comment was off, as I ran the tests incorrectly. 
I have to note that in a 3-node cluster such a situation is not possible, if I understand everything correctly. With consistent range movement, all replicas have to be alive for a node to bootstrap, otherwise we'd get the {{A node required to move the data consistently is down}} exception (so it's quite tricky to even reproduce). This doesn't really contradict the issue description, as the coordinators are said to be chosen randomly, but I wanted to point it out.

So for example:
{code}
for i in `seq 1 4`; do ccm node$i remove; done
ccm populate -n 3
ccm start
sleep 10
ccm node1 cqlsh < data.cql
# With node1 gossip disabled this won't work, as the joining node4 won't be able to bootstrap
# ccm node1 nodetool disablegossip
# sleep 10
ccm add node4 -i 127.0.0.4 -b && ccm node4 start
{code}

This would produce something like the following:
{code}
node1 
INFO  [SharedPool-Worker-1] 2016-05-12 18:28:05,915 Gossiper.java:1014 - InetAddress /127.0.0.4 is now UP
INFO  [PendingRangeCalculator:1] 2016-05-12 18:28:08,923 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`

node4
INFO  [main] 2016-05-12 18:28:38,448 StorageService.java:1460 - Starting Bootstrap
INFO  [main] 2016-05-12 18:28:38,453 TokenMetadata.java:209 - Adding token to endpoint map `577167728929728286` `/127.0.0.4`
INFO  [main] 2016-05-12 18:28:38,903 StorageService.java:2173 - Node /127.0.0.4 state jump to NORMAL
{code}

Note here that the token for the new endpoint lands on node1 as a pending token early enough, and {{StorageProxy::performWrite}} is aware of the pending tokens as well.
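
For reference, a rough sketch of that idea (my own illustration, not the actual {{StorageProxy}} code; {{RingView}} and its methods are made up): the coordinator's write targets are the natural replicas plus any pending endpoints for the token, which is why the bootstrapping node already receives writes while streaming.

{code}
// Illustrative sketch only, not the real StorageProxy.performWrite: the point is that
// write targets are the natural replicas plus the pending endpoints for the token.
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;

class WriteTargetsSketch
{
    // Hypothetical lookups standing in for TokenMetadata / replication strategy calls.
    interface RingView
    {
        List<InetAddress> naturalEndpoints(long token, String keyspace);
        List<InetAddress> pendingEndpoints(long token, String keyspace);
    }

    static List<InetAddress> writeTargets(RingView ring, long token, String keyspace)
    {
        List<InetAddress> targets = new ArrayList<>(ring.naturalEndpoints(token, keyspace));
        // A bootstrapping node (node4 above) appears here as soon as its pending token is
        // known, so node1 routes writes to it while the bootstrap is still in progress.
        targets.addAll(ring.pendingEndpoints(token, keyspace));
        return targets;
    }
}
{code}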

However, judging from the code, this situation might actually occur in a larger cluster, where the coordinator is chosen from nodes that are not part of the replica set for the streamed ranges. I'll take another look at it soon, although so far I can't see a good solution that would allow simultaneous moves. Given that simultaneous moves aren't allowed, the next move might have to be postponed for (or disallowed until) {{ring_delay}}; then we can store the {{tokenEndpointMap}} on replicas for {{ring_delay}}.

A coordinator that hasn't learned about the ring change will then contact the {{old}} nodes only. A coordinator that has learned about the ring change will send writes to both the {{old}} and {{new}} nodes. On the replica side, the {{old}} replica has to gracefully serve reads and receive writes.
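
To make the proposal a bit more concrete, here is a rough sketch (again my own illustration under the assumptions above, not existing Cassandra code; {{OwnershipChange}} and the constant are made up): the previous owners are remembered for {{ring_delay}}, and a coordinator that has seen the change writes to old and new owners for that window.

{code}
// Sketch of the proposal, not existing Cassandra code: after an ownership change the
// previous owners are remembered for ring_delay, and writes go to old + new owners.
import java.net.InetAddress;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RingDelayWindowSketch
{
    static final long RING_DELAY_MS = 30_000;   // assumed ring delay, for illustration

    static final class OwnershipChange
    {
        final List<InetAddress> oldOwners;
        final List<InetAddress> newOwners;
        final long changedAtMillis;

        OwnershipChange(List<InetAddress> oldOwners, List<InetAddress> newOwners, long changedAtMillis)
        {
            this.oldOwners = oldOwners;
            this.newOwners = newOwners;
            this.changedAtMillis = changedAtMillis;
        }

        boolean withinRingDelay(long nowMillis)
        {
            return nowMillis - changedAtMillis < RING_DELAY_MS;
        }
    }

    // A coordinator that has seen the change writes to both old and new owners until
    // ring_delay has passed; one that hasn't seen it keeps contacting the old owners,
    // which still accept writes and serve reads during the window.
    static Set<InetAddress> writeTargets(OwnershipChange change, long nowMillis)
    {
        Set<InetAddress> targets = new HashSet<>(change.newOwners);
        if (change.withinRingDelay(nowMillis))
            targets.addAll(change.oldOwners);
        return targets;
    }
}
{code}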

It might be a good idea to use {{TokenAwarePolicy}} on the client side, which could help in some cases (when the partition key is actually known).
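
For example, with the DataStax Java driver (assuming a 3.x driver; keyspace and table names below are made up), the policy is configured like this:

{code}
// Client-side only: prefer a replica of the partition key as coordinator when the key is known.
// Assumes the DataStax Java driver 3.x; keyspace and table names are made up.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareClient
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder()
                                 .addContactPoint("127.0.0.1")
                                 .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                                 .build();
        Session session = cluster.connect();

        // With a prepared statement the driver knows the partition key, so the policy
        // can route the request to a node that actually owns the token.
        PreparedStatement select = session.prepare("SELECT * FROM ks.tbl WHERE id = ?");
        session.execute(select.bind(42));

        cluster.close();
    }
}
{code}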


was (Author: ifesdjeen):
I've been able to reproduce the issue with several simple steps:

  * Populate a cluster with 3 nodes, set RF to 3 and insert some data into the cluster.
  * Disable gossip on {{node1}}.
  * Bring in {{node4}}. {{node1}} will be unaware of the ring change.
  * Run {{select *}}

I've also made a simple prototype that checks, when the coordinator sends requests to replicas, whether the requested token or range belongs to (or intersects with) the current node, and I was able to fail the coordinator request for that node (with a different replication factor the request should still succeed, since the coordinator can connect to enough replicas).
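
Roughly, the check looks like this (a paraphrased sketch with made-up names, not the actual prototype; it ignores wrap-around ranges): a replica rejects a read whose token falls outside the ranges it currently owns.

{code}
// Paraphrased sketch of the prototype's idea, not the actual patch: before serving a read,
// verify that the requested token falls into one of the ranges owned by this node.
import java.util.List;

class OwnershipCheckSketch
{
    // Simplified stand-in for Cassandra's Range<Token>; bounds are (left, right], no wrap-around.
    static final class TokenRange
    {
        final long left;
        final long right;

        TokenRange(long left, long right) { this.left = left; this.right = right; }

        boolean contains(long token) { return token > left && token <= right; }
    }

    static void validateRead(long requestedToken, List<TokenRange> locallyOwnedRanges)
    {
        for (TokenRange range : locallyOwnedRanges)
            if (range.contains(requestedToken))
                return;
        // Failing here surfaces the stale-ownership case instead of silently returning stale data.
        throw new IllegalStateException("Token " + requestedToken + " is not owned by this node");
    }
}
{code}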

The coordinator requests have to get a bit smarter: the coordinator has to send reads and mutations to the previous owner for the "ring delay".
With active announcement, my only question is the case where the coordinator cannot connect to the other nodes; then it won't receive the message either actively or passively.

> Ownership changes can violate consistency
> -----------------------------------------
>
>                 Key: CASSANDRA-7592
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7592
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Richard Low
>            Assignee: Alex Petrov
>
> CASSANDRA-2434 goes a long way to avoiding consistency violations when growing a cluster. However, there is still a window when consistency can be violated when switching ownership of a range.
> Suppose you have replication factor 3 and all reads and writes at quorum. The first part of the ring looks like this:
> Z: 0
> A: 100
> B: 200
> C: 300
> Choose two random coordinators, C1 and C2. Then you bootstrap node X at token 50.
> Consider the token range 0-50. Before bootstrap, this is stored on A, B, C. During bootstrap, writes go to X, A, B, C (and must succeed on 3) and reads choose two from A, B, C. After bootstrap, the range is on X, A, B.
> When the bootstrap completes, suppose C1 processes the ownership change at t1 and C2 at t4. Then the following can give an inconsistency:
> t1: C1 switches ownership.
> t2: C1 performs write, so sends write to X, A, B. A is busy and drops the write, but it succeeds because X and B return.
> t3: C2 performs a read. It hasn’t done the switch and chooses A and C. Neither got the write at t2 so null is returned.
> t4: C2 switches ownership.
> This could be solved by continuing writes to the old replica for some time (maybe ring delay) after the ownership changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)