Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2016/03/25 01:26:25 UTC

[jira] [Commented] (CASSANDRA-9244) replace_address is not topology-aware

    [ https://issues.apache.org/jira/browse/CASSANDRA-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211202#comment-15211202 ] 

Paulo Motta commented on CASSANDRA-9244:
----------------------------------------

What we want here is to restore the replica count on other nodes during a node replacement, since replica placement elsewhere in the ring may change when the new node is introduced.

For example, imagine nodes A, B, C in racks R1, R1 and R2 respectively, with a rack-aware replication strategy and replication factor 2. A token X that maps to node A must be replicated to A (R1) and C (R2), since the second replica goes to the first node in a distinct rack. If node A is replaced with node D from rack R2, the same token X should now be replicated to D (R2) and B (R1). So the secondary replica placement for node A's range changed from node C to node B, and thus node C should stream part of its data to B while D is replacing A.
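
To make the rack-aware placement above concrete, here is a minimal self-contained sketch (plain Java with made-up {{Node}}/{{replicasFor}} names, not Cassandra's actual {{NetworkTopologyStrategy}}) that walks the ring clockwise from a token's primary owner and prefers racks not yet used, assuming replication factor 2 as in the example:

{code:java}
import java.util.*;

// Simplified illustration of rack-aware replica placement: walk the ring
// clockwise from the token's primary owner, skipping nodes whose rack is
// already represented, until we have 'rf' replicas.
public class RackAwarePlacementSketch
{
    static class Node
    {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "(" + rack + ")"; }
    }

    // 'ring' is ordered by token; 'primaryIndex' is the node owning the token.
    static List<Node> replicasFor(List<Node> ring, int primaryIndex, int rf)
    {
        List<Node> replicas = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        for (int i = 0; i < ring.size() && replicas.size() < rf; i++)
        {
            Node candidate = ring.get((primaryIndex + i) % ring.size());
            if (usedRacks.add(candidate.rack)) // only take unseen racks
                replicas.add(candidate);
        }
        // (A real strategy falls back to reusing racks when there are fewer
        // racks than rf; omitted here for brevity.)
        return replicas;
    }

    public static void main(String[] args)
    {
        Node a = new Node("A", "R1"), b = new Node("B", "R1"), c = new Node("C", "R2");
        // Token X maps to A: replicas are A(R1) and C(R2); B is skipped (same rack as A).
        System.out.println(replicasFor(Arrays.asList(a, b, c), 0, 2)); // [A(R1), C(R2)]

        // Replace A with D from rack R2: replicas become D(R2) and B(R1),
        // so the secondary replica moves from C to B.
        Node d = new Node("D", "R2");
        System.out.println(replicasFor(Arrays.asList(d, b, c), 0, 2)); // [D(R2), B(R1)]
    }
}
{code}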

To solve this we basically need to tell the other nodes that we are replacing and ask them to restore their replica counts, similar to what is done when {{nodetool removenode}} is called. A simple way to do this would be to add a {{RESTORE_REPLICA_COUNT}} message that the replacing node would trigger on the nodes whose replicas are affected by the replacement. While this would solve part of the problem, ongoing writes during the replacement process would still not be redirected to the new replicas (a la CASSANDRA-8523), reinforcing the need for a repair after a replace.
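
As a very rough illustration of the idea (hypothetical helper names, not the actual message or its plumbing through {{MessagingService}}), the replacing side could diff the replica sets for each affected range before and after the replacement and derive who must stream to whom; in the example above this yields C streaming to B:

{code:java}
import java.util.*;

// Hypothetical sketch: given the replica sets for a range before and after
// a replacement, figure out which former replica must stream to which new
// one. The replacement node itself is excluded because its data arrives
// via the regular replace streaming.
public class ReplicaDiffSketch
{
    static Map<String, String> restoreTransfers(List<String> before, List<String> after,
                                                String replaced, String replacement)
    {
        List<String> lost = new ArrayList<>(before);
        lost.removeAll(after);       // replicas that dropped out of the set
        lost.remove(replaced);       // the dead node cannot stream anything
        List<String> gained = new ArrayList<>(after);
        gained.removeAll(before);    // replicas that newly joined the set
        gained.remove(replacement);  // handled by the replace itself
        Map<String, String> transfers = new LinkedHashMap<>(); // source -> destination
        for (int i = 0; i < gained.size() && i < lost.size(); i++)
            transfers.put(lost.get(i), gained.get(i));
        return transfers;
    }

    public static void main(String[] args)
    {
        // Token X was replicated on {A, C}; after D replaces A it should
        // live on {D, B}. The extra transfer needed is C -> B.
        System.out.println(restoreTransfers(Arrays.asList("A", "C"),
                                            Arrays.asList("D", "B"),
                                            "A", "D")); // {C=B}
    }
}
{code}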

I implemented an initial solution (available on this [branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:9244]) that reworks the replace process by adding a new gossip state, BOOT_REPLACE, which is similar to BOOT but calculates pending ranges and restores replica counts on other nodes during the replace, solving both this issue and CASSANDRA-8523.

While the solution works beautifully when {{replace_address != broadcast_address}} (which I stupidly assumed would be the only case, given my previous EC2 background), a problem arises when {{replace_address == broadcast_address}}: other nodes think the old node is back in town and start sending reads to it, since the node is a natural endpoint (I'd probably have figured this out earlier if I had talked to [~brandon.williams] first, but well).

Two ways I can think of to solve this would be:
1. keep the node as a natural endpoint but do not send reads to it, by checking whether it is in {{NORMAL}} state before contacting it, e.g. in the {{StorageService.getLiveNaturalEndpoints}} method (see the sketch after this list).
2. remove the node from the natural endpoint list, but then we would need to restore the ring if the replace fails, deal with hints, and clients would probably get confused, etc.
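
For option 1, the filtering could look something like this (a sketch against a simplified interface with hypothetical predicates, not the real {{StorageService}}/failure detector API):

{code:java}
import java.net.InetAddress;
import java.util.*;
import java.util.function.Predicate;

// Option 1, sketched: keep the replaced address in the natural endpoint
// list, but filter out endpoints that are not in NORMAL state before
// routing reads to them.
public class LiveNaturalEndpointsSketch
{
    // 'isAlive' and 'isNormal' are stand-ins for the failure detector and
    // the endpoint's gossip STATUS; the real code would query those directly.
    static List<InetAddress> getLiveNaturalEndpoints(List<InetAddress> naturalEndpoints,
                                                     Predicate<InetAddress> isAlive,
                                                     Predicate<InetAddress> isNormal)
    {
        List<InetAddress> live = new ArrayList<>();
        for (InetAddress endpoint : naturalEndpoints)
        {
            // The isNormal check is the proposed change: a node being
            // replaced at the same address fails it and is never chosen
            // as a read target, even though it stays in the ring.
            if (isAlive.test(endpoint) && isNormal.test(endpoint))
                live.add(endpoint);
        }
        return live;
    }
}
{code}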

I'm leaning more towards 1, but both options still look a bit like workarounds to me: error-prone and probably not ideal.

After this I realized that all of this would be much easier to deal with if we indexed nodes in {{TokenMetadata}} by {{UUID}} rather than by {{InetAddress}}, as we currently do, probably for historical reasons. This would allow us, for instance, to have a DOWN natural endpoint with {{UUID=X}} and an UP pending/replacing endpoint with {{UUID=Y}}, even though they share the same {{InetAddress}} and tokens. However, for this to be done right we would probably need to treat endpoints by UUID in other places as well (including the failure detector, gossiper, {{StorageProxy}}, etc.) and deal with {{InetAddress}} mostly in the messaging service and similar network entry points.
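
As a toy model of the UUID-keyed approach (all names made up; the real {{TokenMetadata}} tracks far more state), the key property is that two host IDs can coexist while sharing an address:

{code:java}
import java.net.InetAddress;
import java.util.*;

// Toy model of a UUID-keyed node directory: two distinct host IDs may map
// to the same InetAddress (a DOWN natural endpoint and an UP replacing
// endpoint), which is impossible when the map is keyed by address.
public class HostIdDirectorySketch
{
    static class NodeRecord
    {
        final UUID hostId;
        final InetAddress address;
        final boolean up;
        NodeRecord(UUID hostId, InetAddress address, boolean up)
        {
            this.hostId = hostId;
            this.address = address;
            this.up = up;
        }
    }

    private final Map<UUID, NodeRecord> nodes = new HashMap<>();

    void register(NodeRecord record)
    {
        nodes.put(record.hostId, record); // keyed by host ID, not by address
    }

    // Both the dying node (UUID=X, DOWN) and its replacement (UUID=Y, UP)
    // can live here simultaneously even with the same address; address
    // resolution would happen only at the messaging layer.
    Optional<NodeRecord> lookup(UUID hostId)
    {
        return Optional.ofNullable(nodes.get(hostId));
    }
}
{code}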

While the effort would be relatively big, I think it would be mostly mechanical, and the changes would be largely internal, not affecting communication, storage or upgrades. Besides allowing us to implement replace correctly once and for all, it would probably simplify other parts of the code as well (such as the {{prefer_local}} logic). But I'm still not sure whether this is totally doable and worth the effort. Perhaps we could do it as a first step toward CASSANDRA-6061 and target it for 4.0? WDYT [~brandon.williams], [~thobbs], [~jkni]?

> replace_address is not topology-aware
> -------------------------------------
>
>                 Key: CASSANDRA-9244
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9244
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata, Streaming and Messaging
>         Environment: 2.0.12
>            Reporter: Rick Branson
>            Assignee: Paulo Motta
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> Replaced a node with one in another rack (using replace_address), and it caused improper distribution after the bootstrap finished. It looks like the ranges for the streams are not created in a topology-aware way. This should probably either be prevented or, ideally, work properly. The use case is migrating several nodes from one rack to another.


