Posted to commits@cassandra.apache.org by "Joseph Lynch (JIRA)" <ji...@apache.org> on 2018/03/30 23:57:00 UTC

[jira] [Commented] (CASSANDRA-14001) Gossip after node restart can take a long time to converge about "down" nodes in large clusters

    [ https://issues.apache.org/jira/browse/CASSANDRA-14001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421084#comment-16421084 ] 

Joseph Lynch commented on CASSANDRA-14001:
------------------------------------------

After digging in deeply, I think the evidence indicates that this is not an issue with Gossip itself, but with how we establish connections on startup. I think we're just hitting a combination of CASSANDRA-13993 and CASSANDRA-14001. I'm going to close this out since I don't think the issue is Gossip related.

> Gossip after node restart can take a long time to converge about "down" nodes in large clusters
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14001
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14001
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Lifecycle
>            Reporter: Joseph Lynch
>            Priority: Minor
>
> When nodes restart in a large cluster, they mark all nodes as "alive": this first calls {{markDead}}, then sends an {{EchoMessage}}, and only in the callback to that echo marks the node as alive. This works great, except when that initial echo fails for whatever reason, in which case the node is marked as dead and remains dead for a long while.
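> As a rough illustration, the startup flow looks something like this (a minimal sketch in simplified Java, not the actual Gossiper code; {{sendEchoWithCallback}} is a hypothetical stand-in):
> {noformat}
> // Sketch of the flow described above, NOT the real Gossiper API.
> // The endpoint is only marked alive again if the echo round-trip succeeds;
> // a single dropped echo leaves it marked dead with no retry here.
> void markAliveOnStartup(InetAddress endpoint, EndpointState state)
> {
>     markDead(endpoint, state);                    // provisionally down
>     sendEchoWithCallback(endpoint, response ->    // hypothetical helper
>         realMarkAlive(endpoint, state));          // runs only on echo success
>     // If the echo is lost, the endpoint stays "down" until some future
>     // gossip round happens to select it -- which is what this ticket times.
> }
> {noformat}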
> We mostly see this on 100+ node clusters, and almost always when nodes are in different datacenters connected by unreliable networks (e.g., cross-region in AWS). I think it comes down to a combination of:
> 1. Only a node itself can mark another node as "UP"
> 2. Nodes only gossip with dead nodes with probability {{#dead / (#live + 1)}}
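> To make rule #2 concrete, here's a minimal sketch of the per-round dead-node gossip decision (simplified Java with hypothetical names like {{sendGossipSyn}}, not the actual Gossiper code):
> {noformat}
> // Sketch of rule #2, NOT the real Gossiper code: after gossiping with a live
> // peer, a node additionally gossips with some unreachable peer only with
> // probability #dead / (#live + 1), which is tiny when few nodes are down.
> void maybeGossipWithDeadNode(Random rng, List<InetAddress> live, List<InetAddress> dead)
> {
>     if (dead.isEmpty())
>         return;
>     double p = (double) dead.size() / (live.size() + 1);
>     if (rng.nextDouble() < p)
>         sendGossipSyn(dead.get(rng.nextInt(dead.size())));  // hypothetical send
> }
> {noformat}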
> In particular the algorithm in #2 leads to long convergence times because the number of dead nodes is typically very small compared to the cluster size. My back-of-the-envelope model of this algorithm indicates that for a 100 node cluster this would take an average of ~50 seconds with a stdev of ~50 seconds, which means we might be waiting _minutes_ for the nodes to gossip with each other. I'm modeling this as the minimum of two [geometric distributions|https://en.wikipedia.org/wiki/Geometric_distribution], each with parameter {{p = 1/#nodes}}, which yields a geometric distribution with parameter {{p = 1 - (1 - 1/#nodes)^2}}. So for a 100 node cluster:
> {noformat}
> 100 node cluster =>
> X = rounds until node1 gossips with node2 ~ geom(0.01)
> Y = rounds until node2 gossips with node1 ~ geom(0.01)
> Z = min(X, Y) ~ geom(1 - (1 - 0.01)^2) ~= geom(0.02)
> E[Z] = 1/0.02 = 50
> V[Z] = (1 - 0.02)/(0.02)^2 = 2450
>
> 1000 node cluster =>
> Z ~ geom(1 - (1 - 0.001)^2) ~= geom(0.002)
> E[Z] = 1/0.002 = 500
> V[Z] = (1 - 0.002)/(0.002)^2 = 249500
> {noformat}
> Since we gossip every second, that means that in expectation these nodes would see each other after about a minute in a 100 node cluster, and after ~8 minutes in a 1000 node cluster. The variance is large as well (stdev of roughly 50 and 500 seconds respectively), which means that in unlucky edge cases we might be waiting far longer before these nodes gossip with each other.
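> As a sanity check on the model (a throwaway Monte Carlo sketch, not part of any patch), simulating two nodes that each pick a uniformly random gossip target once per round reproduces these expectations:
> {noformat}
> import java.util.Random;
>
> // Throwaway Monte Carlo check of the geometric model above: two nodes each
> // pick one of #nodes targets uniformly per round; count rounds until either
> // picks the other. Expect ~50 rounds at 100 nodes and ~500 at 1000 nodes.
> public class GossipConvergenceSim
> {
>     static double meanRoundsToMeet(int nodes, int trials, Random rng)
>     {
>         long total = 0;
>         for (int t = 0; t < trials; t++)
>         {
>             int rounds = 0;
>             boolean met = false;
>             while (!met)
>             {
>                 rounds++;
>                 met = rng.nextInt(nodes) == 0 || rng.nextInt(nodes) == 0;
>             }
>             total += rounds;
>         }
>         return (double) total / trials;
>     }
>
>     public static void main(String[] args)
>     {
>         Random rng = new Random(42);
>         System.out.println("100 nodes:  " + meanRoundsToMeet(100, 100000, rng));   // ~50
>         System.out.println("1000 nodes: " + meanRoundsToMeet(1000, 100000, rng));  // ~500
>     }
> }
> {noformat}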
> I'm thinking of writing a patch which does one of the following:
> # Makes the gossip order a shuffled list that includes dead nodes, à la [SWIM gossip|https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf] (see the sketch below). This wastes some rounds on dead nodes but guarantees gossip converges in time linear in the cluster size.
> # Adds an endpoint that re-triggers gossip with all nodes. Operators could call this a few times after a restart if they detect a gossip inconsistency.
> # Bounds the probability that we gossip with a dead node from below at some reasonable floor like 1/10. This might cause a lot of gossip load on large clusters when a node is actually down, but would also bound the variance.
> # Something else?
> I've got a WIP [branch|https://github.com/apache/cassandra/compare/cassandra-3.11...jolynch:force_gossip] against 3.11 which implements options #1 and #2, but I can reduce/change/modify it as needed if people think there is a better way. The patch doesn't pass tests yet, but I'm not going to change or add tests unless we agree that moving to time-bounded gossip for down nodes is a good idea.
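> For illustration, option #1 could look something like the following (a minimal sketch under my own naming, not the actual code on the branch above):
> {noformat}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
>
> // Sketch of option #1 (hypothetical names, not the branch above): keep one
> // shuffled list of ALL known endpoints, live and dead, and walk it round
> // robin, SWIM-style. Every endpoint is then selected at least once every
> // |endpoints| rounds, so convergence is bounded linearly in cluster size
> // instead of by the geometric tail modeled earlier.
> class ShuffledGossipOrder<E>
> {
>     private List<E> order = new ArrayList<>();
>     private int cursor = 0;
>
>     E nextTarget(List<E> allEndpoints)
>     {
>         if (cursor >= order.size())                 // finished a full pass
>         {
>             order = new ArrayList<>(allEndpoints);  // live AND dead nodes
>             Collections.shuffle(order);
>             cursor = 0;
>         }
>         return order.get(cursor++);
>     }
> }
> {noformat}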



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org