Posted to commits@cassandra.apache.org by "Brandon Williams (JIRA)" <ji...@apache.org> on 2014/12/12 23:01:16 UTC

[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

    [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244857#comment-14244857 ] 

Brandon Williams commented on CASSANDRA-8336:
---------------------------------------------

The real wrinkle here is that when we send the shutdown message, our latest heartbeat hasn't fully propagated, so even if the other nodes quarantine us, they will eventually see a newer heartbeat and mark the killed node as back up.  I'm not sure what we can do about that, short of changing the shutdown message format to include the heartbeat and then filtering on it.  Unfortunately, that puts us in 3.1 territory, where we'll have to make 3.0 a prerequisite.
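
To make the heartbeat-filtering idea concrete, here is a minimal sketch, not Cassandra code. It assumes a hypothetical message format in which the shutdown announcement carries the sender's final heartbeat version, and it ignores generations entirely. The names `EndpointView`, `on_shutdown` and `on_gossip` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointView:
    """What one node believes about a single peer (hypothetical, simplified)."""
    heartbeat: int = 0                        # highest heartbeat version seen for the peer
    alive: bool = True
    shutdown_heartbeat: Optional[int] = None  # final heartbeat carried by the shutdown message

    def on_shutdown(self, final_heartbeat: int) -> None:
        # Assumption: the shutdown message includes the sender's last heartbeat.
        self.shutdown_heartbeat = final_heartbeat
        self.heartbeat = max(self.heartbeat, final_heartbeat)
        self.alive = False

    def on_gossip(self, heartbeat: int) -> None:
        # Heartbeats at or below the announced final version were generated
        # before the shutdown, so they must not mark the peer alive again.
        if self.shutdown_heartbeat is not None and heartbeat <= self.shutdown_heartbeat:
            return
        if heartbeat > self.heartbeat:
            # Anything newer than the final heartbeat is treated as real activity.
            self.heartbeat = heartbeat
            self.alive = True
            self.shutdown_heartbeat = None


# Y last saw heartbeat 10 for X; X shuts down having reached heartbeat 12.
y = EndpointView(heartbeat=10)
y.on_shutdown(final_heartbeat=12)
y.on_gossip(12)   # stale state relayed by Z, filtered out
stale_result = y.alive
y.on_gossip(13)   # activity after the announced final heartbeat
fresh_result = y.alive
```

In this sketch the stale heartbeat relayed by a third node leaves X marked down, while anything newer than the announced final heartbeat is still accepted.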

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.12
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem is that this isn't sufficient; you can still get TimedOutExceptions (TOEs) and have to wait on the failure detector (FD) to figure things out.  This happens due to gossip propagation time and variance: if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, Z can initiate gossip with Y and thus mark X alive again.  I propose quarantining to solve this; however, I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires.
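
The quarantine proposed in the description above could look roughly like the following sketch. It is illustrative only: the class name `Quarantine`, the `enabled` switch (standing in for the opt-in -D parameter) and the 30-second duration are all assumptions, not anything taken from Cassandra.

```python
from typing import Dict


class Quarantine:
    """Ignore gossip about an endpoint for a fixed period after its shutdown message."""

    def __init__(self, enabled: bool, duration: float) -> None:
        self.enabled = enabled            # opt-in, mirroring the proposed -D parameter
        self.duration = duration          # seconds; the value used below is arbitrary
        self._until: Dict[str, float] = {}

    def on_shutdown(self, endpoint: str, now: float) -> None:
        # Start (or restart) the quarantine window when a shutdown message arrives.
        if self.enabled:
            self._until[endpoint] = now + self.duration

    def accepts_gossip(self, endpoint: str, now: float) -> bool:
        expiry = self._until.get(endpoint)
        if expiry is None:
            return True
        if now >= expiry:
            del self._until[endpoint]     # quarantine expired, resume normal gossip
            return True
        return False                      # still quarantined: drop state about this endpoint


q = Quarantine(enabled=True, duration=30.0)
q.on_shutdown("X", now=100.0)
during = q.accepts_gossip("X", now=110.0)  # Z's stale state about X is dropped
after = q.accepts_gossip("X", now=130.0)   # window over, X may be seen again
```

The same window that drops stale state also rejects a node that restarts quickly, which is why the description suggests making the behaviour opt-in.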



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)