Posted to commits@cassandra.apache.org by "Brandon Williams (JIRA)" <ji...@apache.org> on 2015/03/31 21:02:54 UTC

[jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

     [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-8336:
----------------------------------------
    Attachment: 8336-v4.txt

After wrestling with exceptions for a bit, I came up with a simpler solution.  Gossiper's stop() can examine the local state itself, and skip the shutdown announcement if that state doesn't exist.  We still need stopSilently (which I renamed in this patch from stopForLeaving) for cases like decom, where we aren't coming back and don't want to mutate our state on shutdown.
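
A minimal sketch of the idea, for illustration only and not the code in 8336-v4.txt.  The class GossiperSketch, its fields, and the string-valued states are hypothetical stand-ins; only the names stop() and stopSilently() come from the comment above.  It also assumes that a normal stop() marks the local state as shut down before announcing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a gossiper; these names are not Cassandra's real API.
class GossiperSketch {
    private final String localEndpoint;
    // Endpoint -> status string; a stand-in for real per-endpoint gossip state.
    private final Map<String, String> endpointStates = new ConcurrentHashMap<>();
    // Records announcements that would have been sent to peers.
    final List<String> announcements = new ArrayList<>();
    private boolean running = true;

    GossiperSketch(String localEndpoint) {
        this.localEndpoint = localEndpoint;
    }

    void registerLocalState(String status) {
        endpointStates.put(localEndpoint, status);
    }

    // Normal stop: announce shutdown only if local state exists.
    // Assumption: the announcement path also mutates the local state.
    void stop() {
        if (endpointStates.containsKey(localEndpoint)) {
            endpointStates.put(localEndpoint, "shutdown");
            announcements.add("SHUTDOWN from " + localEndpoint);
        }
        running = false;
    }

    // Silent stop (e.g. decommission): no state mutation, no announcement.
    void stopSilently() {
        running = false;
    }

    boolean isRunning() {
        return running;
    }

    String localStatus() {
        return endpointStates.get(localEndpoint);
    }
}
```

With this shape, callers that never registered local state can call stop() without any special-case exception handling, while decom keeps its own path that leaves state untouched.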

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.14
>
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336-v4.txt, 8336.txt
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out.  This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again.  I propose quarantining to solve this; however, I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires.
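
The quarantine proposal in the description could be sketched roughly as follows.  This is a hedged illustration, not the patch: the class ShutdownQuarantineSketch, its methods, and the property name sketch.shutdown_quarantine_ms are all hypothetical.  It assumes the quarantine is off unless the -D property is set, and that time is passed in explicitly to keep the example easy to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: ignore gossip state for an endpoint for a while
// after that endpoint has announced its shutdown.
class ShutdownQuarantineSketch {
    // Hypothetical property name, e.g. -Dsketch.shutdown_quarantine_ms=30000
    static final String PROPERTY = "sketch.shutdown_quarantine_ms";

    private final long quarantineMillis;
    // Endpoint -> time (ms) at which its quarantine ends.
    private final Map<String, Long> quarantinedUntil = new ConcurrentHashMap<>();

    ShutdownQuarantineSketch(long quarantineMillis) {
        this.quarantineMillis = quarantineMillis;
    }

    // Opt-in: defaults to 0 (disabled) when the property is not set.
    static ShutdownQuarantineSketch fromSystemProperty() {
        return new ShutdownQuarantineSketch(Long.getLong(PROPERTY, 0L));
    }

    // Called when a shutdown announcement for the endpoint arrives.
    void onShutdownMessage(String endpoint, long nowMillis) {
        if (quarantineMillis > 0) {
            quarantinedUntil.put(endpoint, nowMillis + quarantineMillis);
        }
    }

    // Returns false while the endpoint is quarantined, so that state relayed
    // by a peer that has not yet seen the shutdown cannot mark it alive again.
    boolean shouldApplyRemoteState(String endpoint, long nowMillis) {
        Long until = quarantinedUntil.get(endpoint);
        if (until == null) {
            return true;
        }
        if (nowMillis >= until) {
            quarantinedUntil.remove(endpoint);
            return true;
        }
        return false;
    }
}
```

In the X/Y/Z scenario, Y would call onShutdownMessage for X and then reject Z's newer state for X until the window expires, which is also why a restart of X inside that window would not be accepted.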


