Posted to commits@cassandra.apache.org by "Brandon Williams (JIRA)" <ji...@apache.org> on 2014/02/05 23:44:11 UTC

[jira] [Comment Edited] (CASSANDRA-6590) Gossip does not heal after a temporary partition at startup

    [ https://issues.apache.org/jira/browse/CASSANDRA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892699#comment-13892699 ] 

Brandon Williams edited comment on CASSANDRA-6590 at 2/5/14 10:43 PM:
----------------------------------------------------------------------

I'm not sure why the block in handleMajorStateChange moved, but because the endpoint state is now added before that block, the lookup for it will never return null, so it always says the node restarted (and we should keep the 'UP' message there to keep it easy to look for) even though it's the first time the node has been seen.
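
The ordering problem above can be illustrated with a minimal sketch. This is not the actual Gossiper code: the class, the map, the generation values, and the "NEW"/"RESTARTED" return strings are hypothetical stand-ins, and only the name handleMajorStateChange comes from the comment. It just shows why a null check performed after the state has already been stored can never see null.

```java
import java.util.HashMap;
import java.util.Map;

public class StateChangeOrdering {
    // Hypothetical stand-in for the gossiper's endpoint state map
    // (endpoint address -> generation).
    private final Map<String, Integer> endpointStates = new HashMap<>();

    // Buggy ordering: the state is stored first, so the lookup that
    // follows always finds an entry and every node looks like a restart.
    public String handleMajorStateChangeBuggy(String endpoint, int generation) {
        endpointStates.put(endpoint, generation);
        Integer known = endpointStates.get(endpoint);
        return known == null ? "NEW" : "RESTARTED";
    }

    // Checking the previous value while storing distinguishes a first
    // sighting from a restart. Map.put returns the prior mapping or null.
    public String handleMajorStateChangeFixed(String endpoint, int generation) {
        Integer previous = endpointStates.put(endpoint, generation);
        return previous == null ? "NEW" : "RESTARTED";
    }

    public static void main(String[] args) {
        StateChangeOrdering buggy = new StateChangeOrdering();
        System.out.println(buggy.handleMajorStateChangeBuggy("10.0.0.1", 1));

        StateChangeOrdering fixed = new StateChangeOrdering();
        System.out.println(fixed.handleMajorStateChangeFixed("10.0.0.1", 1));
        System.out.println(fixed.handleMajorStateChangeFixed("10.0.0.1", 2));
    }
}
```

In the buggy variant the very first sighting of an endpoint is already reported as a restart, which matches the symptom described here; in the other variant only the second call for the same endpoint is.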

I think the if (!localState.isAlive()) check is problematic, because while it got rid of the repeated UP messages, it also seemed to introduce a race condition where some nodes would sometimes end up in a cluster by themselves.  I briefly tried making Echo verbs droppable in CASSANDRA-6661 instead, but that didn't help, so I'm not sure why we're seemingly building these requests up, or if something else is making realMarkAlive fire so much.

Finally, I think we'll need a separate yaml option, since removing things in a minor release is kind of mean to upgraders who don't catch it and then find their server won't start.




was (Author: brandon.williams):
I'm not sure why the block in handleMajorStateChange, but because the endpoint state is added before that the check for it will never be null, so it always says the node restarted (and we should keep the 'UP' message there to keep it easy to look for) even though it's the first time it's been seen.

I think the if (!localState.isAlive()) check is problematic, because while it got rid of the repeated UP messages, it also seem to introduce a race situation where sometimes some nodes would end up in a cluster by themselves.  I briefly tried making Echo verbs droppable in CASSANDRA-6661 instead, but that didn't help, so I'm not sure why we're seemingly building these requests up, or if something else is making realMarkAlive fire so much.

Finally, I think we'll need a separate yaml option, since removing things in a minor is kind of mean to upgraders who don't catch it and their server won't start.



> Gossip does not heal after a temporary partition at startup
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-6590
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6590
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Vijay
>             Fix For: 2.0.6
>
>         Attachments: 0001-CASSANDRA-6590.patch, 0001-logging-for-6590.patch, 6590_disable_echo.txt
>
>
> See CASSANDRA-6571 for background.  If a node is partitioned on startup when the echo command is sent, but then the partition heals, the halves of the partition will never mark each other up despite being able to communicate.  This stems from CASSANDRA-3533.
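
The failure mode in the description can be pictured with a toy model. Everything here is hypothetical (class name, method names, the boolean "reachable" flag), and it rests on an assumption not spelled out in the ticket text: that a peer is marked up only when an echo round trip completes, and that the echo is attempted once rather than retried. Under that assumption, an echo lost during the startup partition leaves the peer down even after the partition heals.

```java
import java.util.HashSet;
import java.util.Set;

public class EchoPartitionSketch {
    private final Set<String> echoAttempted = new HashSet<>();
    private final Set<String> alive = new HashSet<>();
    private final boolean retryEcho;

    // retryEcho = false models the assumed one-shot behaviour.
    public EchoPartitionSketch(boolean retryEcho) {
        this.retryEcho = retryEcho;
    }

    // Called each time gossip about a peer is observed. "reachable" models
    // whether an echo round trip could complete at this moment.
    public void onGossip(String peer, boolean reachable) {
        if (alive.contains(peer)) {
            return;
        }
        boolean firstAttempt = echoAttempted.add(peer);
        if (!firstAttempt && !retryEcho) {
            return; // one-shot: a lost echo is never resent
        }
        if (reachable) {
            alive.add(peer); // echo reply arrived, mark the peer up
        }
    }

    public boolean isAlive(String peer) {
        return alive.contains(peer);
    }

    public static void main(String[] args) {
        EchoPartitionSketch oneShot = new EchoPartitionSketch(false);
        oneShot.onGossip("10.0.0.2", false); // partitioned at startup
        oneShot.onGossip("10.0.0.2", true);  // partition healed
        System.out.println("one-shot alive: " + oneShot.isAlive("10.0.0.2"));

        EchoPartitionSketch retrying = new EchoPartitionSketch(true);
        retrying.onGossip("10.0.0.2", false);
        retrying.onGossip("10.0.0.2", true);
        System.out.println("retrying alive: " + retrying.isAlive("10.0.0.2"));
    }
}
```

The one-shot instance never marks the peer up, while the retrying one does once the network is reachable again. Whether a retry is the right fix for the real gossiper is a separate question that this sketch does not answer.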


