Posted to commits@cassandra.apache.org by "Jason Brown (JIRA)" <ji...@apache.org> on 2014/11/22 07:02:34 UTC

[jira] [Commented] (CASSANDRA-8260) Replacing a node can leave the old node in system.peers on the replacement

    [ https://issues.apache.org/jira/browse/CASSANDRA-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221835#comment-14221835 ] 

Jason Brown commented on CASSANDRA-8260:
----------------------------------------

+1 on the patch, with one small nit: rename the second parameter of the overloaded quarantineEndpoint() from quarantineExpiration to quarantineStart (or something similar). The reason is that the timestamp indicates when the endpoint is put into quarantine, not when the quarantine should expire.
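
To illustrate the naming point, here is a minimal, self-contained sketch. It is not the Gossiper code from the patch: the class name QuarantineSketch, the String endpoint key, the map name, and the isQuarantined() helper are all assumptions made for the example. Only quarantineEndpoint() and the two parameter names come from the comment above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the quarantine bookkeeping; not Cassandra's actual class.
class QuarantineSketch
{
    private final long quarantineDelayMs;
    // endpoint -> time (ms) at which the endpoint was put into quarantine
    private final Map<String, Long> quarantined = new ConcurrentHashMap<>();

    QuarantineSketch(long quarantineDelayMs)
    {
        this.quarantineDelayMs = quarantineDelayMs;
    }

    // Quarantine starting now.
    void quarantineEndpoint(String endpoint)
    {
        quarantineEndpoint(endpoint, System.currentTimeMillis());
    }

    // The second argument is the START of the quarantine, hence quarantineStart.
    void quarantineEndpoint(String endpoint, long quarantineStart)
    {
        quarantined.put(endpoint, quarantineStart);
    }

    // Expiry is derived at check time: start + delay (assumed helper, for illustration).
    boolean isQuarantined(String endpoint, long now)
    {
        Long quarantineStart = quarantined.get(endpoint);
        return quarantineStart != null && now - quarantineStart <= quarantineDelayMs;
    }
}
```

Because the stored value is compared against "now minus delay", a caller who read the parameter as an expiration time and passed start + delay would silently extend the quarantine by a full delay period.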

This is a reasonable fix to resolve this timing issue, but I'll add some thoughts to #8304 about cleaning up the peers.


> Replacing a node can leave the old node in system.peers on the replacement
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8260
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8260
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.12
>
>         Attachments: 8260.txt
>
>
> Here's what happens:
> Nodes: X, Y, Z. Z replaces Y which is dead.
> 0. Replacement finishes
> 1. Z removes Y, quarantines and evicts (that is, removes the state)
> 2. X sees the replacement, quarantines, but keeps state
> 3. 60s elapses
> 4. quarantine on Z expires
> 5. X sends syn to Z, repopulates Y endpoint state and persists to system.peers, but Z sees the conflict and does not update tMD for Y. 
> 6. FatClient timer on Z starts counting.
> 7. quarantine on X expires, fat client has been idle, evicts and re-quarantines
> 8. 30s elapses
> 9. Fat client timeout occurs on Z, evicts and re-quarantines
> 10. 30s elapses
> 11. quarantine on X expires, so it never gets repopulated with Y since Z already removed it
> It's important to note here that there is a small but relevant gap between steps 1 and 2, which then correlates to steps 4 and 5, and step 5 is where the problem occurs. This also explains why it looks related to RING_DELAY, since the quarantine is RING_DELAY * 2, but Y never quarantines and the fat client timeout is RING_DELAY, effectively making the discrepancy near equal to RING_DELAY in the end.
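
The gap described in the quoted steps can be sketched numerically. This is only an illustration under stated assumptions: RING_DELAY is taken as 30s (implied by the 60s quarantine in step 3), the 1s gap between steps 1 and 2 is an invented value, and the class and method names are hypothetical.

```java
// Hypothetical timeline arithmetic; not Cassandra code.
class QuarantineWindowSketch
{
    static final long RING_DELAY_MS = 30_000L;                 // assumed: half of the 60s in step 3
    static final long QUARANTINE_DELAY_MS = RING_DELAY_MS * 2; // quarantine is RING_DELAY * 2

    // Length of the window in which Z's quarantine of Y has expired (step 4)
    // while X is still holding Y's state (before step 7), so a syn from X
    // can repopulate Y on Z (step 5).
    static long exposureWindowMs(long zQuarantineStart, long xQuarantineStart)
    {
        long zExpiry = zQuarantineStart + QUARANTINE_DELAY_MS;
        long xExpiry = xQuarantineStart + QUARANTINE_DELAY_MS;
        return Math.max(0L, xExpiry - zExpiry);
    }

    public static void main(String[] args)
    {
        long gapMs = 1_000L; // assumed gap between steps 1 and 2
        System.out.println("exposure window (ms): " + exposureWindowMs(0L, gapMs));
    }
}
```

Under these assumptions the window equals the gap between the two quarantine start times, which is why even a small delay between steps 1 and 2 is enough to trigger step 5.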


