Posted to commits@cassandra.apache.org by "Jason Brown (JIRA)" <ji...@apache.org> on 2013/06/19 20:00:33 UTC

[jira] [Updated] (CASSANDRA-5665) Gossiper.handleMajorStateChange can lose existing node ApplicationState

     [ https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Brown updated CASSANDRA-5665:
-----------------------------------

    Attachment: 5665-v1.diff

The attached patch modifies Gossiper.handleMajorStateChange to check whether the endpoint already exists in the endpointStateMap. If it does, the patch copies each previous ApplicationState entry into the new epState when either a) that AppState does not exist in the new epState, or b) the previous AppState has a version greater than the one in the new epState.
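
As a rough illustration of that merge rule (not the contents of 5665-v1.diff), here is a minimal sketch. Everything in it is a hypothetical stand-in: GossipMergeSketch, AppState, Versioned and mergePrevious are made-up names, and plain maps take the place of Cassandra's real endpoint state classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the gossip types; these are NOT Cassandra's classes.
final class GossipMergeSketch {
    enum AppState { STATUS, LOAD, SCHEMA, DC, RACK }

    // A value tagged with the version at which it was set.
    static final class Versioned {
        final String value;
        final int version;

        Versioned(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    /**
     * Returns a copy of the incoming state, with entries from the previous
     * state carried over when the incoming state lacks that key (case a) or
     * holds a lower version for it (case b).
     */
    static Map<AppState, Versioned> mergePrevious(Map<AppState, Versioned> previous,
                                                  Map<AppState, Versioned> incoming) {
        Map<AppState, Versioned> merged = new HashMap<>(incoming);
        if (previous == null) {
            return merged; // endpoint was not known before; nothing to retain
        }
        for (Map.Entry<AppState, Versioned> e : previous.entrySet()) {
            Versioned fresh = merged.get(e.getKey());
            if (fresh == null || e.getValue().version > fresh.version) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

With a plain put of the incoming state, a previously known DC entry that is missing from the incoming state would be lost; with a merge along these lines it would be retained until gossip delivers a newer value.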

On the surface the patch is straightforward, but I'm not sure whether subtle bugs might creep in from retaining previous state (although that state might get replaced anyway within a very short time). Thus, while the patch fixes 'my problem', I'm not sure this is the safest way to resolve it.

                
> Gossiper.handleMajorStateChange can lose existing node ApplicationState
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-5665
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5665
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.5
>            Reporter: Jason Brown
>            Priority: Minor
>              Labels: gossip, upgrade
>             Fix For: 1.2.6, 2.0 beta 1
>
>         Attachments: 5665-v1.diff
>
>
> Dovetailing on #5660, I discovered that further along during an upgrade, when more nodes are on the new major version, a node on the previous version can get passed some incomplete Gossip info about another, already upgraded node, and the older node drops AppState info about that node.
> I think what happens is that a 1.1 node (older rev) gets gossip info from a 1.2 node (A), which includes incomplete (lacking some AppState data) gossip info about another 1.2 node (B). The 1.1 node, which has incorrectly kicked node B out of gossip due to the bug described in #5660, then takes that incomplete node B info and wholesale replaces any previously known state about node B in Gossiper.handleMajorStateChange. Thus, if we previously had DC/RACK info, it'll get dropped as part of the endpointStateMap.put(endpointstate). When the data being passed is incomplete, the 1.1 node will start referencing node B and get into the NPE situation in #5498.
> Anecdotally, this bad state is short-lived: less than a few minutes, sometimes as short as ten seconds, until gossip catches up and properly propagates the AppState data. Furthermore, when upgrading a two-datacenter, 48-node cluster, it only occurred on two nodes, for less than a minute each. Thus, the scope seems limited, but the problem can occur.
