Posted to commits@cassandra.apache.org by "Jason Brown (JIRA)" <ji...@apache.org> on 2015/06/22 14:20:01 UTC

[jira] [Resolved] (CASSANDRA-5665) Gossiper.handleMajorStateChange can lose existing node ApplicationState

     [ https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Brown resolved CASSANDRA-5665.
------------------------------------
    Resolution: Won't Fix

(2.0 years later) Yes, we should close this :)

> Gossiper.handleMajorStateChange can lose existing node ApplicationState
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-5665
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5665
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.5
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: gossip, upgrade
>         Attachments: 5665-v1.diff, 5665-v2.diff
>
>
> Dovetailing on CASSANDRA-5660, I discovered that further along during an upgrade, when more nodes are on the new major version, a node on the previous version can get passed some incomplete Gossip info about another, already upgraded node, and the older node drops AppState info about that node.
> I think what happens is that a 1.1 node (older rev) gets gossip info from a 1.2 node (A), which includes incomplete (lacking some AppState data) gossip info about another 1.2 node (B). The 1.1 node, which has incorrectly kicked node B out of gossip due to the bug described in #5660, then takes that incomplete node B info and wholesale replaces any previously known state about node B in Gossiper.handleMajorStateChange. Thus, if we previously had DC/RACK info, it'll get dropped as part of the endpointStateMap.put(endpointstate). When the data being passed is incomplete, 1.1 will start referencing node B and get into the NPE situation in #5498.
> Anecdotally, this bad state is short-lived: less than a few minutes, and sometimes as short as ten seconds, until gossip catches up and properly propagates the AppState data. Furthermore, when upgrading a two-datacenter, 48-node cluster, it only occurred on two nodes for less than a minute each. Thus, the scope seems limited, but the problem can occur.
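
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not Cassandra's Gossiper code and not the approach taken in the attached diffs; the class GossipStateSketch, the AppState enum, and the replaceState/mergeState helpers are hypothetical names, and the endpoint state is simplified to a map of strings. It only contrasts a wholesale put (which drops previously known DC/RACK values when the incoming state is incomplete) with a merge that keeps them.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of per-endpoint gossip state; not Cassandra's real classes.
public class GossipStateSketch {
    enum AppState { STATUS, DC, RACK }

    // Wholesale replacement: everything previously known about the endpoint is discarded.
    static void replaceState(Map<String, Map<AppState, String>> endpointStateMap,
                             String endpoint, Map<AppState, String> incoming) {
        Map<AppState, String> copy = new EnumMap<>(AppState.class);
        copy.putAll(incoming);
        endpointStateMap.put(endpoint, copy);
    }

    // Merge: incoming values overwrite matching keys, but keys missing from the
    // incoming (incomplete) state keep their previously known values.
    static void mergeState(Map<String, Map<AppState, String>> endpointStateMap,
                           String endpoint, Map<AppState, String> incoming) {
        endpointStateMap
            .computeIfAbsent(endpoint, k -> new EnumMap<>(AppState.class))
            .putAll(incoming);
    }

    // State the older node already holds about node B, including DC/RACK.
    static Map<String, Map<AppState, String>> knownCluster() {
        Map<AppState, String> nodeB = new EnumMap<>(AppState.class);
        nodeB.put(AppState.STATUS, "NORMAL");
        nodeB.put(AppState.DC, "dc1");
        nodeB.put(AppState.RACK, "rack1");
        Map<String, Map<AppState, String>> map = new HashMap<>();
        map.put("nodeB", nodeB);
        return map;
    }

    // Incomplete gossip about node B, lacking DC/RACK.
    static Map<AppState, String> incompleteUpdate() {
        Map<AppState, String> incoming = new EnumMap<>(AppState.class);
        incoming.put(AppState.STATUS, "NORMAL");
        return incoming;
    }

    public static void main(String[] args) {
        Map<String, Map<AppState, String>> replaced = knownCluster();
        replaceState(replaced, "nodeB", incompleteUpdate());
        // DC is now absent; code that dereferences it unguarded would hit an NPE.
        System.out.println("after replace, DC = " + replaced.get("nodeB").get(AppState.DC));

        Map<String, Map<AppState, String>> merged = knownCluster();
        mergeState(merged, "nodeB", incompleteUpdate());
        System.out.println("after merge, DC = " + merged.get("nodeB").get(AppState.DC));
    }
}
```

Whether merging is appropriate in a real major state change (for example, after a genuine restart with a new generation) is a separate question that this sketch does not address.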



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)