You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2015/07/15 10:57:05 UTC

[jira] [Comment Edited] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions

    [ https://issues.apache.org/jira/browse/CASSANDRA-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627739#comment-14627739 ] 

Stefania edited comment on CASSANDRA-9765 at 7/15/15 8:56 AM:
--------------------------------------------------------------

Thank you for your in-dept analysis Jason. If I understood correctly, this method should be called in {{checkForEndpointCollision}} as follows and replace all other conditions:

{code}
if (!Gossiper.instance.isSafeForBootstrap(FBUtilities.getBroadcastAddress()))
{code}

I think it makes sense to look at the gossip states rather than {{TMD.isMember()}}. I am still unclear however on the empty, BOOTSTRAP and SHUTDOWN states. 

In your method the first is not allowed and the last two are both allowed. According to CASSANDRA-7939 BOOTSTRAP should be allowed. What about SHUTDOWN? 

If you clarify which states are OK to join the cluster then I can rewrite and test the method, here is what I have so far:

{code}
public boolean isSafeForBootstrap(InetAddress endpoint)
    {
        EndpointState epState = endpointStateMap.get(endpoint);
        
        // if there's no previous state, or the node was previously removed from the cluster, we're good
        if (epState == null || isDeadState(epState))
            return true;

       String state = getApplicationState(epState);

        // these states are not allowed to join the cluster
        List<String> states = new ArrayList<String>() {{
            add(""); // failed bootstrap but we did start gossiping <== is this allowed or NOT? (CASSANDRA-7939)
            add(VersionedValue.STATUS_NORMAL); // node is legit in the cluster or it was stopped with kill -9 <== definitely not allowed
            add(VersionedValue.STATUS_BOOTSTRAPPING); // failed bootstrap (e.g. kill -9 during bootstrap) <== is this allowed or NOT? (CASSANDRA-7939)
            add(VersionedValue.SHUTDOWN); }}; // node was shutdown but not removed <== not allowed also, right?
        return !states.contains(state);
    }
{code}

As for the tests, I was able to add 3 more tests so right now we have the following states:

* SHUTDOWN - node is shutdown gracefully, test name : _shutdown_wiped_node_cannot_join_test_
* NORMAL - node is shutdown with kill -9, test name : _killed_wiped_node_cannot_join_test_
* LEFT - node is decommissioned, test name : _decommissioned_wiped_node_can_join_test_
* BOOT - node is shutdown with kill -9 during bootstrap, test name : _failed_bootstap_wiped_node_can_join_test_

I think these are the essential tests, the only one missing is perhaps killing the node after it starts gossiping and before it enters the bootstrap, but I am not sure how easy it is as the python code must reliably kill the node at the right time (with the bootstrap I was able to extend it by increasing the amount of data streamed). 
 


was (Author: stefania):
Thank you for your in-dept analysis Jason. If I understood correctly, this method should be called in {{checkForEndpointCollision}} as follows and replace all other conditions:

{code}
if (!Gossiper.instance.isSafeForBootstrap(FBUtilities.getBroadcastAddress()))
{code}

I think it makes sense to look at the gossip states rather than {{TMD.isMember()}}. I am still unclear however on the empty, BOOTSTRAP and SHUTDOWN states. In your method the first is not allowed and the last two are both allowed.
According to CASSANDRA-7939 BOOTSTRAP should be allowed but not SHUTDOWN? 

If you clarify which states are OK to join the cluster I can rewrite the method, here is what I have so far:

{code}
public boolean isSafeForBootstrap(InetAddress endpoint)
    {
        EndpointState epState = endpointStateMap.get(endpoint);
        
        // if there's no previous state, or the node was previously removed from the cluster, we're good
        if (epState == null || isDeadState(epState))
            return true;

       String state = getApplicationState(epState);

        // these states are not allowed to join the cluster
        List<String> states = new ArrayList<String>() {{
            add(""); // failed bootstrap but we did start gossiping <== is this allowed or NOT? (CASSANDRA-7939)
            add(VersionedValue.STATUS_NORMAL); // node is legit in the cluster or it was stopped with kill -9 <== definitely not allowed
            add(VersionedValue.STATUS_BOOTSTRAPPING); // failed bootstrap (e.g. kill -9 during bootstrap) <== is this allowed or NOT? (CASSANDRA-7939)
            add(VersionedValue.SHUTDOWN); }}; // node was shutdown but not removed <== not allowed also, right?
        return !states.contains(state);
    }
{code}

As for the tests, I was able to add 3 more tests so right now we have the following states:

* SHUTDOWN - node is shutdown gracefully, test name : _shutdown_wiped_node_cannot_join_test_
* NORMAL - node is shutdown with kill -9, test name : _killed_wiped_node_cannot_join_test_
* LEFT - node is decommissioned, test name : _decommissioned_wiped_node_can_join_test_
* BOOT - node is shutdown with kill -9 during bootstrap, test name : _failed_bootstap_wiped_node_can_join_test_

I think these are the essential tests, the only one missing is perhaps killing the node after it starts gossiping and before it enters the bootstrap, but I am not sure how easy it is as the python code must reliably kill the node at the right time (with the bootstrap I was able to extend it by increasing the amount of data streamed). 
 

> checkForEndpointCollision fails for legitimate collisions
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9765
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9765
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Richard Low
>            Assignee: Stefania
>             Fix For: 2.0.17
>
>
> Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision. Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address %s already exists', but with it the node happily enters joining state, potentially streaming from the wrong place and violating consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)