Posted to commits@cassandra.apache.org by "David Capwell (Jira)" <ji...@apache.org> on 2020/11/13 18:31:00 UTC

[jira] [Comment Edited] (CASSANDRA-16213) Cannot replace_address /X because it doesn't exist in gossip

    [ https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231736#comment-17231736 ] 

David Capwell edited comment on CASSANDRA-16213 at 11/13/20, 6:30 PM:
----------------------------------------------------------------------

Found the issue: it was introduced by CASSANDRA-15158, which defines a config value in milliseconds and passes it to a delay that takes milliseconds, but converts the millis as if they were seconds, causing a much longer delay than expected.

After fixing that, I hit the next issue: we now block waiting on schema, which will fail since the cluster has a downed node.
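
To illustrate the kind of unit mix-up described above, here is a minimal sketch. It is not the actual code from CASSANDRA-15158; the class name, method names, and the 30-second value are all hypothetical.

```java
import java.util.concurrent.TimeUnit;

class DelayUnits {
    // Hypothetical config value, already expressed in milliseconds.
    static final long DELAY_MILLIS = 30_000L;

    // Buggy: treats a millisecond value as if it were seconds before converting,
    // so the resulting delay is 1000x longer than intended.
    static long buggyDelayMillis(long configuredMillis) {
        return TimeUnit.SECONDS.toMillis(configuredMillis);
    }

    // Fixed: the value is already in milliseconds, so the conversion is a no-op.
    static long fixedDelayMillis(long configuredMillis) {
        return TimeUnit.MILLISECONDS.toMillis(configuredMillis);
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + buggyDelayMillis(DELAY_MILLIS) + " ms");
        System.out.println("fixed: " + fixedDelayMillis(DELAY_MILLIS) + " ms");
    }
}
```

With these assumed numbers, a 30 second delay becomes 30,000,000 ms (over 8 hours), which would look like a hang rather than a delay.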

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

{code}
boolean schemasReceived = MigrationCoordinator.instance.awaitSchemaRequests(SCHEMA_DELAY_MILLIS);

if (schemasReceived)
    return;

logger.warn(String.format("There are nodes in the cluster with a different schema version than us we did not merged schemas from, " +
                          "our version : (%s), outstanding versions -> endpoints : %s",
                          Schema.instance.getVersion(),
                          MigrationCoordinator.instance.outstandingVersions()));

if (REQUIRE_SCHEMAS)
    throw new RuntimeException("Didn't receive schemas for all known versions within the timeout");
{code}

When we get the gossip info from the peers, it includes node2 (the node that crashed abruptly), so we wait until we receive its schema; this never happens since node2 is down and is the node we are replacing.

This looks unrelated to this patch, but it is also a bad condition, as any schema change with a downed node will cause nodes to fail to start up...
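
The failure mode can be sketched in isolation as follows. This is a stand-in, not MigrationCoordinator: only the name awaitSchemaRequests comes from the snippet above, and the latch-based implementation, the simulate helper, and the node counts are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SchemaWaitSketch {
    // Stand-in for the wait: true only if every outstanding schema request
    // completed before the timeout elapsed.
    static boolean awaitSchemaRequests(CountDownLatch outstanding, long timeoutMillis) {
        try {
            return outstanding.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Hypothetical simulation: one request per endpoint reported through gossip.
    // Live nodes answer immediately; down nodes never answer.
    static boolean simulate(int liveNodes, int downNodes, long timeoutMillis) {
        CountDownLatch outstanding = new CountDownLatch(liveNodes + downNodes);
        for (int i = 0; i < liveNodes; i++)
            outstanding.countDown();
        return awaitSchemaRequests(outstanding, timeoutMillis);
    }

    public static void main(String[] args) {
        System.out.println("all nodes up:    " + simulate(2, 0, 50L));
        System.out.println("one node down:   " + simulate(2, 1, 50L));
    }
}
```

In this model the wait can only succeed if every endpoint that reported a schema version eventually responds, so a single downed endpoint is enough to exhaust the timeout.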



> Cannot replace_address /X because it doesn't exist in gossip
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-16213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> We see this exception when nodes crash and we try to do a host replacement; the error appears to be correlated with multiple node failures.
> A simplified case to trigger this is the following:
> *) Have an N-node cluster
> *) Shut down all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and the existing nodes do not have its details in gossip (but have the details in the peers table), so the host replacement fails as the node isn’t known in gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)
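
The mismatch described in the issue (node present in the peers table but absent from gossip) can be modelled with a short sketch. The class, the canReplace helper, and the addresses are hypothetical; only the error text is taken from the issue title.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ReplaceCheckSketch {
    // Hypothetical check: a replacement is only accepted when the target
    // address is present in the live gossip state.
    static boolean canReplace(String address, Set<String> gossipEndpoints) {
        return gossipEndpoints.contains(address);
    }

    public static void main(String[] args) {
        // After a full cluster restart, the node that stayed down is still in
        // the persisted peers table but never re-entered gossip.
        Set<String> peersTable = new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        Set<String> gossip = new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2"));
        String target = "10.0.0.3";

        System.out.println("in peers table: " + peersTable.contains(target));
        if (!canReplace(target, gossip))
            System.out.println("Cannot replace_address /" + target + " because it doesn't exist in gossip");
    }
}
```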


