You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Constance Eustace (JIRA)" <ji...@apache.org> on 2019/06/12 21:01:00 UTC

[jira] [Commented] (CASSANDRA-15155) Bootstrapping node finishes 'successfully' before schema synced, data not streamed

    [ https://issues.apache.org/jira/browse/CASSANDRA-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862458#comment-16862458 ] 

Constance Eustace commented on CASSANDRA-15155:
-----------------------------------------------

Seems very similar to https://issues.apache.org/jira/browse/CASSANDRA-6648 from a long long time ago.

> Bootstrapping node finishes 'successfully' before schema synced, data not streamed
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15155
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Membership, Consistency/Bootstrap and Decommission
>            Reporter: Constance Eustace
>            Priority: Normal
>         Attachments: debug.log.zip
>
>
> Bootstrap of a new node to expand an existing cluster is completing the bootstrapping "successfully", joining the cluster as UN in nodetool status, but has no data and no active streams. Writes and reads start being served. 
> Environment: AWS, IPV6, three datacenters, asia / europe / us
> Version: 2.2.13
> We have previously scaled the europe and us datacenters from 5 nodes to 25 nodes (one node at a time) without incident.
> Asia (tokyo) is a different story. We have seen multiple failure scenarios, but the most troubling is a string of attempted node bootstrappings where the bootstrap completes and the node joins the ring, but there is no data. 
> We were able to expand Asia by four nodes by increasing ring delay to 100 seconds, but that has not worked anymore. 
> Attached Log: Our autoscaling + Ansible initial setup creates the node, but the ansible has not run yet, so the autostarted cassandra fails to load, but it has no security group yet so it did not communicate with any other node. That is the 15:15:XX series log messages at the very top.
> Then 15:20:XX series messages begin after ansible has completed setup of the box, and the data dirs and commit log dirs have been scrubbed. 
> This same process ran for EU and US expansions without incident. 
> From what I can tell of the log (DEBUG was enabled):
> Ring information collection begins, so some sort of gossip/cluster communication is healthy:
> INFO [main] 2019-06-12 15:20:05,468 StorageService.java:1142 - JOINING: waiting for ring information
> But almost all of those checks output:
> DEBUG [GossipStage:1] 2019-06-12 15:20:07,673 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
> Which seems weird, as we shall see soon. After all the nodes have reported in a similar way, most not pulling because of the above message, and a couple that were interpreted as DOWN, it then does:
> INFO [main] 2019-06-12 15:21:45,486 StorageService.java:1142 - JOINING: schema complete, ready to bootstrap
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: waiting for pending range calculation
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: calculation complete, ready to bootstrap
> We then get a huge number of
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
> INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: Starting to bootstrap...
> INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap
> INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap completed! for the tokens [-7314981925085449175, ...<tokens>... 5499447097629838103]
> Here are the log messages for MIgrationManager for schema gossiping:
> DEBUG [main] 2019-06-12 15:20:05,423 MigrationManager.java:493 - Gossiping my schema version 59adb24e-f3cd-3e02-97f0-5b395827453f
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,694 MigrationManager.java:493 - Gossiping my schema version 3d1a9d9e-1120-37ae-abe0-e064cd147a99
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,775 MigrationManager.java:493 - Gossiping my schema version 0bf74f5a-4b39-3525-b217-e9ccf7a1b6cb
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,905 MigrationManager.java:493 - Gossiping my schema version b145475a-02dc-370c-8af7-a9aba2d61362
> DEBUG [InternalResponseStage:12] 2019-06-12 15:24:26,445 MigrationManager.java:493 - Gossiping my schema version 9c2ed14a-8db5-39b3-af48-6cdb5463c772
> the schema version ending in -6cdb5463c772 is the proper version in the other nodes per gossipinfo. But as can be seen, the bootstrap completion message (15:23:25,526) is logged before four or five intermediate schema versions are created, which seem to be due to system_distributed and other keyspaces being created.
> The Bootstrap completed! message comes from a Future in StorageService line 1199 for cassandra 2.2.13 source:
> {code:java}
> ListenableFuture<StreamState> bootstrapStream = bootstrapper.bootstrap(streamStateStore, !replacing && useStrictConsistency); // handles token update Futures.addCallback(bootstrapStream, new FutureCallback<StreamState>() { @Override public void onSuccess(StreamState streamState) { isBootstrapMode = false; logger.info("Bootstrap completed! for the tokens {}", tokens); }
> {code}
> That future seems to complete a bit too quickly. Perhaps the schema did not synchronize quickly enough and therefore a list of migration/bootstrap tasks not created in time for the bootstrapper future to handle. We suspect the AWS asia datacenter networking/WAN links are very slow, or IPV6 is introducing other delays. 
> But the code should not allow a bootstrap to silently/falsely complete "successfully". 
> Smells like a race condition. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org