Posted to commits@cassandra.apache.org by "Jay Zhuang (JIRA)" <ji...@apache.org> on 2016/12/28 02:14:58 UTC

[jira] [Commented] (CASSANDRA-12172) Fail to bootstrap new node.

    [ https://issues.apache.org/jira/browse/CASSANDRA-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781814#comment-15781814 ] 

Jay Zhuang commented on CASSANDRA-12172:
----------------------------------------

We saw a similar issue, but only when bootstrapping is interrupted. For a brand new node, bootstrap works fine. But if it is interrupted for any reason and the node is restarted, we see this error: {{A node required to move the data consistently is down (/IP)}}. It can be reproduced by killing a {{UJ}} node and restarting it.

For us, the workaround is either to delete the data and bootstrap again, or to increase [{{ring_delay_ms}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L122]. The larger the cluster, the longer {{ring_delay_ms}} needs to be. Based on our tests, a 40-node cluster requires {{ring_delay_ms}} to be more than 50 seconds, and a 70-node cluster more than 100 seconds. The default is 30 seconds.
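
To make the second workaround concrete, here is a minimal sketch of how such an override could be read, with the 30-second default as the fallback. It assumes the delay comes from a JVM system property named {{cassandra.ring_delay_ms}} (for example {{-Dcassandra.ring_delay_ms=60000}}), which is my understanding of the linked code. The class and method names are hypothetical and this is not the actual Cassandra source.

```java
import java.util.Properties;

public final class RingDelayProperty {
    // Default reported in this comment: 30 seconds.
    static final int DEFAULT_RING_DELAY_MS = 30 * 1000;

    /**
     * Returns the ring delay in milliseconds: the override if the
     * (assumed) property "cassandra.ring_delay_ms" is set, else the default.
     */
    public static int ringDelayMs(Properties props) {
        String override = props.getProperty("cassandra.ring_delay_ms");
        return override != null
                ? Integer.parseInt(override.trim())
                : DEFAULT_RING_DELAY_MS;
    }

    public static void main(String[] args) {
        // Uses the JVM's own system properties, so -D flags take effect.
        System.out.println("ring delay = " + ringDelayMs(System.getProperties()) + " ms");
    }
}
```

Passing the properties in as an argument keeps the lookup easy to exercise without touching global JVM state.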

I guess the problem may be that when [{{addSavedEndpoint}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1396] is called, the initial status of each saved endpoint is marked as [{{dead}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1416], and in a large cluster it takes time to mark all nodes as live again. This is especially so when the messagingService version is after [{{VERSION_20}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L961], in which case [{{sendRR()}}|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/gms/Gossiper.java#L984] is used to check.

A simple fix would be to set {{ring_delay_ms}} based on the number of nodes.
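
As a rough illustration of what scaling the delay with cluster size might look like, the sketch below draws a straight line through the two observations reported above (40 nodes needing about 50 seconds, 70 nodes about 100 seconds) and never goes below the 30-second default. The formula, class name, and method name are all hypothetical. It is fitted to only two data points, both of which were lower bounds, so treat it as an illustration and not a tested recommendation.

```java
public final class RingDelayEstimate {
    // Default ring delay mentioned in this comment: 30 seconds.
    static final long DEFAULT_RING_DELAY_MS = 30_000L;

    /**
     * Hypothetical heuristic: linear fit through (40 nodes, 50 s) and
     * (70 nodes, 100 s), clamped so it never drops below the default.
     */
    public static long estimateRingDelayMs(int nodeCount) {
        if (nodeCount < 0) {
            throw new IllegalArgumentException("nodeCount must be non-negative");
        }
        // Slope is 50 s per 30 nodes, i.e. 5000/3 ms per node.
        long linearMs = 50_000L + (nodeCount - 40) * 5_000L / 3;
        return Math.max(DEFAULT_RING_DELAY_MS, linearMs);
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 40, 70, 100}) {
            System.out.println(n + " nodes -> " + estimateRingDelayMs(n) + " ms");
        }
    }
}
```

Since our measured thresholds were "more than" 50 and 100 seconds, a real fix would probably want some headroom on top of whatever curve is chosen.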

> Fail to bootstrap new node.
> ---------------------------
>
>                 Key: CASSANDRA-12172
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12172
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Dikang Gu
>
> When I try to bootstrap a new node in the cluster, it sometimes fails with the following exception.
> {code}
> 2016-07-12_05:14:55.58509 INFO  05:14:55 [main]: JOINING: Starting to bootstrap...
> 2016-07-12_05:14:56.07491 INFO  05:14:56 [GossipTasks:1]: InetAddress /2401:db00:2011:50c7:face:0:9:0 is now DOWN
> 2016-07-12_05:14:56.32219 Exception (java.lang.RuntimeException) encountered during startup: A node required to move the data consistently is down (/2401:db00:2011:50c7:face:0:9:0). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32582 ERROR 05:14:56 [main]: Exception encountered during startup
> 2016-07-12_05:14:56.32583 java.lang.RuntimeException: A node required to move the data consistently is down (/2401:db00:2011:50c7:face:0:9:0). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
> 2016-07-12_05:14:56.32584       at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:264) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:147) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:82) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1230) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32584       at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:924) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at org.apache.cassandra.service.StorageService.initServer(StorageService.java:709) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32585       at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586       at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32586       at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.5+git20160315.c29948b.jar:2.2.5+git20160315.c29948b]
> 2016-07-12_05:14:56.32730 WARN  05:14:56 [StorageServiceShutdownHook]: No local state or state is in silent shutdown, not announcing shutdown
> {code}
> Here are more logs: https://gist.github.com/DikangGu/c6a83eafdbc091250eade4a3bddcc40b
> I'm pretty sure there are no DOWN or restarted nodes in the cluster, but I still see a lot of nodes going UP and DOWN in the gossip log, which causes the bootstrap to fail in the end. Is this a known bug?


