You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Karl Rieb <ka...@gmail.com> on 2015/03/15 16:04:57 UTC

bootstrap failure and strange gossiper state

I am also experiencing issues bootstrapping new nodes in my 2.0.10
Cassandra cluster.  The first attempt to bootstrap ALWAYS fails, followed
by a second bootstrap attempt that ALWAYS succeeds.

The first attempt at bootstrapping fails with:

 INFO [main] 2015-03-15 02:41:02,550 StorageService.java (line 966)
JOINING: Starting to bootstrap...
ERROR [main] 2015-03-15 02:41:02,872 CassandraDaemon.java (line 513)
Exception encountered during startup
java.lang.IllegalStateException: unable to find sufficient sources for
streaming range (7169067280919608187,7171404468239785904]
        at
org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:201)
        at
org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:125)
        at
org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:72)
        at
org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:994)
        at
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:797)
        at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
        at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
        at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
        at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
        at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)


After the failure, I

   1. stop the node,
   2. clear our the data/saved_caches/commitlog directories, and
   3. remove the node from all peers (usually by manually deleting the node
   from their peers table)
   4. restart the node to re-attempt bootstrap

The bootstrap always seems to work on this second attempt.


I tried comparing the logs from the failed bootstrap and the successful
one, and the main difference I see is that the failed bootstrap contains
many "unknown endpoint" lines:

 INFO [main] 2015-03-15 02:40:25,160 StorageService.java (line 966)
JOINING: waiting for ring information
 INFO [HANDSHAKE-/10.30.30.30] 2015-03-15 02:40:25,175
OutboundTcpConnection.java (line 386) Handshaking version with /10.30.30.30
 INFO [HANDSHAKE-/10.4.4.4] 2015-03-15 02:40:25,279
OutboundTcpConnection.java (line 386) Handshaking version with /10.4.4.4
 INFO [HANDSHAKE-/10.20.20.20] 2015-03-15 02:40:25,383
OutboundTcpConnection.java (line 386) Handshaking version with /10.20.20.20
 INFO [HANDSHAKE-/10.10.10.10] 2015-03-15 02:40:25,489
OutboundTcpConnection.java (line 386) Handshaking version with /10.10.10.10
 INFO [HANDSHAKE-/10.5.5.5] 2015-03-15 02:40:25,596
OutboundTcpConnection.java (line 386) Handshaking version with /10.5.5.5
 INFO [RequestResponseStage:3] 2015-03-15 02:40:25,700 Gossiper.java (line
876) InetAddress /10.2.2.2 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,701 FailureDetector.java (line
200) unknown endpoint /10.2.2.2
ERROR [MigrationStage:1] 2015-03-15 02:40:25,701 MigrationTask.java (line
55) Can't send migration request: node /10.2.2.2 is down.
 INFO [RequestResponseStage:4] 2015-03-15 02:40:25,716 Gossiper.java (line
876) InetAddress /10.1.1.1 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,716 FailureDetector.java (line
200) unknown endpoint /10.1.1.1
ERROR [MigrationStage:1] 2015-03-15 02:40:25,716 MigrationTask.java (line
55) Can't send migration request: node /10.1.1.1 is down.
 INFO [RequestResponseStage:1] 2015-03-15 02:40:25,719 Gossiper.java (line
876) InetAddress /10.3.3.3 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,720 FailureDetector.java (line
200) unknown endpoint /10.3.3.3
ERROR [MigrationStage:1] 2015-03-15 02:40:25,720 MigrationTask.java (line
55) Can't send migration request: node /10.3.3.3 is down.
 INFO [RequestResponseStage:2] 2015-03-15 02:40:25,739 Gossiper.java (line
876) InetAddress /10.4.4.4 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,739 FailureDetector.java (line
200) unknown endpoint /10.4.4.4
ERROR [MigrationStage:1] 2015-03-15 02:40:25,740 MigrationTask.java (line
55) Can't send migration request: node /10.4.4.4 is down.
 INFO [RequestResponseStage:3] 2015-03-15 02:40:25,742 Gossiper.java (line
876) InetAddress /10.30.30.30 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,743 FailureDetector.java (line
200) unknown endpoint /10.30.30.30
ERROR [MigrationStage:1] 2015-03-15 02:40:25,743 MigrationTask.java (line
55) Can't send migration request: node /10.30.30.30 is down.
 INFO [RequestResponseStage:4] 2015-03-15 02:40:25,747 Gossiper.java (line
876) InetAddress /10.20.20.20 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,747 FailureDetector.java (line
200) unknown endpoint /10.20.20.20
ERROR [MigrationStage:1] 2015-03-15 02:40:25,748 MigrationTask.java (line
55) Can't send migration request: node /10.20.20.20 is down.
 INFO [RequestResponseStage:1] 2015-03-15 02:40:25,823 Gossiper.java (line
876) InetAddress /10.5.5.5 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,824 FailureDetector.java (line
200) unknown endpoint /10.5.5.5
ERROR [MigrationStage:1] 2015-03-15 02:40:25,824 MigrationTask.java (line
55) Can't send migration request: node /10.5.5.5 is down.
 INFO [RequestResponseStage:2] 2015-03-15 02:40:25,825 Gossiper.java (line
876) InetAddress /10.10.10.10 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,825 FailureDetector.java (line
200) unknown endpoint /10.10.10.10
ERROR [MigrationStage:1] 2015-03-15 02:40:25,825 MigrationTask.java (line
55) Can't send migration request: node /10.10.10.10 is down.
 INFO [HANDSHAKE-/10.6.6.6] 2015-03-15 02:40:26,009
OutboundTcpConnection.java (line 386) Handshaking version with /10.6.6.6
 INFO [RequestResponseStage:3] 2015-03-15 02:40:26,430 Gossiper.java (line
876) InetAddress /10.6.6.6 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:26,430 FailureDetector.java (line
200) unknown endpoint /10.6.6.6
ERROR [MigrationStage:1] 2015-03-15 02:40:26,430 MigrationTask.java (line
55) Can't send migration request: node /10.6.6.6 is down.
 INFO [GossipStage:1] 2015-03-15 02:40:28,934 Gossiper.java (line 910) Node
/10.9.9.9 is now part of the cluster
 INFO [GossipStage:1] 2015-03-15 02:40:29,009 Gossiper.java (line 910) Node
/10.1.1.1 is now part of the cluster
 INFO [HANDSHAKE-/10.1.1.1] 2015-03-15 02:40:29,010
OutboundTcpConnection.java (line 386) Handshaking version with /10.1.1.1
 INFO [RequestResponseStage:4] 2015-03-15 02:40:29,018 Gossiper.java (line
876) InetAddress /10.1.1.1 is now UP
 INFO [GossipStage:1] 2015-03-15 02:40:29,084 Gossiper.java (line 910) Node
/10.8.8.8 is now part of the cluster
 INFO [HANDSHAKE-/10.8.8.8] 2015-03-15 02:40:29,086
OutboundTcpConnection.java (line 386) Handshaking version with /10.8.8.8
 INFO [HANDSHAKE-/10.1.1.1] 2015-03-15 02:40:29,087
OutboundTcpConnection.java (line 386) Handshaking version with /10.1.1.1
 INFO [RequestResponseStage:1] 2015-03-15 02:40:29,113 Gossiper.java (line
876) InetAddress /10.8.8.8 is now UP
 INFO [GossipStage:1] 2015-03-15 02:40:29,159 Gossiper.java (line 910) Node
/10.2.2.2 is now part of the cluster
 INFO [HANDSHAKE-/10.8.8.8] 2015-03-15 02:40:29,160
OutboundTcpConnection.java (line 386) Handshaking version with /10.8.8.8
 INFO [HANDSHAKE-/10.2.2.2] 2015-03-15 02:40:29,161
OutboundTcpConnection.java (line 386) Handshaking version with /10.2.2.2
 INFO [RequestResponseStage:2] 2015-03-15 02:40:29,166 Gossiper.java (line
876) InetAddress /10.2.2.2 is now UP
 INFO [GossipStage:1] 2015-03-15 02:40:29,213 Gossiper.java (line 910) Node
/10.4.4.4 is now part of the cluster
 INFO [HANDSHAKE-/10.2.2.2] 2015-03-15 02:40:29,227
OutboundTcpConnection.java (line 386) Handshaking version with /10.2.2.2
 INFO [HANDSHAKE-/10.4.4.4] 2015-03-15 02:40:29,243
OutboundTcpConnection.java (line 386) Handshaking version with /10.4.4.4
 INFO [RequestResponseStage:3] 2015-03-15 02:40:29,252 Gossiper.java (line
876) InetAddress /10.4.4.4 is now UP
 INFO [GossipStage:1] 2015-03-15 02:40:29,265 Gossiper.java (line 910) Node
/10.6.6.6 is now part of the cluster
 INFO [HANDSHAKE-/10.6.6.6] 2015-03-15 02:40:29,277
OutboundTcpConnection.java (line 386) Handshaking version with /10.6.6.6
 INFO [RequestResponseStage:4] 2015-03-15 02:40:29,283 Gossiper.java (line
876) InetAddress /10.6.6.6 is now UP
 INFO [HANDSHAKE-/10.4.4.4] 2015-03-15 02:40:29,304
OutboundTcpConnection.java (line 386) Handshaking version with /10.4.4.4
 INFO [GossipStage:1] 2015-03-15 02:40:29,314 Gossiper.java (line 910) Node
/10.3.3.3 is now part of the cluster
 INFO [HANDSHAKE-/10.6.6.6] 2015-03-15 02:40:29,342
OutboundTcpConnection.java (line 386) Handshaking version with /10.6.6.6
 INFO [HANDSHAKE-/10.3.3.3] 2015-03-15 02:40:29,342
OutboundTcpConnection.java (line 386) Handshaking version with /10.3.3.3
ERROR [MigrationStage:1] 2015-03-15 02:40:29,353 MigrationTask.java (line
55) Can't send migration request: node /10.3.3.3 is down.
 INFO [GossipStage:1] 2015-03-15 02:40:29,354 Gossiper.java (line 910) Node
/10.30.30.30 is now part of the cluster
 INFO [HANDSHAKE-/10.30.30.30] 2015-03-15 02:40:29,371
OutboundTcpConnection.java (line 386) Handshaking version with /10.30.30.30
 INFO [RequestResponseStage:1] 2015-03-15 02:40:29,379 Gossiper.java (line
876) InetAddress /10.30.30.30 is now UP
 INFO [GossipStage:1] 2015-03-15 02:40:29,387 Gossiper.java (line 910) Node
/10.20.20.20 is now part of the cluster
 INFO [HANDSHAKE-/10.20.20.20] 2015-03-15 02:40:29,394
OutboundTcpConnection.java (line 386) Handshaking version with /10.20.20.20
 INFO [HANDSHAKE-/10.30.30.30] 2015-03-15 02:40:29,395
OutboundTcpConnection.java (line 386) Handshaking version with /10.30.30.30
 INFO [RequestResponseStage:2] 2015-03-15 02:40:29,403 Gossiper.java (line
876) InetAddress /10.20.20.20 is now UP
 INFO [InternalResponseStage:1] 2015-03-15 02:40:29,407
ColumnFamilyStore.java (line 794) Enqueuing flush of
Memtable-schema_keyspaces@96619019(663/6630 serialized/live bytes, 16 ops)
...
 INFO [RequestResponseStage:3] 2015-03-15 02:40:38,647 Gossiper.java (line
876) InetAddress /10.3.3.3 is now UP
 INFO [RequestResponseStage:1] 2015-03-15 02:40:38,647 Gossiper.java (line
876) InetAddress /10.3.3.3 is now UP
 INFO [RequestResponseStage:4] 2015-03-15 02:40:38,647 Gossiper.java (line
876) InetAddress /10.3.3.3 is now UP
 INFO [RequestResponseStage:2] 2015-03-15 02:40:38,647 Gossiper.java (line
876) InetAddress /10.3.3.3 is now UP
 INFO [main] 2015-03-15 02:41:02,550 StorageService.java (line 966)
JOINING: Starting to bootstrap...
ERROR [main] 2015-03-15 02:41:02,872 CassandraDaemon.java (line 513)
Exception encountered during startup
java.lang.IllegalStateException: unable to find sufficient sources for
streaming range (7169067280919608187,7171404468239785904]


There appears to be a disconnect between what the Gossiper is logging (the
node is up and endpoint state is saved) and what the MigrationStage is
seeing.  I can't seem to pinpoint the issue.  The other nodes all show UN
(Up/Normal) for the ring during the new node's bootstrap.  Also this issue
is reproducible in our cluster every time we attempt to add a new node
(i.e. it is not intermittent or load dependent).

We never had this problem before when we were on 1.2.11.  Since then, we
have

   1. upgraded to 2.0.10,
   2. migrated our hardware to new i2.xlarge instances (using rsync and the
   replace address system property)
   3. shrunk our cluster down from 16 nodes to 9 (due to the much larger
   newer instances)

I have searched around for this issue, and only really see a couple of
references to it:

   - http://www.mail-archive.com/user%40cassandra.apache.org/msg34971.html
   - http://www.mail-archive.com/user%40cassandra.apache.org/msg38065.html

Any help would be appreciated,
Karl