Posted to user@cassandra.apache.org by Arindam Barua <ab...@247-inc.com> on 2014/02/14 10:04:00 UTC

Bootstrap stuck: vnode enabled 1.2.12

After our otherwise successful upgrade procedure to enable vnodes, when adding back "new" hosts to our cluster, one non-seed host ran into a hardware issue during bootstrap. By the time the hardware issue was fixed a week later, all the other nodes had been added successfully, cleaned up, and repaired. The disks on this node were untouched, and when the node was started back up, it detected an interrupted bootstrap and attempted to bootstrap again. However, after ~24 hrs it was still stuck in the 'JOINING' state according to nodetool netstats on that node, even though no streams were flowing to or from it. Also, it did not appear in nodetool status in any form (not even as JOINING).

From a couple of observed thread dumps, the stack of the thread blocked during bootstrap is at [1].

Since the node wasn't making any progress, I ended up stopping Cassandra, cleaning up the data and commitlog directories, and attempting a fresh bootstrap. nodetool netstats immediately reported a whole bunch of streams queued up, and data started streaming to the node. The data directory quickly grew to 18 GB (the other nodes had ~25 GB, but we have a lot of data with low TTLs). However, the node ended up in the earlier reported state, i.e. nodetool netstats doesn't have anything queued but still reports the JOINING state, even though it's been > 24 hrs. There are no other ERRORS in the logs, and new data being written to the cluster makes it to this node just fine, triggering compactions, etc., from time to time.

Any help is appreciated.

Thanks,
Arindam

[1] Thread dump
Thread 3708: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=156 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=811 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(int) @bci=55, line=969 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(int) @bci=24, line=1281 (Interpreted frame)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.apache.cassandra.dht.RangeStreamer.fetch() @bci=209, line=256 (Interpreted frame)
 - org.apache.cassandra.dht.BootStrapper.bootstrap() @bci=120, line=84 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.bootstrap(java.util.Collection) @bci=172, line=978 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.joinTokenRing(int) @bci=827, line=744 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer(int) @bci=363, line=585 (Interpreted frame)
 - org.apache.cassandra.service.StorageService.initServer() @bci=4, line=482 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.setup() @bci=1069, line=348 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.activate() @bci=59, line=447 (Interpreted frame)
 - org.apache.cassandra.service.CassandraDaemon.main(java.lang.String[]) @bci=3, line=490 (Interpreted frame)
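
For anyone comparing several dumps like the one above, a small script can pull out the innermost Cassandra frame of the blocked thread. This is only a rough sketch: the function names (frames, first_frame_in) and the regular expression are made up for illustration, the sample is a shortened copy of the dump with wrapped lines rejoined, and it assumes each frame sits on a single line starting with " - ".

```python
import re

# Shortened copy of the dump above, with wrapped lines rejoined (assumption:
# one frame per line, each starting with " - ").
DUMP = """\
Thread 3708: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.CountDownLatch.await() @bci=5, line=207 (Interpreted frame)
 - org.apache.cassandra.dht.RangeStreamer.fetch() @bci=209, line=256 (Interpreted frame)
 - org.apache.cassandra.dht.BootStrapper.bootstrap() @bci=120, line=84 (Interpreted frame)
"""

# Matches " - fully.qualified.Class.method(" and captures the method name.
FRAME = re.compile(r"^\s*-\s+(?P<method>[\w.$]+)\(")


def frames(dump):
    """Return fully qualified method names, innermost frame first."""
    found = []
    for line in dump.splitlines():
        match = FRAME.match(line)
        if match:
            found.append(match.group("method"))
    return found


def first_frame_in(dump, package_prefix):
    """Return the innermost frame whose name starts with package_prefix, or None."""
    for method in frames(dump):
        if method.startswith(package_prefix):
            return method
    return None


if __name__ == "__main__":
    print(first_frame_in(DUMP, "org.apache.cassandra."))
```

On the dump above this points at RangeStreamer.fetch(), which is sitting in CountDownLatch.await(), i.e. the bootstrap thread is waiting for something to count the latch down rather than doing work itself.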

RE: Bootstrap stuck: vnode enabled 1.2.12

Posted by Arindam Barua <ab...@247-inc.com>.

As an update: I finally got the node to join the ring.

Restarting all the nodes in the cluster, followed by a clean bootstrap of the node that was stuck, did the trick.

-Arindam

From: Arindam Barua [mailto:abarua@247-inc.com]
Sent: Monday, February 24, 2014 5:04 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrap stuck: vnode enabled 1.2.12

The host would not join the ring after more clean bootstrap attempts.

I noticed that nodetool netstats, even though it doesn't report any active streaming, constantly reports "Nothing streaming from" for 3 specific hosts in the ring.

$ nodetool netstats
xss =  -ea -d64 -javaagent:/usr/local/cassandra/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8043M -Xmx8043M -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
Mode: JOINING
Not sending any streams.
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX

Today, when I had to do some unrelated maintenance and attempted to drain the hosts mentioned above before restarting Cassandra, the drain would just hang. Other hosts in the ring did not have any issue.
Also, the original host that is stuck in the joining state logged the following:

[24/02/2014:15:49:42 PST] GossipTasks:1: ERROR AbstractStreamSession.java (line 110) Stream failed because /10.67.XXX.XXX died or was restarted/removed (streams may still be active in background, but further streams won't be started)
[24/02/2014:15:49:42 PST] GossipTasks:1:  WARN RangeStreamer.java (line 246) Streaming from /10.67.XXX.XXX failed


RE: Bootstrap stuck: vnode enabled 1.2.12

Posted by Arindam Barua <ab...@247-inc.com>.

The host would not join the ring after more clean bootstrap attempts.

I noticed that nodetool netstats, even though it doesn't report any active streaming, constantly reports "Nothing streaming from" for 3 specific hosts in the ring.

$ nodetool netstats
xss =  -ea -d64 -javaagent:/usr/local/cassandra/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8043M -Xmx8043M -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -Xss256k
Mode: JOINING
Not sending any streams.
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX
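
To watch for this state without eyeballing the output each time, something like the following could be run against captured nodetool netstats text. It is a sketch under assumptions: the function names (summarize_netstats, looks_stalled) are hypothetical, it only understands the line shapes that appear in the output above, and "stalled" here is just a heuristic meaning "Mode is JOINING and every remaining line is one of the idle messages".

```python
# Output copied from the run above (addresses are masked in the original).
SAMPLE = """\
Mode: JOINING
Not sending any streams.
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX
Nothing streaming from /10.67.XXX.XXX
"""

IDLE_PREFIX = "Nothing streaming from "


def summarize_netstats(text):
    """Split netstats text into the mode, idle source hosts, and any other lines."""
    summary = {"mode": None, "idle_sources": [], "other": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("xss ="):
            continue  # blank lines and the JVM options echo carry no stream info
        if line.startswith("Mode:"):
            summary["mode"] = line.split(":", 1)[1].strip()
        elif line.startswith(IDLE_PREFIX):
            summary["idle_sources"].append(line[len(IDLE_PREFIX):].lstrip("/"))
        elif line == "Not sending any streams.":
            continue
        else:
            summary["other"].append(line)  # anything else may be an active stream
    return summary


def looks_stalled(summary):
    """Heuristic (assumption): JOINING with only idle messages in the output."""
    return summary["mode"] == "JOINING" and not summary["other"]


if __name__ == "__main__":
    result = summarize_netstats(SAMPLE)
    print(result["mode"], len(result["idle_sources"]), looks_stalled(result))
```

A single idle reading proves nothing, since a bootstrap can be briefly idle between streams; the check is only meaningful if it keeps coming back true over a long period, as it did here for more than 24 hrs.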

Today, when I had to do some unrelated maintenance and attempted to drain the hosts mentioned above before restarting Cassandra, the drain would just hang. Other hosts in the ring did not have any issue.
Also, the original host that is stuck in the joining state logged the following:

[24/02/2014:15:49:42 PST] GossipTasks:1: ERROR AbstractStreamSession.java (line 110) Stream failed because /10.67.XXX.XXX died or was restarted/removed (streams may still be active in background, but further streams won't be started)
[24/02/2014:15:49:42 PST] GossipTasks:1:  WARN RangeStreamer.java (line 246) Streaming from /10.67.XXX.XXX failed


From: Arindam Barua [mailto:abarua@247-inc.com]
Sent: Tuesday, February 18, 2014 5:16 PM
To: user@cassandra.apache.org
Subject: RE: Bootstrap stuck: vnode enabled 1.2.12


I believe you are talking about CASSANDRA-6685, which was introduced in 1.2.15.

I'm trying to add a node to a production ring. I have added nodes previously just fine. However, this node had hardware issues during a previous bootstrap, and now even a clean bootstrap seems to be having problems. Does the ring somehow remember this node, and if so, can I make it forget about it? Decommission/removenode does not work on a node that hasn't yet bootstrapped.


RE: Bootstrap stuck: vnode enabled 1.2.12

Posted by Arindam Barua <ab...@247-inc.com>.

I believe you are talking about CASSANDRA-6685, which was introduced in 1.2.15.

I'm trying to add a node to a production ring. I have added nodes previously just fine. However, this node had hardware issues during a previous bootstrap, and now even a clean bootstrap seems to be having problems. Does the ring somehow remember this node, and if so, can I make it forget about it? Decommission/removenode does not work on a node that hasn't yet bootstrapped.

From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Tuesday, February 18, 2014 12:30 PM
To: user@cassandra.apache.org
Subject: Re: Bootstrap stuck: vnode enabled 1.2.12

There is a bug where a node without a schema cannot bootstrap. Do you have a schema?


Re: Bootstrap stuck: vnode enabled 1.2.12

Posted by Edward Capriolo <ed...@gmail.com>.

There is a bug where a node without a schema cannot bootstrap. Do you have a
schema?


On Tue, Feb 18, 2014 at 1:29 PM, Arindam Barua <ab...@247-inc.com> wrote:

>
>
> The node is still out of the ring. Any suggestions on how to get it in
> will be very helpful.

RE: Bootstrap stuck: vnode enabled 1.2.12

Posted by Arindam Barua <ab...@247-inc.com>.

The node is still out of the ring. Any suggestions on how to get it in will be very helpful.
