Posted to user@cassandra.apache.org by Shiwen Cheng <ch...@gmail.com> on 2015/03/26 19:31:13 UTC

Issue with removing a node and adding it back

Hi all,

I ran into an issue after removing a node and adding it back.
Here is how it happened:
(1) We have a four-node cluster running, but one of the nodes had a hard
disk failure.
Since we needed to replace the hard disk, I used *removenode* to
remove the failed node.
(2) A few days later, after the new hard disk was installed, I reinstalled
Cassandra on this node.
I checked the .yaml file and it is the same as on the other three nodes (the
only difference is the listen_address); the newly added node is not in the
seeds list.
I used the same Cassandra version as the other nodes, which is 2.0.5.
(3) With bootstrap set to true (the default), the new node "seems" to be
able to join the cluster.
But:
 (a) OpsCenter shows this node with an "*unknown datacenter*".
 (b) The status of this node in OpsCenter is shown as "*joining*".
 (c) One of the nodes started streaming data to the new node.
However, after a few hours there was no further streaming, and the data size
was not even close to that of the other nodes, so the streaming had clearly
not finished.
 (d) The node is still shown as "joining" with "unknown datacenter" in
OpsCenter. It has been in this state for more than 12 hours.
 (e) *nodetool status* on the other three machines doesn't show this newly
added node.

There are no exceptions in the log on the newly added node.
I tried reinstalling Cassandra, OpsCenter and datastax-agent many times,
but that did not solve it.
So I am stuck here.

Can anybody help? I would really appreciate it!

Thanks,
Shiwen

Re: Issue with removing a node and adding it back

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Mar 27, 2015 at 4:27 PM, Shiwen Cheng <ch...@gmail.com>
wrote:

> Thanks Robert!
> Yes, I tried what you said: I cleaned the data and re-bootstrapped. But it
> still failed, once after 600GB had been transferred and once after 1.1TB :(
>


1) figure out what is making your streams die (usually either flaky network
(AWS) or stop-the-world GC) and fix that
OR
2) try tuning streaming_socket_timeout_in_ms

=Rob
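
For option 1, one way to tell a stop-the-world GC from a flaky network is to
check whether long GC pauses show up in system.log around the time a stream
dies. Below is a minimal Python sketch of that check. The GCInspector line
shape it matches ("GC for <collector>: <N> ms ...") is an assumption about
what 2.0.x writes, the two sample lines are made up for illustration, and the
1000 ms threshold is arbitrary; compare against your own log before relying
on it.

```python
import re

# Assumed shape of a GCInspector log line; verify against your own system.log.
GC_LINE = re.compile(r"GC for (?P<collector>\w+): (?P<millis>\d+) ms")


def long_gc_pauses(log_lines, threshold_ms=1000):
    """Return (collector, millis, line) for each GC pause >= threshold_ms."""
    hits = []
    for line in log_lines:
        match = GC_LINE.search(line)
        if match is None:
            continue
        millis = int(match.group("millis"))
        if millis >= threshold_ms:
            hits.append((match.group("collector"), millis, line.rstrip()))
    return hits


# Made-up sample lines in the assumed format, not taken from a real node.
sample = [
    " INFO [ScheduledTasks:1] 2015-01-01 00:00:01,000 GCInspector.java "
    "(line 116) GC for ConcurrentMarkSweep: 12034 ms for 1 collections, "
    "7012345678 used; max is 8422162432",
    " INFO [ScheduledTasks:1] 2015-01-01 00:00:02,000 GCInspector.java "
    "(line 116) GC for ParNew: 231 ms for 1 collections, "
    "3012345678 used; max is 8422162432",
]

pauses = long_gc_pauses(sample)
for collector, millis, _ in pauses:
    print(collector, millis)
```

To use it on a real node, pass an open log file instead of the sample list,
on both the joining node and the node that is streaming to it. If no long
pauses line up with the failures, the network is the more likely suspect.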

Re: Issue with removing a node and adding it back

Posted by Shiwen Cheng <ch...@gmail.com>.

Thanks Robert!
Yes, I tried what you said: I cleaned the data and re-bootstrapped. But it
still failed, once after 600GB had been transferred and once after 1.1TB :(

I also see the following exceptions from time to time:
=====================
java.io.IOException: net.jpountz.lz4.LZ4Exception: Error decoding offset 15 of input buffer
        at org.apache.cassandra.io.compress.LZ4Compressor.uncompress(LZ4Compressor.java:89)
        at org.apache.cassandra.streaming.compress.CompressedInputStream.decompress(CompressedInputStream.java:108)
        at org.apache.cassandra.streaming.compress.CompressedInputStream.read(CompressedInputStream.java:86)
        at java.io.InputStream.read(InputStream.java:170)
        at java.io.InputStream.skip(InputStream.java:222)
        at org.apache.cassandra.streaming.StreamReader.drain(StreamReader.java:117)
        at org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:89)
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:47)
        at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:37)
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:283)
        at java.lang.Thread.run(Thread.java:744)
=======================

And
=======================
CassandraDaemon.java (line 479) Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:975)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:736)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:583)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:482)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:345)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1160)
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
        at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
        at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
        at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:211)
        at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:186)
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:329)
        at org.apache.cassandra.streaming.StreamSession.convict(StreamSession.java:592)
        at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:236)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:623)
        at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:64)
        at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:170)
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:75)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
 INFO [StorageServiceShutdownHook] 2015-03-26 10:29:48,471 Gossiper.java (line 1251) Announcing shutdown
==========================

Is there anything else I could try?
Thanks!

Shiwen

On Thu, Mar 26, 2015 at 4:18 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Mar 26, 2015 at 11:31 AM, Shiwen Cheng <ch...@gmail.com>
> wrote:
>
>> I encountered an issue by removing and adding back a node.
>>
>
> You are encountering a failed/hung bootstrap, which probably has nothing
> to do with the node having been previously removenoded.
>
> Stop the node, wipe all the data on the node, including its system
> directory, and re-bootstrap.
>
> =Rob
>
>

Re: Issue with removing a node and adding it back

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Mar 26, 2015 at 11:31 AM, Shiwen Cheng <ch...@gmail.com>
wrote:

> I encountered an issue by removing and adding back a node.
>

You are encountering a failed/hung bootstrap, which probably has nothing to
do with the node having been previously removenoded.

Stop the node, wipe all the data on the node, including its system
directory, and re-bootstrap.

=Rob
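
Before wiping, it can help to list exactly what would be deleted. The Python
sketch below only prints a plan and removes nothing. The three paths are an
assumption (common package-install locations); the real ones are whatever
data_file_directories, commitlog_directory and saved_caches_directory are set
to in your cassandra.yaml. The function name wipe_plan is made up for this
example.

```python
from pathlib import Path

# Assumed locations; read the real ones from your cassandra.yaml.
ASSUMED_DIRS = [
    "/var/lib/cassandra/data",          # includes the "system" keyspace directory
    "/var/lib/cassandra/commitlog",
    "/var/lib/cassandra/saved_caches",
]


def wipe_plan(directories):
    """Return the entries that a wipe would delete, without deleting anything."""
    plan = []
    for directory in directories:
        root = Path(directory)
        if root.is_dir():
            # Keep each directory itself; only its contents would be removed.
            plan.extend(sorted(root.iterdir()))
    return plan


for entry in wipe_plan(ASSUMED_DIRS):
    print("would remove:", entry)
```

Run it only after the Cassandra process on that node has been stopped, and
check that the "system" directory appears in the output, since leaving it
behind keeps the old node state around.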