Posted to user@cassandra.apache.org by Artur Kronenberg <ar...@openmarket.com> on 2015/01/19 15:29:28 UTC

Nodetool removenode stuck

Hi,

we have had an issue with one of our nodes today:

1. Due to a misconfiguration, a node we were starting failed to 
bootstrap properly. It was shown as UN in the cluster but did not 
contain any data, so we shut it down to fix our configuration issue.

2. We figured we needed to remove the node from the cluster before being 
able to restart it cleanly and have it bootstrap automatically. We used 
"nodetool removenode UUID", which caused multiple nodes in our datacenter 
to be marked as DOWN for some reason (taken from the log) and a bunch of 
operations against our cluster to fail. The nodes have come up again and, 
other than a slight heart attack, we are fine.
However, the removenode operation is now stuck and won't continue.

Can anyone recommend how to proceed safely from here? The node is 
marked as DL in our cluster. I found 
https://issues.apache.org/jira/browse/CASSANDRA-6542 but there is no 
hint there on how to handle this properly.

Is it safe to use the force option here? We don't want to risk the 
cluster going down again for whatever reason.

Thank you!

Artur

Re: Nodetool removenode stuck

Posted by Eric Stevens <mi...@gmail.com>.

I've seen removenode hang indefinitely also (per CASSANDRA-6542).
Generally speaking, if a node is in good health and you want to take it out
of the cluster for whatever reason (including the one you mentioned),
nodetool decommission is a better choice.  Removenode is for when a node is
unrecoverably offline.  Especially in your scenario, where the node joined
the cluster without bootstrapping, decommission should have been fast: it
would have streamed what little data it knew about to the replicas it had
just taken over for, then exited the cluster gracefully.
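
As a rough sketch of that rule of thumb: the helper below only builds the
nodetool command line and does not run anything. The function name and its
arguments are made up for illustration; only the nodetool subcommands
(decommission, removenode) are real.

```python
from typing import List


def removal_command(node_is_reachable: bool, host_id: str) -> List[str]:
    """Return the nodetool argv for taking one node out of the ring.

    Hypothetical helper for illustration; it builds the command only.
    """
    if node_is_reachable:
        # Run this ON the leaving node: it streams its data to the
        # remaining replicas and then leaves the ring.
        return ["nodetool", "decommission"]
    # Run this from any live node, passing the dead node's host ID
    # (the UUID shown by `nodetool status`).
    return ["nodetool", "removenode", host_id]


if __name__ == "__main__":
    # "<host-id>" is a placeholder, not a real UUID.
    print(" ".join(removal_command(True, "<host-id>")))
    print(" ".join(removal_command(False, "<host-id>")))
```

Whether a node counts as "reachable" is left to the operator here; the sketch
makes no attempt to detect it.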

Theoretically you ought to be able to run removenode force, and that
shouldn't cause disruption in your cluster, but it would leave you
immediately due for a repair (which is why decommission is preferable).
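
For a removal that is already stuck, the order of operations might look
something like the sketch below. It is a dry run that only lists commands.
The function name and the host names are hypothetical, and running repair on
every remaining node is my assumption about what "due for a repair" means in
practice, so adjust it to your own repair routine.

```python
from typing import List


def forced_removal_plan(remaining_hosts: List[str]) -> List[List[str]]:
    """List the nodetool invocations for forcing a stuck removenode.

    Hypothetical helper: it returns the commands and does not execute them.
    """
    plan = [
        # Check what the pending removal is still waiting on.
        ["nodetool", "removenode", "status"],
        # Finish the removal without waiting for the remaining streams.
        ["nodetool", "removenode", "force"],
    ]
    # Assumption: repair each remaining node afterwards, since the forced
    # removal may leave replicas without all of their data.
    for host in remaining_hosts:
        plan.append(["nodetool", "-h", host, "repair"])
    return plan


if __name__ == "__main__":
    # "node1" and "node2" are placeholder host names.
    for argv in forced_removal_plan(["node1", "node2"]):
        print(" ".join(argv))
```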

It's surprising to me that removenode caused outages in the rest of your
cluster - I've not seen that personally.  Since you're already seeing
removenode cause troubles it shouldn't cause, it's tough to speculate on
whether variants on it will also be safe.
