Posted to user@cassandra.apache.org by Ryan Hadley <ry...@sgizmo.com> on 2011/09/14 15:54:02 UTC

Nodetool removetoken taking days to run.

Hi,

So, here's the backstory:

We were running Cassandra 0.7.4 and at one point in time had a node in the ring at 10.84.73.18. We removed this node from the ring successfully in 0.7.4. It stopped showing in the nodetool ring command. But occasionally we'd still get weird log entries about failing to write/read to IP 10.84.73.18.

We upgraded to Cassandra 0.8.4. Now, nodetool ring shows this old node:

10.84.73.18     datacenter1 rack1       Down   Leaving ?               6.71%   32695837177645752437561450928649262701      

So I started a nodetool removetoken on 32695837177645752437561450928649262701 last Friday. It's still going strong this morning, on day 5:

./bin/nodetool -h 10.84.73.47 -p 8080 removetoken status
RemovalStatus: Removing token (32695837177645752437561450928649262701). Waiting for replication confirmation from [/10.84.73.49,/10.84.73.48,/10.84.73.51].

Should I just be patient? Or is something really weird with this node?

Thanks-
ryan

Re: Nodetool removetoken taking days to run.

Posted by Brandon Williams <dr...@gmail.com>.
On Wed, Sep 14, 2011 at 4:25 PM, Ryan Hadley <ry...@sgizmo.com> wrote:
> Hi Brandon,
>
> Thanks for the reply. Quick question though:
>
> 1. We write all data to this ring with a TTL of 30 days
> 2. This node hasn't been in the ring for at least 90 days; it's probably been more like 120 days since it was in the ring.
>
> So, if I nodetool removetoken forced it, would I still have to be concerned about running a repair?

There have probably been some writes that thought that node was part
of the replica set, so you may still be missing a replica in that
regard.  If you're only holding the data for 30 days though, it might
not be worth the trouble of repairing; you could instead bet that not
all of the live replicas will die in the next month.
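
(If you did go the repair route later, it would just be the usual per-node run; the host and keyspace name below are placeholders for your own:)

# Hypothetical: anti-entropy repair on one of the remaining replicas,
# repeated for each node that held data for the removed range.
./bin/nodetool -h 10.84.73.49 -p 8080 repair YourKeyspace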

> Also, after this node is removed, I'm going to rebalance with nodetool move. Would that remove the repair requirement too?

If you intend to replace the node, it's better to bootstrap the new
node at the dead node's token minus one, and then do the removetoken
force.  This would actually obviate the need to repair (except for one
key; once the old token has been removed, you can move the new node
onto it), assuming that your consistency level was greater than ONE
for writes, or your clients always replayed any failures.  The same
assumption holds for moving to the old token as well.
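
Roughly, that sequence would look something like this (host names and the port are placeholders from your earlier command; the only real values are your dead token and its minus-one neighbour):

# 1. On the replacement node, before starting it, give it the dead node's
#    token minus one in cassandra.yaml and let it bootstrap:
#      initial_token: 32695837177645752437561450928649262700
#      auto_bootstrap: true
# 2. Once it has finished bootstrapping, force the stuck removal from any
#    live node:
./bin/nodetool -h 10.84.73.47 -p 8080 removetoken force
# 3. After the old token is fully gone, optionally move the new node onto it:
./bin/nodetool -h <new-node-ip> -p 8080 move 32695837177645752437561450928649262701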

-Brandon

Re: Nodetool removetoken taking days to run.

Posted by Ryan Hadley <ry...@sgizmo.com>.
On Sep 14, 2011, at 2:08 PM, Brandon Williams wrote:

> On Wed, Sep 14, 2011 at 8:54 AM, Ryan Hadley <ry...@sgizmo.com> wrote:
>> Hi,
>> 
>> So, here's the backstory:
>> 
>> We were running Cassandra 0.7.4 and at one point in time had a node in the ring at 10.84.73.18. We removed this node from the ring successfully in 0.7.4. It stopped showing in the nodetool ring command. But occasionally we'd still get weird log entries about failing to write/read to IP 10.84.73.18.
>> 
>> We upgraded to Cassandra 0.8.4. Now, nodetool ring shows this old node:
>> 
>> 10.84.73.18     datacenter1 rack1       Down   Leaving ?               6.71%   32695837177645752437561450928649262701
>> 
>> So I started a nodetool removetoken on 32695837177645752437561450928649262701 last Friday. It's still going strong this morning, on day 5:
>> 
>> ./bin/nodetool -h 10.84.73.47 -p 8080 removetoken status
>> RemovalStatus: Removing token (32695837177645752437561450928649262701). Waiting for replication confirmation from [/10.84.73.49,/10.84.73.48,/10.84.73.51].
>> 
>> Should I just be patient? Or is something really weird with this node?
> 
> 5 days seems excessive unless there is a very large amount of data per
> node.  I would check nodetool netstats, and if the streams don't look
> active, issue a 'removetoken force' against 10.84.73.47 and accept that
> you may need to run repair to restore the replica count.
> 
> -Brandon

Hi Brandon,

Thanks for the reply. Quick question though:

1. We write all data to this ring with a TTL of 30 days
2. This node hasn't been in the ring for at least 90 days; it's probably been more like 120 days since it was in the ring.

So, if I nodetool removetoken forced it, would I still have to be concerned about running a repair?

Also, after this node is removed, I'm going to rebalance with nodetool move. Would that remove the repair requirement too?

Thanks-
Ryan

Re: Nodetool removetoken taking days to run.

Posted by Brandon Williams <dr...@gmail.com>.
On Wed, Sep 14, 2011 at 8:54 AM, Ryan Hadley <ry...@sgizmo.com> wrote:
> Hi,
>
> So, here's the backstory:
>
> We were running Cassandra 0.7.4 and at one point in time had a node in the ring at 10.84.73.18. We removed this node from the ring successfully in 0.7.4. It stopped showing in the nodetool ring command. But occasionally we'd still get weird log entries about failing to write/read to IP 10.84.73.18.
>
> We upgraded to Cassandra 0.8.4. Now, nodetool ring shows this old node:
>
> 10.84.73.18     datacenter1 rack1       Down   Leaving ?               6.71%   32695837177645752437561450928649262701
>
> So I started a nodetool removetoken on 32695837177645752437561450928649262701 last Friday. It's still going strong this morning, on day 5:
>
> ./bin/nodetool -h 10.84.73.47 -p 8080 removetoken status
> RemovalStatus: Removing token (32695837177645752437561450928649262701). Waiting for replication confirmation from [/10.84.73.49,/10.84.73.48,/10.84.73.51].
>
> Should I just be patient? Or is something really weird with this node?

5 days seems excessive unless there is a very large amount of data per
node.  I would check nodetool netstats, and if the streams don't look
active, issue a 'removetoken force' against 10.84.73.47 and accept that
you may need to run repair to restore the replica count.
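
For reference, those two checks would be something along these lines (same host and port you used for removetoken status):

# See whether any streams are still active for the removal:
./bin/nodetool -h 10.84.73.47 -p 8080 netstats
# If nothing is streaming, force the removal to complete:
./bin/nodetool -h 10.84.73.47 -p 8080 removetoken force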

-Brandon