Posted to user@cassandra.apache.org by Bryce Godfrey <Br...@azaleos.com> on 2011/08/19 23:48:38 UTC

Completely removing a node from the cluster

I'm on 0.8.4

I have removed a dead node from the cluster using the nodetool removetoken command, and moved one of the remaining nodes to rebalance the tokens.  Everything looks fine when I run nodetool ring now: it lists only the remaining 2 nodes, each owning 50% of the ring.
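
For reference, the commands were roughly as follows (run against live nodes; the dead node's token is left as a placeholder, and the move target is the balanced two-node token under RandomPartitioner, 2**127 / 2, which is the value that shows up in the ring output later in the thread):

    # Remove the dead node's token from the ring (run against a live node)
    nodetool -h 192.168.20.2 removetoken <token-of-dead-node>

    # Rebalance the survivors: move the second node to 2**127 / 2
    nodetool -h 192.168.20.3 move 85070591730234615865843651857942052864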

However, the Cassandra CLI still shows it as part of the cluster (192.168.20.1 being the removed node), and I'm worried that the cluster is still queuing up hints for the node, or that it may cause other issues:

Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
        dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
        UNREACHABLE: [192.168.20.1]
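
(For reference, that output comes from the CLI's describe cluster command; a minimal session looks roughly like this, with the host and the default Thrift port 9160 assumed:)

    $ cassandra-cli -h 192.168.20.2 -p 9160
    [default@unknown] describe cluster;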


Do I need to do something else to completely remove this node?

Thanks,
Bryce

Re: Completely removing a node from the cluster

Posted by aaron morton <aa...@thelastpickle.com>.
I normally link to the DataStax article to avoid having to actually write those words :)

http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes
A
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:45 PM, Jonathan Colby wrote:

> [...]


Re: Completely removing a node from the cluster

Posted by Jonathan Colby <jo...@gmail.com>.
I ran into this.  I also tried load_ring_state=false, which did not help.   The way I got through this was to stop the entire cluster and start the nodes one by one.

I realize this is not a practical solution for everyone, but if you can afford to stop the cluster for a few minutes, it's worth a try.
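
A sketch of that procedure, assuming a packaged install with service scripts (hosts, service name and settle time are all illustrative):

    # Stop Cassandra on every node first
    for h in 192.168.20.2 192.168.20.3; do
      ssh "$h" 'sudo service cassandra stop'
    done

    # Then start the nodes back one at a time, letting gossip settle in between
    for h in 192.168.20.2 192.168.20.3; do
      ssh "$h" 'sudo service cassandra start'
      sleep 60
    done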


On Aug 23, 2011, at 9:26 AM, aaron morton wrote:

> [...]


RE: Completely removing a node from the cluster

Posted by Bryce Godfrey <Br...@azaleos.com>.
Taking the cluster down completely did remove the phantom node.  The HintsColumnFamily is still causing a lot of commit logs to back up, threatening to run the commit log drive out of space.  A manual flush of that column family always clears out the files, though.
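
The flush itself is one line of nodetool; in 0.8 the hints live in the system keyspace (host illustrative):

    # Flush the hints CF so its commit log segments can be recycled
    nodetool -h 192.168.20.2 flush system HintsColumnFamily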


-----Original Message-----
From: Brandon Williams [mailto:driftx@gmail.com] 
Sent: Tuesday, August 23, 2011 10:42 AM
To: user@cassandra.apache.org
Subject: Re: Completely removing a node from the cluster

[...]

Re: Completely removing a node from the cluster

Posted by Brandon Williams <dr...@gmail.com>.
On Tue, Aug 23, 2011 at 2:26 AM, aaron morton <aa...@thelastpickle.com> wrote:
> [...]

I think I found it in https://issues.apache.org/jira/browse/CASSANDRA-3071

--Brandon

Re: Completely removing a node from the cluster

Posted by aaron morton <aa...@thelastpickle.com>.
I'm running low on ideas for this one. Anyone else?

If the phantom node is not listed in the ring, other nodes should not be storing hints for it. You can see what nodes they are storing hints for via JConsole. 

You can try a rolling restart, passing the JVM opt -Dcassandra.load_ring_state=false. However, if the phantom node is being passed around in the gossip state it will probably just come back again.
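
One way to pass that opt during the rolling restart is through conf/cassandra-env.sh (standard for a 0.8 install; take the line out again once the node is back up):

    # conf/cassandra-env.sh - temporary, remove after the node has restarted
    JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"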

Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:

> [...]


RE: Completely removing a node from the cluster

Posted by Bryce Godfrey <Br...@azaleos.com>.
Could this ghost node be causing my hints column family to grow to this size?  I also crash after about 24 hours because commit log growth takes up all the drive space.  A manual nodetool flush keeps it under control, though.


                Column Family: HintsColumnFamily
                SSTable count: 6
                Space used (live): 666480352
                Space used (total): 666480352
                Number of Keys (estimate): 768
                Memtable Columns Count: 1043
                Memtable Data Size: 461773
                Memtable Switch Count: 3
                Read Count: 38
                Read Latency: 131.289 ms.
                Write Count: 582108
                Write Latency: 0.019 ms.
                Pending Tasks: 0
                Key cache capacity: 7
                Key cache size: 6
                Key cache hit rate: 0.8333333333333334
                Row cache: disabled
                Compacted row minimum size: 2816160
                Compacted row maximum size: 386857368
                Compacted row mean size: 120432714
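
(The figures above are from nodetool cfstats; to watch just this column family, something like the following works, with the grep window approximate:)

    nodetool -h 192.168.20.2 cfstats | grep -A 20 'Column Family: HintsColumnFamily'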

Is there a way for me to manually remove this dead node?
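
It may also be worth checking whether the original removal ever fully completed; nodetool has status/force subcommands for that (present in 0.8, as far as I know):

    # Is a token removal still in progress?
    nodetool -h 192.168.20.2 removetoken status

    # Force a stuck removal to finish (skips waiting on replication)
    nodetool -h 192.168.20.2 removetoken force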

-----Original Message-----
From: Bryce Godfrey [mailto:Bryce.Godfrey@azaleos.com] 
Sent: Sunday, August 21, 2011 9:09 PM
To: user@cassandra.apache.org
Subject: RE: Completely removing a node from the cluster

[...]


RE: Completely removing a node from the cluster

Posted by Bryce Godfrey <Br...@azaleos.com>.
It's been at least 4 days now.

-----Original Message-----
From: aaron morton [mailto:aaron@thelastpickle.com] 
Sent: Sunday, August 21, 2011 3:16 PM
To: user@cassandra.apache.org
Subject: Re: Completely removing a node from the cluster

[...]


Re: Completely removing a node from the cluster

Posted by aaron morton <aa...@thelastpickle.com>.
I see the mistake I made about ring: it gets the endpoint list from the same place but uses the tokens to drive the whole process. 

I'm guessing here; I don't have time to check all the code. But there is a 3-day timeout in the gossip system. Not sure if it applies in this case. 

Anyone know?
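
The 3-day figure matches the gossip expiry constant in the source, if anyone wants to check (path per a 0.8 source tree; the constant name is from memory):

    # aVeryLongTime is the gossip state expiry: 259200 s = 3 days
    grep -n 'aVeryLongTime' src/java/org/apache/cassandra/gms/Gossiper.java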

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:

> [...]


RE: Completely removing a node from the cluster

Posted by Bryce Godfrey <Br...@azaleos.com>.
Both .2 and .3 report the same thing from the MBean: UnreachableNodes is an empty collection, and LiveNodes still lists all 3 nodes:
192.168.20.2
192.168.20.3
192.168.20.1

The removetoken was done a few days ago, and I believe the remove was run from .2.

Here is what the ring output looks like; not sure why I get that token on the otherwise empty first line either:
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               85070591730234615865843651857942052864
192.168.20.2    datacenter1 rack1       Up     Normal  79.53 GB       50.00%  0
192.168.20.3    datacenter1 rack1       Up     Normal  42.63 GB       50.00%  85070591730234615865843651857942052864

Yes, both nodes show the same thing when doing a describe cluster, that .1 is unreachable.


-----Original Message-----
From: aaron morton [mailto:aaron@thelastpickle.com] 
Sent: Sunday, August 21, 2011 4:23 AM
To: user@cassandra.apache.org
Subject: Re: Completely removing a node from the cluster

[...]


Re: Completely removing a node from the cluster

Posted by aaron morton <aa...@thelastpickle.com>.
Unreachable nodes either did not respond to the message or were known to be down and were not sent a message.
The node lists for the ring command and describe cluster are obtained the same way, so it's a bit odd.

Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? What do the LiveNodes and UnreachableNodes attributes say?
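
If JConsole is awkward to reach, a scriptable alternative is the third-party jmxterm CLI (jar name, flags and the 0.8 default JMX port 7199 are all assumptions here; repeat with UnreachableNodes):

    # Dump the live node list over JMX, non-interactively
    echo 'get -b org.apache.cassandra.db:type=StorageService LiveNodes' \
      | java -jar jmxterm-uber.jar -l 192.168.20.2:7199 -n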

Also, how long ago did you remove the token, and on which machine? Do both 20.2 and 20.3 think 20.1 is still around?

Cheers


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:

> [...]