
Posted to dev@cassandra.apache.org by Sebastian Estevez <se...@datastax.com> on 2015/09/21 20:11:48 UTC

Re: Unable to remove dead node from cluster.

Order is decommission, remove, assassinate.

Which have you tried?
On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:

> Hi there,
>
> I have a dead node in our cluster, which is in a weird state right now and
> cannot be removed from the cluster.
>
> The nodestatus shows:
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address                          Load       Tokens  Owns    Host ID                               Rack
> DN  10.210.165.55                    ?          256     ?       null                                  r1
>
> I tried the unsafeAssassinateEndpoint, but got exception like:
> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread Thread[GossipStage:1,5,main]
> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
> 2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80670       at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80672       at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80673       at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
> 2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_45]
> 2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_45]
> 2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to local pause of 10852378435 > 5000000000
>
> Any suggestions about how to remove it?
> Thanks.
>
> --
> Dikang
>
>

Re: Unable to remove dead node from cluster.

Posted by Jeff Jirsa <je...@crowdstrike.com>.

Apparently this was reported back in May: https://issues.apache.org/jira/browse/CASSANDRA-9510

- Jeff

From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, September 25, 2015 at 11:31 AM
To:  cassandra
Subject:  Re: Unable to remove dead node from cluster.

The NPE is thrown when the node runs handleStateLeft, because it cannot find the tokens associated with the dead node. Can we just ignore the NPE and continue to remove the endpoint from the ring?

On Fri, Sep 25, 2015 at 10:52 AM, Dikang Gu <di...@gmail.com> wrote:
@Jeff, yeah, I ran the nodetool status | egrep count, and in my case some nodes return "301" and some nodes return "300". 300 is the correct number of nodes in my cluster.

So it does look like an inconsistency issue. Can you open a Jira for this? Also, I'm looking for a quick fix/patch for this.

On Fri, Sep 25, 2015 at 7:43 AM, Nate McCall <na...@thelastpickle.com> wrote:
A few other folks have reported issues with lingering dead nodes on large clusters - Jason Brown *just* gave an excellent gossip presentation at the summit regarding gossip optimizations for large clusters. 

Gossip is in the process of being refactored (here's at least one of the issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it would be worth opening an issue with as much information as you can provide to, at the very least, have information available for others.

On Fri, Sep 25, 2015 at 7:08 AM, Jeff Jirsa <je...@crowdstrike.com> wrote:
The stack trace is similar to one I recall seeing recently, but don’t have in front of me. What follows is an outside chance, and I am not at all certain it is the case.

For EACH of the hundreds of nodes in your cluster, I suggest you run 

nodetool status | egrep "(^UN|^DN)" | wc -l

and count to see if every node really has every other node in its ring properly. 

I suspect, but am not at all sure, that you have inconsistencies you’re not yet aware of (for example, if you expect that you have 100 nodes in the cluster, I’m betting that the query above returns 99 on at least one of the nodes).  If this is the case, please reply so that you and I can submit a Jira and compare our stack traces and we can find the underlying root cause of this together. 

- Jeff

From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Thursday, September 24, 2015 at 9:10 PM
To: cassandra 

Subject: Re: Unable to remove dead node from cluster.

@Jeff, I just used JMX to connect to one node, ran unsafeAssassinateEndpoint, and passed in the "10.210.165.55" IP address.

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa <je...@crowdstrike.com> wrote:
When you run unsafeAssassinateEndpoint, to which host are you connected, and what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool status’ output?

From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 10:09 PM
To: cassandra
Cc: "dev@cassandra.apache.org"
Subject: Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <di...@gmail.com> wrote:
I have tried all of them; none of them worked.
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws the NPE I pasted before. Can we fix it?

Thanks
Dikang.
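
The order discussed above (decommission, then remove, then assassinate) can be read as a small decision rule. The sketch below is only an illustration of that reasoning; the function name and its two inputs are hypothetical and are not part of nodetool or Cassandra. It assumes decommission needs the node itself to be reachable and removenode needs a non-null Host ID, which matches the failures listed in this message.

```python
from typing import Optional


def choose_removal_step(node_reachable: bool, host_id: Optional[str]) -> str:
    """Pick the least drastic removal step that can still apply (hypothetical helper)."""
    if node_reachable:
        # decommission runs on the departing node, so that node must be up.
        return "decommission"
    if host_id:
        # removenode is addressed by Host ID, so a null Host ID rules it out.
        return "removenode"
    # Last resort: force the endpoint out of gossip.
    return "assassinate"


# The situation in this thread: host unreachable, Host ID shown as null.
print(choose_removal_step(node_reachable=False, host_id=None))
```

For the node in this thread the rule falls through to the last step, which is exactly where the NPE is hit.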

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <se...@datastax.com> wrote:

Order is decommission, remove, assassinate.

Which have you tried?

Re: Unable to remove dead node from cluster.

Posted by Dikang Gu <di...@gmail.com>.

The NPE is thrown when the node runs handleStateLeft, because it cannot find
the tokens associated with the dead node. Can we just ignore the NPE and
continue to remove the endpoint from the ring?
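
To make the question concrete, here is a toy model of the idea, not Cassandra's code. The dictionaries, the "TOKENS" key, and both function names are assumptions made up for illustration. It shows a token lookup that reports a missing entry instead of dereferencing it, and a removal step that carries on when the tokens are unknown.

```python
from typing import Dict, List, Optional

# Toy gossip view: endpoint -> application states. A dead node may have no TOKENS entry.
GossipState = Dict[str, Dict[str, str]]


def tokens_for(state: GossipState, endpoint: str) -> Optional[List[str]]:
    """Return the endpoint's tokens, or None when gossip has none recorded."""
    app_states = state.get(endpoint)
    if app_states is None or "TOKENS" not in app_states:
        return None
    return app_states["TOKENS"].split(",")


def handle_left(state: GossipState, ring: Dict[str, str], endpoint: str) -> None:
    """Remove an endpoint from the toy ring even if its tokens are unknown."""
    tokens = tokens_for(state, endpoint)
    if tokens is None:
        # Guard: fall back to whatever tokens the ring still maps to this endpoint.
        tokens = [t for t, owner in ring.items() if owner == endpoint]
    for token in tokens:
        ring.pop(token, None)
    state.pop(endpoint, None)


state = {"10.0.0.1": {"TOKENS": "1,2"}, "10.210.165.55": {}}
ring = {"1": "10.0.0.1", "2": "10.0.0.1", "3": "10.210.165.55"}
handle_left(state, ring, "10.210.165.55")
print(sorted(ring))
```

Whether skipping the missing state is safe in the real gossip path is a separate question for the Jira ticket.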

-- 
Dikang

Re: Unable to remove dead node from cluster.

Posted by Dikang Gu <di...@gmail.com>.

@Jeff, yeah, I ran the nodetool status | egrep count, and in my case some
nodes return "301" and some nodes return "300". 300 is the correct number of
nodes in my cluster.

So it does look like an inconsistency issue. Can you open a Jira for this?
Also, I'm looking for a quick fix/patch for this.

-- 
Dikang

Re: Unable to remove dead node from cluster.

Posted by Nate McCall <na...@thelastpickle.com>.

A few other folks have reported issues with lingering dead nodes on large
clusters - Jason Brown *just* gave an excellent gossip presentation at the
summit regarding gossip optimizations for large clusters.

Gossip is in the process of being refactored (here's at least one of the
issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it would
be worth opening an issue with as much information as you can provide to,
at the very least, have information available for others.

-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Unable to remove dead node from cluster.

Posted by Jeff Jirsa <je...@crowdstrike.com>.

The stack trace is similar to one I recall seeing recently, but I don’t have it in front of me. What follows is a long shot and is not at all certain to be the case.

For EACH of the hundreds of nodes in your cluster, I suggest you run 

nodetool status | egrep "(^UN|^DN)" | wc -l

and count to see if every node really has every other node in its ring properly. 

I suspect, but am not at all sure, that you have inconsistencies you’re not yet aware of (for example, if you expect 100 nodes in the cluster, I’m betting that the command above returns 99 on at least one of the nodes). If this is the case, please reply so that we can submit a Jira, compare our stack traces, and find the underlying root cause together.
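
As a rough illustration of the check described above, the sketch below counts the UN/DN rows in captured `nodetool status` output and flags hosts whose count differs from the most common one. The function names, the sample addresses other than 10.210.165.55, and the host IDs are made up for the example; collecting the output from each host is left out and would depend on your environment.

```python
from collections import Counter

def count_ring_members(status_output):
    """Count rows starting with UN or DN, like the egrep | wc -l pipeline."""
    return sum(1 for line in status_output.splitlines()
               if line.startswith(("UN", "DN")))

def find_outliers(counts_by_host):
    """Return hosts whose ring size differs from the most common count."""
    if not counts_by_host:
        return {}
    expected, _ = Counter(counts_by_host.values()).most_common(1)[0]
    return {host: n for host, n in counts_by_host.items() if n != expected}

# Hypothetical sample output; only the DN row mirrors the thread.
SAMPLE = """Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load  Tokens  Owns  Host ID  Rack
UN  10.0.0.1       1 GB  256     ?     aaaa     r1
DN  10.210.165.55  ?     256     ?     null     r1
"""

if __name__ == "__main__":
    print(count_ring_members(SAMPLE))
    # Hypothetical per-host counts gathered from three nodes.
    print(find_outliers({"node-a": 100, "node-b": 100, "node-c": 99}))
```

Any host reported by `find_outliers` would be a candidate for the kind of ring inconsistency suspected here.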

- Jeff

From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, September 24, 2015 at 9:10 PM
To:  cassandra
Subject:  Re: Unable to remove dead node from cluster.

@Jeff, I just use JMX to connect to one node, run unsafeAssassinateEndpoint, and pass in the "10.210.165.55" IP address.

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa <je...@crowdstrike.com> wrote:
When you run unsafeAssassinateEndpoint, to which host are you connected, and what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool status’ output?


From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 10:09 PM
To: cassandra
Cc: "dev@cassandra.apache.org"
Subject: Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <di...@gmail.com> wrote:
I have tried all of them; none of them worked.
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can we fix it?

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <se...@datastax.com> wrote:

Order is decommission, remove, assassinate.

Which have you tried?

On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:
Hi there, 

I have a dead node in our cluster, which is in a weird state right now, and cannot be removed from the cluster.

The nodetool status output shows:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                          Load       Tokens  Owns    Host ID                               Rack
DN  10.210.165.55                    ?          256     ?       null                                  r1

I tried the unsafeAssassinateEndpoint, but got exception like:
2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread Thread[GossipStage:1,5,main]
2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80670       at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80672       at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_45]
2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to local pause of 10852378435 > 5000000000

Any suggestions about how to remove it?
Thanks.

-- 
Dikang





Re: Unable to remove dead node from cluster.

Posted by Dikang Gu <di...@gmail.com>.

@Jeff, I just use JMX to connect to one node, run unsafeAssassinateEndpoint,
and pass in the "10.210.165.55" IP address.

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> When you run unsafeAssassinateEndpoint, to which host are you connected,
> and what argument are you passing?
>
> Are there other nodes in the ring that you’re not including in the
> ‘nodetool status’ output?
>
>
> From: Dikang Gu
> Reply-To: "user@cassandra.apache.org"
> Date: Tuesday, September 22, 2015 at 10:09 PM
> To: cassandra
> Cc: "dev@cassandra.apache.org"
> Subject: Re: Unable to remove dead node from cluster.
>
> ping.
>
> On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <di...@gmail.com> wrote:
>
>> I have tried all of them; none of them worked.
>> 1. decommission: the host had a hardware issue, and I cannot connect to it.
>> 2. remove: there is no Host ID, so removenode did not work.
>> 3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can
>> we fix it?
>>
>> Thanks
>> Dikang.
>>
>> On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
>> sebastian.estevez@datastax.com> wrote:
>>
>>> Order is decommission, remove, assassinate.
>>>
>>> Which have you tried?
>>> On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I have a dead node in our cluster, which is in a weird state right now,
>>>> and cannot be removed from the cluster.
>>>>
>>>> The nodetool status output shows:
>>>> Datacenter: DC1
>>>> ===============
>>>> Status=Up/Down
>>>> |/ State=Normal/Leaving/Joining/Moving
>>>> --  Address                          Load       Tokens  Owns    Host ID
>>>>                               Rack
>>>> DN  10.210.165.55                    ?          256     ?       null
>>>>                                r1
>>>>
>>>> I tried the unsafeAssassinateEndpoint, but got exception like:
>>>> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is
>>>> now DOWN
>>>> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
>>>> Thread[GossipStage:1,5,main]
>>>> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
>>>> 2015-09-18_23:21:40.80669       at
>>>> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80669       at
>>>> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80670       at
>>>> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80671       at
>>>> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80671       at
>>>> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80672       at
>>>> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80673       at
>>>> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80673       at
>>>> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80673       at
>>>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
>>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>>> 2015-09-18_23:21:40.80674       at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> ~[na:1.7.0_45]
>>>> 2015-09-18_23:21:40.80674       at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> ~[na:1.7.0_45]
>>>> 2015-09-18_23:21:40.80674       at
>>>> java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_45]
>>>> 2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to
>>>> local pause of 10852378435 > 5000000000
>>>>
>>>> Any suggestions about how to remove it?
>>>> Thanks.
>>>>
>>>> --
>>>> Dikang
>>>>
>>>>
>>
>>
>> --
>> Dikang
>>
>>
>
>
> --
> Dikang
>
>


-- 
Dikang

Re: Unable to remove dead node from cluster.

Posted by Jeff Jirsa <je...@crowdstrike.com>.

When you run unsafeAssassinateEndpoint, to which host are you connected, and what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool status’ output?


From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, September 22, 2015 at 10:09 PM
To:  cassandra
Cc:  "dev@cassandra.apache.org"
Subject:  Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <di...@gmail.com> wrote:
I have tried all of them; none of them worked.
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can we fix it?

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <se...@datastax.com> wrote:

Order is decommission, remove, assassinate.

Which have you tried?

On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:
Hi there, 

I have a dead node in our cluster, which is in a weird state right now, and cannot be removed from the cluster.

The nodetool status output shows:
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                          Load       Tokens  Owns    Host ID                               Rack
DN  10.210.165.55                    ?          256     ?       null                                  r1

I tried the unsafeAssassinateEndpoint, but got exception like:
2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread Thread[GossipStage:1,5,main]
2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80669       at org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80670       at org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671       at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80672       at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673       at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[na:1.7.0_45]
2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_45]
2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to local pause of 10852378435 > 5000000000

Any suggestions about how to remove it?
Thanks.

-- 
Dikang





Re: Unable to remove dead node from cluster.

Posted by Dikang Gu <di...@gmail.com>.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu <di...@gmail.com> wrote:

> I have tried all of them; none of them worked.
> 1. decommission: the host had a hardware issue, and I cannot connect to it.
> 2. remove: there is no Host ID, so removenode did not work.
> 3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can we
> fix it?
>
> Thanks
> Dikang.
>
> On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
> sebastian.estevez@datastax.com> wrote:
>
>> Order is decommission, remove, assassinate.
>>
>> Which have you tried?
>> On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I have a dead node in our cluster, which is in a weird state right now, and
>>> cannot be removed from the cluster.
>>>
>>> The nodetool status output shows:
>>> Datacenter: DC1
>>> ===============
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address                          Load       Tokens  Owns    Host ID
>>>                               Rack
>>> DN  10.210.165.55                    ?          256     ?       null
>>>                              r1
>>>
>>> I tried the unsafeAssassinateEndpoint, but got exception like:
>>> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is
>>> now DOWN
>>> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
>>> Thread[GossipStage:1,5,main]
>>> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
>>> 2015-09-18_23:21:40.80669       at
>>> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80669       at
>>> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80670       at
>>> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80671       at
>>> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80671       at
>>> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80672       at
>>> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80673       at
>>> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80673       at
>>> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80673       at
>>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80674       at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> ~[na:1.7.0_45]
>>> 2015-09-18_23:21:40.80674       at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> ~[na:1.7.0_45]
>>> 2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744)
>>> ~[na:1.7.0_45]
>>> 2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to
>>> local pause of 10852378435 > 5000000000
>>>
>>> Any suggestions about how to remove it?
>>> Thanks.
>>>
>>> --
>>> Dikang
>>>
>>>
>
>
> --
> Dikang
>
>


-- 
Dikang

Re: Unable to remove dead node from cluster.

Posted by Dikang Gu <di...@gmail.com>.

I have tried all of them; none of them worked.
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws an NPE, as I pasted before. Can we
fix it?
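
To make the order and the reasons above concrete, here is a minimal sketch of the decision as I understand it: decommission needs the node itself to be reachable, removenode needs a host ID, and assassinate is what remains. The function name and parameters are hypothetical and only illustrate the reasoning; they are not part of Cassandra or nodetool.

```python
def pick_removal_step(node_reachable, host_id):
    """Suggest a removal step in the order decommission, remove, assassinate."""
    if node_reachable:
        # decommission is run on the leaving node, so it must be up.
        return "decommission"
    if host_id and host_id != "null":
        # removenode is run from another node and needs the dead node's host ID.
        return "removenode"
    # Neither precondition holds, which is the situation in this thread.
    return "assassinate"

if __name__ == "__main__":
    # The dead node here: unreachable, and nodetool status shows Host ID "null".
    print(pick_removal_step(node_reachable=False, host_id="null"))
```

For the node in question this leaves only the assassinate step, which is the one that hits the NPE.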

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
sebastian.estevez@datastax.com> wrote:

> Order is decommission, remove, assassinate.
>
> Which have you tried?
> On Sep 21, 2015 10:47 AM, "Dikang Gu" <di...@gmail.com> wrote:
>
>> Hi there,
>>
>> I have a dead node in our cluster, which is in a weird state right now, and
>> cannot be removed from the cluster.
>>
>> The nodetool status output shows:
>> Datacenter: DC1
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address                          Load       Tokens  Owns    Host ID
>>                             Rack
>> DN  10.210.165.55                    ?          256     ?       null
>>                              r1
>>
>> I tried the unsafeAssassinateEndpoint, but got exception like:
>> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is
>> now DOWN
>> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
>> Thread[GossipStage:1,5,main]
>> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
>> 2015-09-18_23:21:40.80669       at
>> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80669       at
>> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80670       at
>> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80671       at
>> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80671       at
>> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80672       at
>> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80673       at
>> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80673       at
>> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80673       at
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80674       at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> ~[na:1.7.0_45]
>> 2015-09-18_23:21:40.80674       at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> ~[na:1.7.0_45]
>> 2015-09-18_23:21:40.80674       at java.lang.Thread.run(Thread.java:744)
>> ~[na:1.7.0_45]
>> 2015-09-18_23:21:40.85812 WARN  23:21:40 Not marking nodes down due to
>> local pause of 10852378435 > 5000000000
>>
>> Any suggestions about how to remove it?
>> Thanks.
>>
>> --
>> Dikang
>>
>>


-- 
Dikang