Posted to user@cassandra.apache.org by Ray Slakinski <ra...@gmail.com> on 2011/07/13 17:10:35 UTC

One node down but it thinks it's fine...

One of our nodes, which happens to be the seed, thinks it is Up and that all the other nodes are Down. However, all the other nodes think the seed is Down instead. The logs for the seed node show everything running as it should. I've tried restarting the node and turning gossip and thrift off and on, but nothing gets the node to see the rest of its ring as up and running. I have also tried restarting one of the other nodes, which had no effect on the situation. Below are the ring outputs for the seed and one other node in the ring, plus a ping to show that the seed can reach the other node.

# bin/nodetool -h 0.0.0.0 ring
Address         Status  State   Load      Owns    Token
                                                  141784319550391026443072753096570088105
127.0.0.1       Up      Normal  4.61 GB   16.67%  0
xx.xxx.30.210   Down    Normal  ?         16.67%  28356863910078205288614550619314017621
xx.xx.90.87     Down    Normal  ?         16.67%  56713727820156410577229101238628035242
xx.xx.22.236    Down    Normal  ?         16.67%  85070591730234615865843651857942052863
xx.xx.97.96     Down    Normal  ?         16.67%  113427455640312821154458202477256070484
xx.xxx.17.122   Down    Normal  ?         16.67%  141784319550391026443072753096570088105


# ping xx.xxx.30.210
PING xx.xxx.30.210 (xx.xxx.30.210) 56(84) bytes of data.
64 bytes from xx.xxx.30.210: icmp_req=1 ttl=61 time=0.299 ms
64 bytes from xx.xxx.30.210: icmp_req=2 ttl=61 time=0.287 ms
^C
--- xx.xxx.30.210 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.287/0.293/0.299/0.006 ms


# bin/nodetool -h xx.xxx.30.210 ring
Address         Status  State   Load      Owns    Token
                                                  141784319550391026443072753096570088105
xx.xxx.23.40    Down    Normal  ?         16.67%  0
xx.xxx.30.210   Up      Normal  10.58 GB  16.67%  28356863910078205288614550619314017621
xx.xx.90.87     Up      Normal  10.47 GB  16.67%  56713727820156410577229101238628035242
xx.xx.22.236    Up      Normal  9.63 GB   16.67%  85070591730234615865843651857942052863
xx.xx.97.96     Up      Normal  10.68 GB  16.67%  113427455640312821154458202477256070484
xx.xxx.17.122   Up      Normal  10.18 GB  16.67%  141784319550391026443072753096570088105
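
One detail worth pulling out of the two outputs: the seed's own view lists token 0 at 127.0.0.1, while the other node lists token 0 at xx.xxx.23.40. The following is a minimal Python sketch for spotting that kind of thing, assuming the ring output has been saved as text. The function names and the SEED_VIEW sample are made up for illustration and are not part of nodetool.

```python
import ipaddress


def parse_ring(text):
    """Return {address: status} from `nodetool ring` text like the output above."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with an address and carry Up/Down in the second column;
        # the header row and the lone token row do not match and are skipped.
        if len(fields) >= 2 and fields[1] in ("Up", "Down"):
            nodes[fields[0]] = fields[1]
    return nodes


def loopback_addresses(nodes):
    """Addresses in a ring view that are loopback addresses."""
    found = []
    for addr in nodes:
        try:
            if ipaddress.ip_address(addr).is_loopback:
                found.append(addr)
        except ValueError:
            pass  # masked addresses such as xx.xxx.30.210 do not parse; skip them
    return found


# Shortened copy of the seed's view from the post above.
SEED_VIEW = """\
Address Status State Load Owns Token
 141784319550391026443072753096570088105
127.0.0.1 Up Normal 4.61 GB 16.67% 0
xx.xxx.30.210 Down Normal ? 16.67% 28356863910078205288614550619314017621
"""

if __name__ == "__main__":
    view = parse_ring(SEED_VIEW)
    print("down:", [a for a, s in view.items() if s == "Down"])
    print("loopback:", loopback_addresses(view))
```

A loopback address in a multi-node ring view is only a hint that the node may be advertising an address its peers cannot reach; it does not by itself say why.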

-- 
Ray Slakinski

Re: One node down but it thinks it's fine...

Posted by Ray Slakinski <ra...@gmail.com>.

And fixed! A co-worker put in a bad host line entry last night that threw it all off :( Thanks for the assist, guys.
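
For anyone landing here with the same symptom, a quick check is sketched below in Python. It assumes the "bad host line entry" was a hosts-file line that made the node's own hostname resolve to a loopback address, which this thread does not spell out. The function name is hypothetical.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname):
    """True if any IPv4 address the name resolves to is a loopback address."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return any(ipaddress.ip_address(a).is_loopback for a in addresses)


if __name__ == "__main__":
    name = socket.gethostname()
    try:
        print(name, "resolves to loopback:", resolves_to_loopback(name))
    except OSError as exc:
        # Raised when the local hostname cannot be resolved at all.
        print(name, "does not resolve:", exc)
```

Run it on the node in question; a result of True on a machine that is supposed to talk to other nodes is worth a closer look at the hosts file.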

-- 
Ray Slakinski



Re: One node down but it thinks it's fine...

Posted by Ray Slakinski <ra...@gmail.com>.

It was all working before, but we ran out of file handles and ended up restarting the nodes. No YAML changes have occurred.
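
As a side note on the file-handle exhaustion, here is a small Python sketch for reading the open-file limits. It reports the limits of the process running the script, so it only reflects the Cassandra JVM's limits if it is run as the same user under the same limits configuration; that is an assumption, and the function name is made up.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
```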

Ray Slakinski


Re: One node down but it thinks it's fine...

Posted by Sasha Dolgy <sd...@gmail.com>.

Any firewall changes? Ping is fine, but if you can't get from
node(a) to nodes(n) on the specific ports...
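
A minimal Python sketch of that port check follows. Ping only shows ICMP reachability, so this tries an actual TCP connection instead. The port numbers are assumed defaults (7000 for inter-node traffic, 9160 for Thrift); confirm them against storage_port and rpc_port in your own cassandra.yaml. The function names are hypothetical.

```python
import socket

# Assumed default ports; check cassandra.yaml for the values actually in use.
DEFAULT_PORTS = {"storage": 7000, "thrift": 9160}


def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_node(host, ports=DEFAULT_PORTS):
    """Map each named port to whether it is reachable on the given host."""
    return {name: can_connect(host, port) for name, port in ports.items()}
```

Calling check_node with a peer's address from the seed, and again from a peer toward the seed, covers both directions; a False next to a successful ping points at a firewall or at the service listening on a different interface.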




-- 
Sasha Dolgy
sasha.dolgy@gmail.com

Re: One node down but it thinks it's fine...

Posted by samal <sa...@wakya.in>.

Check that the seed IP is the same on all nodes; it should not be a loopback IP in a cluster.
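
That check can be sketched in Python as below. It assumes the seed lists have already been collected from each node's configuration into a dictionary; how they are gathered is left out. The function name, node names, and example addresses are all hypothetical.

```python
import ipaddress


def check_seeds(seeds_by_node):
    """Return a list of problems found in a {node_name: [seed, ...]} mapping."""
    problems = []
    # Every node should carry the same set of seeds.
    distinct = {tuple(sorted(seeds)) for seeds in seeds_by_node.values()}
    if len(distinct) > 1:
        problems.append("seed lists differ between nodes")
    for node, seeds in seeds_by_node.items():
        for seed in seeds:
            try:
                loopback = ipaddress.ip_address(seed).is_loopback
            except ValueError:
                # Not an IP literal; only the obvious hostname is flagged.
                loopback = seed == "localhost"
            if loopback:
                problems.append(f"{node}: seed {seed} is a loopback address")
    return problems


if __name__ == "__main__":
    example = {"node-a": ["127.0.0.1"], "node-b": ["192.0.2.10"]}
    for problem in check_seeds(example):
        print(problem)
```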
