Posted to user@cassandra.apache.org by Keith Thornhill <ke...@raptr.com> on 2010/05/20 07:05:17 UTC

Ring out of sync, cassandra_UnavailableException being thrown

in a 5 node cluster, i noticed in our client error log that one of the
nodes was consistently throwing cassandra_UnavailableException during
a read operation.

looking into jmx, it was obvious that one node's view of the
ring was out of sync.

$ nodetool -host 192.168.20.150 ring
Address         Status  Load     Range                                      Ring
                                 139508497374977076191526400448759597506
192.168.20.156  Up      5.73 GB  733665530305941485083898696792520436       |<--|
192.168.20.158  Up      3.41 GB  9629533262984150011756238989685472219      |   ^
192.168.20.154  Up      2.44 GB  31048334058970902242412812423471654868     v   |
192.168.20.150  Up      4.89 GB  105769574715070648260922426249777160699    |   ^
192.168.20.152  Up      5.24 GB  139508497374977076191526400448759597506    |-->|

$ nodetool -host 192.168.20.158 ring
Address         Status  Load     Range                                      Ring
192.168.20.158  Up      3.41 GB  9629533262984150011756238989685472219      |<--|
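
The mismatch between the two ring views above can also be checked mechanically. What follows is a minimal Python sketch, assuming the `nodetool ring` output of each node has already been captured as text. The function names `ring_members` and `missing_peers` are hypothetical helpers made up for illustration, not part of nodetool or Cassandra, and the sample text is trimmed from the output above.

```python
import re

# An IPv4 address at the start of a line. Token-only lines contain no dots,
# so they do not match. Also tolerates "192.168.20.156Up" with no space.
ADDRESS_RE = re.compile(r"^\s*(\d{1,3}(?:\.\d{1,3}){3})")


def ring_members(ring_output):
    """Return the set of node addresses listed in one node's ring output."""
    members = set()
    for line in ring_output.splitlines():
        match = ADDRESS_RE.match(line)
        if match:
            members.add(match.group(1))
    return members


def missing_peers(views):
    """Given {host: ring_output}, map each host to the peers it does not list.

    The expected membership is assumed to be the union of all views, so a
    node that no view mentions at all cannot be detected this way.
    """
    parsed = {host: ring_members(text) for host, text in views.items()}
    everyone = set().union(*parsed.values()) if parsed else set()
    return {
        host: sorted(everyone - members)
        for host, members in parsed.items()
        if everyone - members
    }


# Sample input, trimmed from the two ring views shown above.
VIEW_150 = """\
192.168.20.156  Up  5.73 GB  733665530305941485083898696792520436
192.168.20.158  Up  3.41 GB  9629533262984150011756238989685472219
192.168.20.154  Up  2.44 GB  31048334058970902242412812423471654868
192.168.20.150  Up  4.89 GB  105769574715070648260922426249777160699
192.168.20.152  Up  5.24 GB  139508497374977076191526400448759597506
"""
VIEW_158 = """\
192.168.20.158  Up  3.41 GB  9629533262984150011756238989685472219
"""

if __name__ == "__main__":
    views = {"192.168.20.150": VIEW_150, "192.168.20.158": VIEW_158}
    for host, absent in missing_peers(views).items():
        print(host, "does not list:", ", ".join(absent))
```

With the two views above, only 192.168.20.158 is reported, as missing the other four nodes. Collecting the output from every node (for example by running `nodetool -host <address> ring` once per node) is left out of the sketch.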

looking at the CF stats on that node, it is obvious that reads and
writes are happening, but i have to assume that those are coming from
proxy connections via the other nodes.

when restarting that node, the error logs in the other cluster nodes
show that they detect the server going away and then coming back into
the ring.

INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:39,448 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
INFO [WRITE-/192.168.20.158] 2010-05-19 21:27:55,475 OutboundTcpConnection.java (line 102) error writing to /192.168.20.158
INFO [GMFD:1] 2010-05-19 21:27:56,481 Gossiper.java (line 582) Node /192.168.20.158 has restarted, now UP again
INFO [GMFD:1] 2010-05-19 21:27:56,482 StorageService.java (line 538) Node /192.168.20.158 state jump to normal

any ideas on how to kick that node and remind it of its buddies?

thanks!
-keith

Re: Ring out of sync, cassandra_UnavailableException being thrown

Posted by Jonathan Ellis <jb...@gmail.com>.

Were you bootstrapping or otherwise moving nodes around?

I don't think anyone's tracked this bug down further than "if you
restart the entire cluster, it goes away."

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com