Posted to commits@cassandra.apache.org by "Mariusz Gronczewski (JIRA)" <ji...@apache.org> on 2013/01/14 13:34:12 UTC
[jira] [Updated] (CASSANDRA-5154) Gossip sends removed node which
causes restarted nodes to constantly create new threads
[ https://issues.apache.org/jira/browse/CASSANDRA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mariusz Gronczewski updated CASSANDRA-5154:
-------------------------------------------
Description:
Our Cassandra cluster had 14 nodes, but it was mostly idle, so about 2 weeks ago we removed 3 of them (via standard decommission) and moved tokens to balance the load.
Since then no node had been restarted, but last week, after restarting 2 of them, we observed that both spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is the IP of one of the removed nodes) until they hit the thread limit (800 on our system), and then Cassandra dies. Nodes that were not restarted do not do this. There are no outgoing connections to the dead nodes.
I noticed that the dead nodes still appear in nodetool gossipinfo on the non-restarted nodes but not on the restarted ones, so it seems they were not properly removed from gossip.
Would a rolling restart fix this, or is a full cluster stop-start required?
Trace from the hanging threads:
{code}
"WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000007536a1160> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
{code}
was:
Our Cassandra cluster had 14 nodes, but it was mostly idle, so about 2 weeks ago we removed 3 of them (via standard decommission) and moved tokens to balance the load.
Since then no node had been restarted, but last week, after restarting 2 of them, we observed that both spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is the IP of one of the removed nodes) until they hit the thread limit (800 on our system), and then Cassandra dies. Nodes that were not restarted do not do this. There are no outgoing connections to the dead nodes.
I noticed that the dead nodes still appear in nodetool gossipinfo on the non-restarted nodes but not on the restarted ones, so it seems they were not properly removed from gossip.
Would a rolling restart fix this, or is a full cluster stop-start required?
> Gossip sends removed node which causes restarted nodes to constantly create new threads
> ---------------------------------------------------------------------------------------
>
> Key: CASSANDRA-5154
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5154
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.1.7
> Environment: centos 6, JVM 1.6.0_37
> Reporter: Mariusz Gronczewski
>
> Our Cassandra cluster had 14 nodes, but it was mostly idle, so about 2 weeks ago we removed 3 of them (via standard decommission) and moved tokens to balance the load.
> Since then no node had been restarted, but last week, after restarting 2 of them, we observed that both spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is the IP of one of the removed nodes) until they hit the thread limit (800 on our system), and then Cassandra dies. Nodes that were not restarted do not do this. There are no outgoing connections to the dead nodes.
> I noticed that the dead nodes still appear in nodetool gossipinfo on the non-restarted nodes but not on the restarted ones, so it seems they were not properly removed from gossip.
> Would a rolling restart fix this, or is a full cluster stop-start required?
> Trace from the hanging threads:
> {code}
> "WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000007536a1160> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>         at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
> {code}
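To check how many of these WRITE- threads a restarted node has accumulated for each removed endpoint, a saved thread dump can be tallied by thread name. The following is only a rough sketch and not part of the original report: it assumes the dump is plain jstack-style text in which every thread entry begins with its quoted name, as in the trace above. The helper names count_threads_by_name and write_threads_per_endpoint are hypothetical.

```python
import re
from collections import Counter

# Header line of a thread entry in a jstack-style dump, e.g.
#   "WRITE-/1.2.3.4" daemon prio=10 tid=0x... nid=0x... waiting on condition [...]
# Assumption: the quoted thread name is the first thing on the header line.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def count_threads_by_name(dump_text):
    """Return a Counter mapping each thread name to its number of entries."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line.strip())
        if match:
            counts[match.group("name")] += 1
    return counts


def write_threads_per_endpoint(dump_text, prefix="WRITE-/"):
    """Return {endpoint: thread count} for threads whose name starts with prefix."""
    return {
        name[len(prefix):]: count
        for name, count in count_threads_by_name(dump_text).items()
        if name.startswith(prefix)
    }


if __name__ == "__main__":
    import sys

    # Usage: python count_write_threads.py < threaddump.txt
    per_endpoint = write_threads_per_endpoint(sys.stdin.read())
    for endpoint, count in sorted(per_endpoint.items()):
        print(endpoint, count)
```

A count for a removed node's IP that keeps growing between successive dumps would match the behaviour described above. Endpoints that are still cluster members are expected to appear as well, so only the addresses of decommissioned nodes are of interest.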
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira