You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org> on 2011/10/13 21:07:12 UTC
[jira] [Commented] (CASSANDRA-3226) Single faulty node brings down
entire cluster. No reads/writes possible
[ https://issues.apache.org/jira/browse/CASSANDRA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126815#comment-13126815 ]
Jonathan Ellis commented on CASSANDRA-3226:
-------------------------------------------
Any update, Thibaut?
> Single faulty node brings down entire cluster. No reads/writes possible
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-3226
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.8.5
> Environment: linux
> Reporter: Thibaut
> Attachments: jstack
>
>
> No client is able to read anything from the entire cluster anymore. This occured a few times so far, but I can't reproduce it.
> Looks like an OOM directly after starting up the node? Restarting the node "solves" the issue. I also have to kill the node with -9 because normal kill won't kill the node.
> Healty nodes:
> *.13:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15237 416868
> Responses n/a 0 126721
> *.14.
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15387 437325
> Responses n/a 0 131066
> *.15:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 15530 368771
> Responses n/a 0 145168
> etc... The pending commands at the healty nodes all increase.
> Faulty node before restart:
> /software/cassandra/bin/nodetool -h localhost info
> Token : f33
> Gossip active : true
> Load : 130.67 GB
> Generation No : 1316197687
> Uptime (seconds) : 137051
> Heap Memory (MB) : 3580.22 / 3614.00
> Data Center : datacenter1
> Rack : rack1
> Exceptions : 108
> /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 0 29696566
> Responses n/a 560 26650981
> Log excerpt:
> INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node /192.168.0.11 has restarted, now UP again
> INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681) InetAddress /192.168.0.11 is now UP
> INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815) Node /192.168.0.11 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815) Node /192.168.0.11 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815) Node /192.168.0.6 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681) InetAddress /192.168.0.6 is now UP
> INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815) Node /192.168.0.18 state jump to normal
> INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681) InetAddress /192.168.0.18 is now UP
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node /192.168.0.1 has restarted, now UP again
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681) InetAddress /192.168.0.1 is now UP
> INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815) Node /192.168.0.1 state jump to normal
> INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.8
> INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.8
> INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.7
> INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.7
> INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.20
> INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.20
> INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.9
> INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.9
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.6
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
> at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.3
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.17
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.11
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.6
> INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
> at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.18
> INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.18
> ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-554,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
> at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-542,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
> at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
> at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 323) Started hinted handoff for endpoint /192.168.0.1
> INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 379) Finished hinted handoff of 0 rows to endpoint /192.168.0.1
> ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-535,5,main]
> If I restart the node (kill -9), everything works fine again. No OOM!.
> /software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name Active Pending Completed
> Commands n/a 0 131351
> Responses n/a 0 190697
> /software/scripts# /software/cassandra/bin/nodetool -h localhost info
> Token : f33
> Gossip active : true
> Load : 88.54 GB
> Generation No : 1316334645
> Uptime (seconds) : 494
> Heap Memory (MB) : 1582.95 / 3614.00
> Data Center : datacenter1
> Rack : rack1
> Exceptions : 0
> /software/scripts# /software/cassandra/bin/nodetool -h localhost ring
> Address DC Rack Status State Load Owns Token
> ffffffffffffffff
> 192.168.0.1 datacenter1 rack1 Up Normal 85.2 GB 0.02% 0cc
> 192.168.0.2 datacenter1 rack1 Up Normal 86.94 GB 0.02% 199
> 192.168.0.3 datacenter1 rack1 Up Normal 85.24 GB 0.02% 266
> 192.168.0.4 datacenter1 rack1 Up Normal 86.38 GB 0.02% 333
> 192.168.0.5 datacenter1 rack1 Up Normal 86.96 GB 0.02% 400
> 192.168.0.6 datacenter1 rack1 Up Normal 86.17 GB 0.02% 4cc
> 192.168.0.7 datacenter1 rack1 Up Normal 83.88 GB 0.02% 599
> 192.168.0.8 datacenter1 rack1 Up Normal 84.42 GB 0.02% 666
> 192.168.0.9 datacenter1 rack1 Up Normal 85.06 GB 0.02% 733
> 192.168.0.10 datacenter1 rack1 Up Normal 83.08 GB 0.02% 7ff
> 192.168.0.11 datacenter1 rack1 Up Normal 86.22 GB 0.02% 8cc
> 192.168.0.12 datacenter1 rack1 Up Normal 85.94 GB 0.02% 999
> 192.168.0.13 datacenter1 rack1 Up Normal 85.01 GB 0.02% a66
> 192.168.0.14 datacenter1 rack1 Up Normal 86.5 GB 0.02% b33
> 192.168.0.15 datacenter1 rack1 Up Normal 83.33 GB 0.02% c00
> 192.168.0.16 datacenter1 rack1 Up Normal 84.41 GB 0.02% ccc
> 192.168.0.17 datacenter1 rack1 Up Normal 86.97 GB 28.51% d99
> 192.168.0.18 datacenter1 rack1 Up Normal 112.63 GB 41.88% e66
> 192.168.0.19 datacenter1 rack1 Up Normal 88.56 GB 29.27% f33
> 192.168.0.20 datacenter1 rack1 Up Normal 85.83 GB 0.02% ffffffffffffffff
> Interestingly, after restart, the node load (from nodetool info) is reduced). Any ideas? The node doesn't seem to have any hardware memory issues.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira