Posted to user@cassandra.apache.org by fa...@idioplatform.com on 2014/12/05 10:50:32 UTC

Cassandra memory & joining issues

Hello,


A recent incident has brought to light two potential problems:
1. A node can start flapping (repeatedly going down and coming back up), possibly due to memory pressure.
2. We can't bootstrap new nodes into the cluster.


Here is an account of the incident.


Setup: a 3-node cluster using vnodes (nodes A, B & C), Cassandra version 2.0.10.


1. We get an alert that a node is down (SD alert at 12:33)
2. We turn off the app that uses Cassandra most heavily.
3. Node A is down, its CPU is high, and it goes into repeated, very long garbage-collection cycles:
    INFO [ScheduledTasks:1] 2014-12-01 12:22:05,691 GCInspector.java (line 116) GC for ParNew: 2160 ms for 2 collections, 2847691776 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:22:06,658 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 13545 ms for 1 collections, 2801612640 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:22:48,250 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 15891 ms for 1 collections, 3620884464 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:23:07,925 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 16789 ms for 1 collections, 3696864640 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:23:26,338 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 16777 ms for 1 collections, 3733452048 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:23:46,990 GCInspector.java (line 116) GC for ParNew: 2783 ms for 5 collections, 3782932912 used; max is 3911188480
    INFO [ScheduledTasks:1] 2014-12-01 12:23:46,990 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 17203 ms for 1 collections, 3783141880 used; max is 3911188480
    ..........
    INFO [ScheduledTasks:1] 2014-12-01 12:30:35,256 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30084 ms for 2 collections, 3892036536 used; max is 3911188480
4. Datastax Opscenter reports it as down
5. It keeps flapping; the logs show it restarting and immediately falling back into the same GC cycles.
6. Restarting the node manually does not help.
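For context, some quick arithmetic on the figures in the last GC line above (byte values copied verbatim from the log) shows the heap was essentially full:

```shell
# Figures from the 12:30:35 GCInspector line: bytes used vs. max heap
used=3892036536; max=3911188480
awk -v u=$used -v m=$max \
  'BEGIN { printf "heap %.1f%% full (%.2f GiB of %.2f GiB)\n", 100*u/m, u/2^30, m/2^30 }'
# prints: heap 99.5% full (3.62 GiB of 3.64 GiB)
```

So CMS was running back-to-back while reclaiming almost nothing, which I assume is what made the node unresponsive enough to be marked down.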


At this point we decide to replace A, so:
1. We bring up a new node (D) with Cassandra stopped.
2. We stop node A
3. Before we start Cassandra on node D, node B stops responding to 'nodetool status'.
4. Node C reports B as up and after a while reports it as down
5. We turn node A back on; its CPU is still high, but it no longer drops out and the logs no longer show long GC cycles.
6. We turn off node B
7. We start Cassandra on node D and get this error on startup:
    ERROR [main] 2014-12-01 14:04:48,332 CassandraDaemon.java (line 513) Exception encountered during startup
    java.lang.IllegalStateException: unable to find sufficient sources for streaming range (2337155766868590732,2355076515890621387]
      at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:201)
      at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:125)
      at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:72)
      at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:994)
      at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:797)
      at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
      at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
      at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
      at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
      at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
     INFO [StorageServiceShutdownHook] 2014-12-01 14:04:48,338 Gossiper.java (line 1279) Announcing shutdown
8. We restart node B and it joins the cluster fine.
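One thing I'm unsure about: we brought D in as a brand-new bootstrapping node rather than as a replacement for A. If I've read the 2.0 docs correctly, a dead node is meant to be replaced by starting the new machine with the replace_address option, along these lines (the address below is a placeholder for A's IP):

```shell
# In cassandra-env.sh on node D, before first start
# (placeholder address standing in for dead node A's IP):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<address_of_A>"
```

Would that have let D take over A's token ranges directly, instead of bootstrapping new ranges and needing both B and C up as streaming sources?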


Any help/pointers so that we can understand what happened and prevent it from happening in the future would be appreciated.


Thanks,

Farouk

—
Sent from Mailbox