Posted to user@cassandra.apache.org by Jake Maizel <ja...@soundcloud.com> on 2011/11/10 10:07:04 UTC

Nodes Flapping in the Ring

We have a new 6-node cluster running 0.6.13 (due to some client-side issues
we need to stay on 0.6.x for the time being) that we are injecting data into,
and we have run into nodes going down and then coming back up quickly in the
ring.  All nodes are affected and we have ruled out the network layer.

It happens on all nodes and seems related to GC or memtable flushes.  We had
things stable, but after a series of data migrations we saw some swapping, so
we tuned the max heap down.  This helped with the swapping, but the flapping
still persists.

The systems have 6 cores and 24 GB of RAM; max heap is at 12G.  We are using
the Parallel GC collector for throughput.

Our run file for starting cassandra looks like this:

exec 2>&1

ulimit -n 262144

cd /opt/cassandra-0.6.13

exec chpst -u cassandra java \
  -ea \
  -Xms4G \
  -Xmx12G \
  -XX:TargetSurvivorRatio=90 \
  -XX:+PrintGCDetails \
  -XX:+AggressiveOpts \
  -XX:+UseParallelGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=128 \
  -XX:MaxTenuringThreshold=0 \
  -Djava.rmi.server.hostname=10.20.3.155 \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcassandra-foreground=yes \
  -Dstorage-config=/etc/cassandra \
  -cp '/etc/cassandra:/opt/cassandra-0.6.13/lib/*' \
  org.apache.cassandra.thrift.CassandraDaemon <&-
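
One way to test the GC theory would be to log stop-the-world pauses and see
whether the long ones line up with the times nodes are marked down.  The
sketch below is not part of the run file above.  It assumes the JVM has
additionally been started with -Xloggc:<file> and
-XX:+PrintGCApplicationStoppedTime, and that the log contains lines of the
form "Total time for which application threads were stopped: N seconds".
The function name long_pauses is made up for illustration.

```shell
#!/bin/sh
# Print stop-the-world pauses longer than a threshold from a HotSpot GC log.
# Assumption: the log was written with -XX:+PrintGCApplicationStoppedTime,
# so pause lines look like:
#   Total time for which application threads were stopped: 1.2345678 seconds
long_pauses() {
  # $1 = threshold in seconds, $2 = path to the GC log
  awk -v limit="$1" '
    /Total time for which application threads were stopped:/ {
      for (i = 1; i < NF; i++) {
        # The pause length is the field right after "stopped:".
        if ($i == "stopped:" && ($(i + 1) + 0) > (limit + 0)) {
          print
          break
        }
      }
    }' "$2"
}
```

For example, `long_pauses 5 /var/log/cassandra/gc.log` (the path is
hypothetical) would list every pause longer than five seconds.  Pauses of
several seconds are long enough that they could plausibly disrupt gossip
between nodes, so matching these against the up/down messages in the
Cassandra log should help confirm or rule out GC as the cause.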

Our storage conf looks like this for the memory/disk settings:


  <!--======================================================================-->
  <!-- Memory, Disk, and Performance                                        -->
  <!--======================================================================-->
  <DiskAccessMode>mmap</DiskAccessMode>
  <RowWarningThresholdInMB>4</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>

  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>64</FlushIndexBufferSizeInMB>

  <ColumnIndexSizeInKB>16</ColumnIndexSizeInKB>

  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

  <ConcurrentReads>12</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>

  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>

  <GCGraceSeconds>864000</GCGraceSeconds>

  <DoConsistencyChecksBoolean>true</DoConsistencyChecksBoolean>
</Storage>

Any thoughts on this would be really interesting.

-- 
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: jake@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE