Posted to commits@cassandra.apache.org by "Mike Heffner (JIRA)" <ji...@apache.org> on 2014/07/09 04:29:05 UTC

[jira] [Created] (CASSANDRA-7522) Bootstrapping a single node spikes cluster-wide p95 latencies

Mike Heffner created CASSANDRA-7522:
---------------------------------------

             Summary: Bootstrapping a single node spikes cluster-wide p95 latencies
                 Key: CASSANDRA-7522
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7522
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: AWS, i2.2xlarge HVM instances
            Reporter: Mike Heffner


We've recently run some tests with Cassandra 2.0.9, largely because we are interested in the streaming improvements in the 2.0.x series (see CASSANDRA-5726). However, our results so far show that even with 2.0.x, streaming impacts are still quite large and hard to control for.

Our test environment was a 9-node 2.0.9 ring running on AWS i2.2xlarge HVM instances with Oracle JVM 1.7.0_55. Each node was configured to use vnodes with 256 tokens. We tested expanding this ring to 12 nodes, bootstrapping each new node with different throttle settings applied around the ring:

1st node:
* no throttle, stream/compaction throughput = 0

2nd node:
* stream throughput = 200
* compaction throughput = 256

3rd node:
* stream throughput = 50
* compaction throughput = 65
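For reference, the throttle values above map to the standard cassandra.yaml knobs (they can also be changed at runtime with {{nodetool setstreamthroughput}} / {{nodetool setcompactionthroughput}}). As an example, the 2nd node's settings would look like:

{code}
# cassandra.yaml throttle settings for the 2nd node
stream_throughput_outbound_megabits_per_sec: 200
compaction_throughput_mb_per_sec: 256

# 1st node: both values set to 0 (unthrottled)
{code}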

This is a graph of p95 write latencies (the ring was not taking reads), showing each node bootstrapping, left to right. The p95 latencies rise from about 200ms to ~500ms:

http://snapshots.librato.com/instrument/5j9l3qiq-7462.png

The write latencies appear to be largely driven by CPU as shown by:

http://snapshots.librato.com/instrument/xsfb688i-7463.png

Network graphs show that the joining nodes follow approximately the same bandwidth pattern:

http://snapshots.librato.com/instrument/ljvkvg6y-7464.png

What are the expected performance behaviors during bootstrapping / ring expansion? The storage loads in this test were fairly small, so the spikes were short-lived; at a much larger production load we would need to sustain these spikes for hours. As far as we could tell, the throttle controls did not help.

These are our current config changes:

{code}
-concurrent_reads: 32
-concurrent_writes: 32
+concurrent_reads: 64
+concurrent_writes: 64

-memtable_flush_queue_size: 4
+memtable_flush_queue_size: 5

-rpc_server_type: sync
+rpc_server_type: hsha

-#concurrent_compactors: 1
+concurrent_compactors: 6

-cross_node_timeout: false
+cross_node_timeout: true
-# phi_convict_threshold: 8
+phi_convict_threshold: 12

-endpoint_snitch: SimpleSnitch
+endpoint_snitch: Ec2Snitch

-internode_compression: all
+internode_compression: none
{code}

Heap settings:

{code}
export MAX_HEAP_SIZE="10G"
export HEAP_NEWSIZE="2G"
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)