Posted to commits@cassandra.apache.org by "Mike Heffner (JIRA)" <ji...@apache.org> on 2014/07/09 04:29:05 UTC
[jira] [Created] (CASSANDRA-7522) Bootstrapping a single node spikes cluster-wide p95 latencies
Mike Heffner created CASSANDRA-7522:
---------------------------------------
Summary: Bootstrapping a single node spikes cluster-wide p95 latencies
Key: CASSANDRA-7522
URL: https://issues.apache.org/jira/browse/CASSANDRA-7522
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: AWS, i2.2xlarge HVM instances
Reporter: Mike Heffner
We've recently run some tests with Cassandra 2.0.9, largely because we are interested in the streaming improvements in the 2.0.x series (see CASSANDRA-5726). However, our results so far show that even with 2.0.x, the impact of streaming is still quite large and hard to control for.
Our test environment was a 9-node 2.0.9 ring running on AWS i2.2xlarge HVM instances with Oracle JVM 1.7.0_55. Each node is configured for vnodes with 256 tokens. We tested expanding this ring to 12 nodes, bootstrapping the new nodes one at a time with different throttle settings applied around the ring (equivalent nodetool commands are sketched after the list):
1st node:
* no throttle, stream/compaction throughput = 0
2nd node:
* stream throughput = 200
* compaction throughput = 256
3rd node:
* stream throughput = 50
* compaction throughput = 65
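For reference, this is roughly how we applied those throttles at runtime (a sketch using the standard nodetool commands of the 2.0 series; stream throughput is in megabits/s, compaction throughput in MB/s, and 0 disables the throttle):
{code}
# 1st node's bootstrap: both throttles disabled (0 = unthrottled)
nodetool setstreamthroughput 0
nodetool setcompactionthroughput 0

# 2nd node's bootstrap
nodetool setstreamthroughput 200
nodetool setcompactionthroughput 256

# 3rd node's bootstrap
nodetool setstreamthroughput 50
nodetool setcompactionthroughput 65
{code}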
This is a graph of p95 write latencies (the ring was not taking reads), showing each node bootstrapping from left to right. The p95 latencies go from about 200ms to ~500ms.
http://snapshots.librato.com/instrument/5j9l3qiq-7462.png
The write latencies appear to be largely driven by CPU as shown by:
http://snapshots.librato.com/instrument/xsfb688i-7463.png
Network graphs show that the joining nodes follow approximately the same bandwidth pattern:
http://snapshots.librato.com/instrument/ljvkvg6y-7464.png
What are the expected performance behaviors during bootstrapping / ring expansion? The storage loads in this test were fairly small, so the spikes were short-lived; at a much larger production load we would need to sustain them for hours. As far as we could tell, the throttle controls did not help.
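For anyone reproducing this: streaming progress and the compaction backlog during a join can be watched with the standard nodetool commands (a minimal sketch; exact output format varies by version):
{code}
# On the joining node (or any stream source): lists active streams and bytes transferred
nodetool netstats

# Compactions queued up as streamed SSTables land
nodetool compactionstats
{code}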
These are our current config changes:
{code}
-concurrent_reads: 32
-concurrent_writes: 32
+concurrent_reads: 64
+concurrent_writes: 64
-memtable_flush_queue_size: 4
+memtable_flush_queue_size: 5
-rpc_server_type: sync
+rpc_server_type: hsha
-#concurrent_compactors: 1
+concurrent_compactors: 6
-cross_node_timeout: false
+cross_node_timeout: true
-# phi_convict_threshold: 8
+phi_convict_threshold: 12
-endpoint_snitch: SimpleSnitch
+endpoint_snitch: Ec2Snitch
-internode_compression: all
+internode_compression: none
{code}
Heap settings:
{code}
export MAX_HEAP_SIZE="10G"
export HEAP_NEWSIZE="2G"
{code}
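For context, cassandra-env.sh turns those two variables into the JVM flags below (a sketch of the effective settings, assuming an unmodified 2.0 cassandra-env.sh):
{code}
# Effective heap flags derived from the variables above
JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"  # initial heap: 10G
JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"  # max heap: 10G
JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"   # young generation: 2G
{code}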