Posted to commits@cassandra.apache.org by "Michael Shuler (JIRA)" <ji...@apache.org> on 2014/07/09 16:02:05 UTC

[jira] [Resolved] (CASSANDRA-7522) Bootstrapping a single node spikes cluster-wide p95 latencies

     [ https://issues.apache.org/jira/browse/CASSANDRA-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Shuler resolved CASSANDRA-7522.
---------------------------------------

    Resolution: Not a Problem

Closing; this is an observation rather than a bug, and would be a good topic for the mailing list.

> Bootstrapping a single node spikes cluster-wide p95 latencies
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-7522
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7522
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: AWS, i2.2xlarge HVM instances
>            Reporter: Mike Heffner
>
> We've recently run some tests with Cassandra 2.0.9, largely because we are interested in the streaming improvements in the 2.0.x series (see CASSANDRA-5726). However, our results so far show that even with 2.0.x, the impact of streaming is still quite large and hard to control.
> Our test environment was a 9-node, 2.0.9 ring running on AWS on i2.2xlarge HVM instances with Oracle JVM 1.7.0_55. Each node uses vnodes with 256 tokens. We tested expanding this ring to 12 nodes, bootstrapping each new node with different throttle settings applied around the ring (applied per node as sketched after this list):
> 1st node:
> * no throttle, stream/compaction throughput = 0
> 2nd node:
> * stream throughput = 200
> * compaction throughput = 256
> 3rd node:
> * stream throughput = 50
> * compaction throughput = 65
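> For reference, these throttles can be changed at runtime with nodetool rather than only via cassandra.yaml; a minimal sketch, assuming the standard 2.0-era nodetool commands (stream throughput is in megabits/s, compaction throughput in MB/s, and 0 disables the throttle):
> {code}
> # settings used for the 2nd node, applied to every node in the ring
> nodetool setstreamthroughput 200
> nodetool setcompactionthroughput 256
> # equivalent cassandra.yaml settings (take effect on restart):
> #   stream_throughput_outbound_megabits_per_sec: 200
> #   compaction_throughput_mb_per_sec: 256
> {code}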
> This is a graph of p95 write latencies (the ring was not taking reads), showing each node bootstrapping, left to right. The p95 latencies rise from roughly 200ms to ~500ms:
> http://snapshots.librato.com/instrument/5j9l3qiq-7462.png
> The write latencies appear to be largely CPU-driven, as shown by:
> http://snapshots.librato.com/instrument/xsfb688i-7463.png
> Network graphs show that the joining nodes follow approximately the same bandwidth pattern:
> http://snapshots.librato.com/instrument/ljvkvg6y-7464.png
> What are the expected performance behaviors during bootstrapping / ring expansion? The storage loads in this test were fairly small, so the spikes were short-lived; at a much larger production load we would have to sustain them for hours. The throttle controls did not seem to help, as far as we could tell.
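> One way to sanity-check whether a throttle is actually in effect is to watch stream progress on the joining node; a rough sketch, assuming nodetool netstats output similar to 2.0's (the exact format varies by version):
> {code}
> # shows the node's mode (JOINING while bootstrapping) and active streams
> nodetool netstats
> # sample the byte counters twice and diff them for a throughput estimate
> nodetool netstats | grep -i "Receiving"
> {code}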
> These are our current config changes:
> {code}
> -concurrent_reads: 32
> -concurrent_writes: 32
> +concurrent_reads: 64
> +concurrent_writes: 64
> -memtable_flush_queue_size: 4
> +memtable_flush_queue_size: 5
> -rpc_server_type: sync
> +rpc_server_type: hsha
> -#concurrent_compactors: 1
> +concurrent_compactors: 6
> -cross_node_timeout: false
> +cross_node_timeout: true
> -# phi_convict_threshold: 8
> +phi_convict_threshold: 12
> -endpoint_snitch: SimpleSnitch
> +endpoint_snitch: Ec2Snitch
> -internode_compression: all
> +internode_compression: none
> {code}
> Heap settings:
> {code}
> export MAX_HEAP_SIZE="10G"
> export HEAP_NEWSIZE="2G"
> {code}
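> For context, these variables are consumed by cassandra-env.sh; a sketch of how the stock 2.0 script maps them onto JVM flags (assuming the default GC settings are left in place):
> {code}
> # from cassandra-env.sh: fixed 10G heap, 2G young generation
> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"
> {code}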


