Posted to commits@cassandra.apache.org by "Ryan McGuire (JIRA)" <ji...@apache.org> on 2014/07/18 02:39:04 UTC

[jira] [Comment Edited] (CASSANDRA-7567) when the commit_log disk for a single node is overwhelmed the entire cluster slows down

    [ https://issues.apache.org/jira/browse/CASSANDRA-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065819#comment-14065819 ] 

Ryan McGuire edited comment on CASSANDRA-7567 at 7/18/14 12:37 AM:
-------------------------------------------------------------------

I have reproduced this on a 10 node m3.xlarge cluster on EC2.

 * Run stress against nodes 1-5
 * On node 9, run the dd command from the issue description.

With RF=1:
 * Watch stress completely time out.

With RF=3:
 * Observe intermittent 12s latency

However, the intermittent latency subsides, which leads me to believe that the cluster is polling that node and backing off progressively as it sees that it's unavailable.
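
For anyone repeating the dd step, here is a rough sketch of a bounded version of the load generator. It uses the same dd invocation as the issue description, with count= added so that it stops on its own. The function name saturate_disk, the scratch file name, and the default block count are my own assumptions, not part of the original report. The target directory must be on the same disk as the node's commit log; where that is depends on your setup.

```shell
#!/bin/sh
# Sketch only: generate write load on one disk, bounded by a block count.
# saturate_disk and dd_load.<pid> are hypothetical names for illustration.

saturate_disk() {
    target_dir=$1          # directory on the disk to load (assumed: the commit_log disk)
    blocks=${2:-1000000}   # number of 1 KiB blocks to write (about 1 GB by default)
    scratch="$target_dir/dd_load.$$"

    # Same dd invocation as in the issue, plus count= so the run is bounded.
    dd if=/dev/zero of="$scratch" bs=1024 count="$blocks" 2>/dev/null
    status=$?

    rm -f "$scratch"       # remove the scratch file so the disk does not fill up
    return "$status"
}
```

Example use on the node under test: saturate_disk /path/on/commit_log/disk 5000000. Note that the original command writes to /tmp/foo, which only loads the commit_log disk if /tmp lives on that disk.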


was (Author: enigmacurry):
I have reproduced this on a 10 node m3.xlarge cluster on EC2.

 * Run stress against nodes 1-5
 * On node 9, run the dd command.
 * Watch stress completely time out.




> when the commit_log disk for a single node is overwhelmed the entire cluster slows down
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7567
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7567
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: debian 7.5, bare metal, 14 nodes, 64CPUs, 64GB RAM, commit_log disk sata, data disk SSD, vnodes, leveled compaction strategy
>            Reporter: David O'Dell
>         Attachments: write_request_latency.png
>
>
> We've run into a situation where a single node out of 14 is experiencing high disk I/O. This can happen when a node is being decommissioned, or after it joins the ring and runs into the bug CASSANDRA-6621.
> When this occurs, the write latency for the entire cluster spikes from 0.3ms to 170ms.
> To simulate this, simply run dd on the commit_log disk (dd if=/dev/zero of=/tmp/foo bs=1024) and you will see that all nodes in the cluster slow down instantly.
> BTW, overwhelming the data disk does not have the same effect.
> I've also tried this with the client not connecting directly to the overwhelmed node, and it still has the same effect.


