Posted to commits@cassandra.apache.org by "Simon Zhou (JIRA)" <ji...@apache.org> on 2017/06/02 05:44:05 UTC

[jira] [Comment Edited] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load

    [ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034112#comment-16034112 ] 

Simon Zhou edited comment on CASSANDRA-6908 at 6/2/17 5:43 AM:
---------------------------------------------------------------

We hit a similar issue, so I worked out a simple patch (attached) to decouple the scores for iowait and sampled read latency. From my observation, there are two issues:
1. The iowait score of a single node changes frequently, and the gaps between the scores of different nodes are usually far beyond the default 1.1 threshold (the comparison is sketched below).
2. The (median) latency scores don't vary much, but the differences can still exceed 1.1x. Also, some nodes in the local datacenter have latency scores of 0. I understand that nodes in the remote datacenter may not have latency data since local_quorum or local_one is being used; the remote-datacenter case has actually been fixed in CASSANDRA-13074 (we're running 3.0.13).
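
For context on how that 1.1 factor bites: as far as I can tell from the 3.0 code, the dynamic snitch only abandons the subsnitch ordering when some replica's score exceeds the first replica's score by more than (1 + dynamic_snitch_badness_threshold). A simplified sketch of that check (names are mine, not a verbatim excerpt):
{code}
import java.net.InetAddress;
import java.util.List;
import java.util.Map;

public class BadnessCheckSketch
{
    // dynamic_snitch_badness_threshold, 0.1 by default, i.e. a 1.1x factor
    private static final double BADNESS_THRESHOLD = 0.1;

    /**
     * True if any replica's score is more than (1 + threshold) times the score
     * of the subsnitch-preferred (first) replica, in which case the dynamic
     * snitch re-orders replicas purely by score.
     */
    public static boolean shouldSortByScore(List<InetAddress> subsnitchOrder,
                                            Map<InetAddress, Double> scores)
    {
        Double first = scores.get(subsnitchOrder.get(0));
        if (first == null)
            return false;

        for (InetAddress replica : subsnitchOrder)
        {
            Double next = scores.get(replica);
            if (next != null && next > first * (1.0 + BADNESS_THRESHOLD))
                return true;
        }
        return false;
    }
}
{code}
With the IOWait scores in the dump below spread roughly between 1 and 7, that ratio check trips almost constantly once iowait is folded into the combined score.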

Here are the numbers I got (formatted) from a two-datacenter cluster (10 nodes in each datacenter), with my patch applied. The IP addresses have been obfuscated.

{code}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:36 +0000 org.archive.jmx.Client LatencyScores: {
/node1=0.7832167832167832
/node2=0.0
/node3=1.0
/node4=0.0
/node5=0.0
/node6=0.43356643356643354
/node7=0.4825174825174825
/node8=0.0
/node9=0.8881118881118881
/node10=0.0
/node11=0.9440559440559441
/node12=0.0
/node13=0.0
/node14=0.0
/node15=0.0
/node16=0.0}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:45 +0000 org.archive.jmx.Client LatencyScores: {
/node1=0.0
/node2=1.0
/node3=0.0
/node4=0.0
/node5=0.43356643356643354
/node6=0.4825174825174825
/node7=0.0
/node8=0.8881118881118881
/node9=0.0
/node10=0.9440559440559441
/node11=0.0
/node12=0.0
/node13=0.0
/node15=0.0
/node16=0.0
/node17=0.7832167832167832
}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch IOWaitScores
06/01/2017 23:30:54 +0000 org.archive.jmx.Client IOWaitScores: {
/node1=5.084033489227295
/node2=4.024896621704102
/node3=4.54736852645874
/node4=4.947588920593262
/node5=3.4599156379699707
/node6=4.0653815269470215
/node7=6.989473819732666
/node8=3.371259927749634
/node9=5.800169467926025
/node10=3.2855939865112305
/node11=5.631399154663086
/node12=5.484004974365234
/node13=0.9635525941848755
/node14=1.5043878555297852
/node15=6.481481552124023
/node16=3.751563310623169}
{code}
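
(If it's useful, the same attributes can be polled with plain JMX instead of the cmdline-jmxclient jar. A minimal sketch, assuming the object name and attribute names shown above and an unauthenticated JMX port on 7199; the class name is mine.)
{code}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SnitchScoreDump
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName snitch =
                new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch");

            // Attribute names as exposed with the attached patch applied.
            for (String attribute : new String[]{ "LatencyScores", "IOWaitScores" })
            {
                Object scores = mbs.getAttribute(snitch, attribute);
                System.out.println(attribute + ": " + scores);
            }
        }
    }
}
{code}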

Yes, we can work around the issue by increasing dynamic_snitch_badness_threshold, but the problems are:
1. The default threshold doesn't work well.
2. iowait (a percentage) is not a good measure of end-to-end latency: not only does it change from second to second, it is also a low-level metric that doesn't reflect the whole picture, which should also include GC/safepoint pauses, thread scheduling delays, etc.
3. Instead of using the median read latency, could we use p95 latency as a better factor when calculating scores (sketched below)? I haven't experimented with this yet.
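
To make point 3 concrete, here is a rough sketch of where the percentile choice would plug in. It uses the Dropwizard/Codahale reservoir and snapshot classes (which is what the snitch already samples latencies with, if I read the code right) and normalizes by the worst value, similar to how the scores in the dump above look. The class and method names are mine, not from the codebase; this is only an illustration, not the eventual patch.
{code}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Snapshot;

public class P95ScoreSketch
{
    // One decaying reservoir of read latencies (ms) per replica,
    // mirroring how the snitch samples per-host latency.
    private final Map<InetAddress, ExponentiallyDecayingReservoir> samples = new HashMap<>();

    public void receiveTiming(InetAddress host, long latencyMillis)
    {
        samples.computeIfAbsent(host, h -> new ExponentiallyDecayingReservoir())
               .update(latencyMillis);
    }

    /** Scores in [0, 1], normalized by the worst p95 seen instead of the median. */
    public Map<InetAddress, Double> scores()
    {
        Map<InetAddress, Double> raw = new HashMap<>();
        double max = 0.0;
        for (Map.Entry<InetAddress, ExponentiallyDecayingReservoir> e : samples.entrySet())
        {
            Snapshot snapshot = e.getValue().getSnapshot();
            double p95 = snapshot.get95thPercentile();   // instead of snapshot.getMedian()
            raw.put(e.getKey(), p95);
            max = Math.max(max, p95);
        }

        Map<InetAddress, Double> normalized = new HashMap<>();
        for (Map.Entry<InetAddress, Double> e : raw.entrySet())
            normalized.put(e.getKey(), max > 0 ? e.getValue() / max : 0.0);
        return normalized;
    }
}
{code}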

[~brandon.williams] what do you think? [~kohlisankalp] It looks like we have some fixes (or improvements?) in 4.0, but you mentioned in a meeting that DES could still be improved. I'd also like to get your ideas on this. I can work on it if we can agree on an approach.


was (Author: szhou):
We hit a similar issue, so I worked out a simple patch (attached) to decouple the scores for iowait and sampled read latency. From my observation, there are two issues:
1. The iowait score of a single node changes frequently, and the gaps between the scores of different nodes are usually far beyond the default 1.1 threshold.
2. The (median) latency scores don't vary much, however some nodes have latency scores of 0, even with the fix for CASSANDRA-13074 (we're running 3.0.13).

Here are the numbers I got (formatted) with my attached patch:
{code}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:36 +0000 org.archive.jmx.Client LatencyScores: {
/node1=0.7832167832167832
/node2=0.0
/node3=1.0
/node4=0.0
/node5=0.0
/node6=0.43356643356643354
/node7=0.4825174825174825
/node8=0.0
/node9=0.8881118881118881
/node10=0.0
/node11=0.9440559440559441
/node12=0.0
/node13=0.0
/node14=0.0
/node15=0.0
/node16=0.0}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:45 +0000 org.archive.jmx.Client LatencyScores: {/10.165.10.5=0.7832167832167832
/node1=0.0
/node2=1.0
/node3=0.0
/node4=0.0
/node5=0.43356643356643354
/node6=0.4825174825174825
/node7=0.0
/node8=0.8881118881118881
/node9=0.0
/node10=0.9440559440559441
/node11=0.0
/node12=0.0
/node13=0.0
/node15=0.0
/node16=0.0}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch IOWaitScores
06/01/2017 23:30:54 +0000 org.archive.jmx.Client IOWaitScores: {
/node1=5.084033489227295
/node2=4.024896621704102
/node3=4.54736852645874
/node4=4.947588920593262
/node5=3.4599156379699707
/node6=4.0653815269470215
/node7=6.989473819732666
/node8=3.371259927749634
/node9=5.800169467926025
/node10=3.2855939865112305
/node11=5.631399154663086
/node12=5.484004974365234
/node13=0.9635525941848755
/node14=1.5043878555297852
/node15=6.481481552124023
/node16=3.751563310623169}
{code}

Yes, we can work around the issue by increasing dynamic_snitch_badness_threshold, but the problems are:
1. The default threshold doesn't work well.
2. iowait (a percentage) is not a good measure of end-to-end latency: not only does it change from second to second, it is also a low-level metric that doesn't reflect the whole picture, which should also include GC/safepoint pauses, thread scheduling delays, etc.
3. Instead of using the median read latency, could we use p95 latency as a better factor when calculating scores? I haven't experimented with this yet.

[~brandon.williams] what do you think? [~kohlisankalp] It looks like we have some fixes (or improvements?) in 4.0, but you mentioned in a meeting that DES could still be improved. I'd also like to get your ideas on this. I can work on it if we can agree on an approach.

> Dynamic endpoint snitch destabilizes cluster under heavy load
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-6908
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Bartłomiej Romański
>            Assignee: Brandon Williams
>         Attachments: 0001-Decouple-IO-scores-and-latency-scores-from-DynamicEn.patch, as-dynamic-snitch-disabled.png
>
>
> We observe that with dynamic snitch disabled our cluster is much more stable than with dynamic snitch enabled.
> We've got a 15 nodes cluster with pretty strong machines (2xE5-2620, 64 GB RAM, 2x480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It automatically directs read queries to one of the nodes responsible for the given token.
> In that case, with the dynamic snitch disabled, Cassandra always handles reads locally. With the dynamic snitch enabled, Cassandra very often decides to proxy the read to some other node. This causes much higher CPU usage and produces much more garbage, which results in more frequent GC pauses (the young generation fills up quicker). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve that issue. The default value is 0.1. I've looked at the scores exposed in JMX and the problem is that our values seem to be completely random. They are usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something like that, but the result will be similar to simply disabling the dynamic snitch altogether (that's what we did).
> I've tried to understand the logic behind these scores and I'm not sure I get the idea...
> It's a sum (without any multipliers) of two components:
> - the ratio of the given node's recent latency to the recent average node latency
> - something called 'severity', which, if I analyzed the code correctly, is the result of BackgroundActivityMonitor.getIOWait() - the ratio of "iowait" CPU time to total CPU time as reported in /proc/stat (the ratio is multiplied by 100)
> In our case the second value is something around 0-2% but varies quite heavily every second.
> What's the idea behind simply adding these two values without any multipliers (e.g. the second one is a percentage while the first one is not)? Are we sure this is the best possible way of calculating the final score?
> Is there a way to force Cassandra to use (much) longer samples? In our case we probably need that to get stable values. The 'severity' is calculated every second. The mean latency is calculated from some magic, hardcoded values (ALPHA = 0.75, WINDOW_SIZE = 100).
> Am I right that there's no way to tune that without hacking the code?
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the config file, but that only determines how often the scores are recalculated, not how long samples are taken. Is that correct?
> To sum up, it would be really nice to have more control over the dynamic snitch behavior, or at least to have the official option to disable it described in the default config file (it took me some time to discover that we can just disable it instead of hacking around with dynamic_snitch_badness_threshold=1000).
> Currently, for some scenarios (like ours: optimized cluster, token-aware client, heavy load) it causes more harm than good.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org