Posted to commits@cassandra.apache.org by "C. Scott Andreas (JIRA)" <ji...@apache.org> on 2018/11/17 17:40:00 UTC
[jira] [Updated] (CASSANDRA-13261) Improve speculative retry to avoid being overloaded
[ https://issues.apache.org/jira/browse/CASSANDRA-13261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-13261:
-----------------------------------------
Component/s: Coordination
> Improve speculative retry to avoid being overloaded
> ---------------------------------------------------
>
> Key: CASSANDRA-13261
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13261
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Reporter: Simon Zhou
> Assignee: Simon Zhou
> Priority: Major
> Attachments: CASSANDRA-13261-v1.patch
>
>
> In CASSANDRA-13009, it was suggested that I split the second part of my patch out as a separate improvement.
> The goal is to prevent Cassandra from being overloaded when the CUSTOM speculative retry parameter is used. Steps to reason about / reproduce this with 3.0.10:
> 1. Use custom speculative retry threshold like this:
> {code}
> cqlsh> ALTER TABLE to_repair1.users0 WITH speculative_retry = '10ms';
> {code}
> 2. SpeculatingReadExecutor will be used, according to this piece of code in AbstractReadExecutor:
> {code}
> if (retry.equals(SpeculativeRetryParam.ALWAYS))
>     return new AlwaysSpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
> else // PERCENTILE or CUSTOM.
>     return new SpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
> {code}
> 3. When RF=3 and LOCAL_QUORUM are used, the code below (from SpeculatingReadExecutor#maybeTryAdditionalReplicas) cannot protect Cassandra from being overloaded, even though the inline comment states that intention:
> {code}
> // no latency information, or we're overloaded
> if (cfs.sampleLatencyNanos > TimeUnit.MILLISECONDS.toNanos(command.getTimeout()))
>     return;
> {code}
> The reason is that cfs.sampleLatencyNanos is assigned the value of retryPolicy.threshold() (the 10ms from step #1) at line 405 of ColumnFamilyStore, while the timeout is usually the default 5000ms, so the condition above can never be true.
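> The failing comparison can be shown with a few lines of arithmetic (the class and method names below are illustrative, not actual Cassandra code):

```java
import java.util.concurrent.TimeUnit;

public class OverloadGuardDemo {
    // Mirrors the guard above: skip speculation (return early) only when the
    // "sampled" latency exceeds the command timeout.
    static boolean guardFires(long sampleLatencyNanos, long timeoutNanos) {
        return sampleLatencyNanos > timeoutNanos;
    }

    public static void main(String[] args) {
        long sampleLatencyNanos = TimeUnit.MILLISECONDS.toNanos(10);   // CUSTOM threshold '10ms'
        long timeoutNanos       = TimeUnit.MILLISECONDS.toNanos(5000); // default read timeout
        // 10ms can never exceed 5000ms, so the "overloaded" early return never happens.
        System.out.println(guardFires(sampleLatencyNanos, timeoutNanos)); // prints "false"
    }
}
```

> With a 10ms CUSTOM threshold, the guard can only fire if the threshold itself exceeds the 5000ms timeout, which defeats its purpose as overload protection.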
> As the name suggests, sampleLatencyNanos should hold a sampled latency, not a value configured "statically". My proposal:
> a. Introduce an option -Dcassandra.overload.threshold to allow customizing the overload threshold. The default would be DatabaseDescriptor.getRangeRpcTimeout().
> b. Assign the sampled p99 latency to cfs.sampleLatencyNanos. For overload detection, compare cfs.sampleLatencyNanos against the customizable threshold above.
> c. Use retryDelayNanos (instead of cfs.sampleLatencyNanos) as the waiting time before a retry (see line 282 of AbstractReadExecutor). This value comes from the table setting (PERCENTILE or CUSTOM).
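> A minimal sketch of how a/b/c could fit together (class, field, and default values below are illustrative assumptions, not the attached patch):

```java
import java.util.concurrent.TimeUnit;

public class SpeculativeRetrySketch {
    // a. proposed -Dcassandra.overload.threshold (milliseconds), defaulting here
    //    to 10000ms as a stand-in for DatabaseDescriptor.getRangeRpcTimeout()
    static final long overloadThresholdNanos =
        TimeUnit.MILLISECONDS.toNanos(Long.getLong("cassandra.overload.threshold", 10000L));

    static volatile long sampleLatencyNanos; // b. updated from the sampled p99 read latency
    static volatile long retryDelayNanos;    // c. from the table's PERCENTILE or CUSTOM setting

    static boolean overloaded() {
        // b. overload detection compares the *sampled* latency with the threshold
        return sampleLatencyNanos > overloadThresholdNanos;
    }

    static void maybeTryAdditionalReplicas(Runnable speculate) throws InterruptedException {
        if (sampleLatencyNanos == 0 || overloaded())
            return; // no latency information, or we're overloaded
        // c. wait retryDelayNanos (not sampleLatencyNanos) before speculating
        TimeUnit.NANOSECONDS.sleep(retryDelayNanos);
        speculate.run();
    }
}
```

> The key point is that overload detection and the speculation delay use different inputs: the sampled latency versus a configurable threshold for the former, and the table's retry setting for the latter.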
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)