Posted to commits@cassandra.apache.org by "C. Scott Andreas (JIRA)" <ji...@apache.org> on 2018/11/17 17:40:00 UTC

[jira] [Updated] (CASSANDRA-13261) Improve speculative retry to avoid being overloaded

     [ https://issues.apache.org/jira/browse/CASSANDRA-13261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

C. Scott Andreas updated CASSANDRA-13261:
-----------------------------------------
    Component/s: Coordination

> Improve speculative retry to avoid being overloaded
> ---------------------------------------------------
>
>                 Key: CASSANDRA-13261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>            Reporter: Simon Zhou
>            Assignee: Simon Zhou
>            Priority: Major
>         Attachments: CASSANDRA-13261-v1.patch
>
>
> In CASSANDRA-13009, it was suggested that I split out the second part of my patch as a separate improvement.
> This is to avoid Cassandra being overloaded when the CUSTOM speculative retry parameter is used. Steps to reason about/reproduce this on 3.0.10:
> 1. Use custom speculative retry threshold like this:
> cqlsh> ALTER TABLE to_repair1.users0 WITH speculative_retry = '10ms';
> 2. SpeculatingReadExecutor will be used, according to this piece of code in AbstractReadExecutor:
> {code}
>         if (retry.equals(SpeculativeRetryParam.ALWAYS))
>             return new AlwaysSpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
>         else // PERCENTILE or CUSTOM.
>             return new SpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
> {code}
> 3. With RF=3 and LOCAL_QUORUM, the code below (from SpeculatingReadExecutor#maybeTryAdditionalReplicas) won't protect Cassandra from being overloaded, even though the inline comment suggests that intent:
> {code}
>             // no latency information, or we're overloaded
>             if (cfs.sampleLatencyNanos > TimeUnit.MILLISECONDS.toNanos(command.getTimeout()))
>                 return;
> {code}
> The reason is that cfs.sampleLatencyNanos is assigned retryPolicy.threshold() (the 10ms from step #1 above) at line 405 of ColumnFamilyStore, while command.getTimeout() is usually the default 5000ms, so the comparison above almost never holds.
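> Plugging in the numbers makes the mismatch concrete (a standalone check, not Cassandra code):
> {code}
> import java.util.concurrent.TimeUnit;
>
> public class GuardCheck
> {
>     public static void main(String[] args)
>     {
>         long sampleLatencyNanos = TimeUnit.MILLISECONDS.toNanos(10);   // retryPolicy.threshold() = 10ms
>         long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(5000);       // default read timeout
>         // Prints false: 10ms is never greater than 5000ms, so the early return
>         // never fires for any custom threshold below the read timeout.
>         System.out.println(sampleLatencyNanos > timeoutNanos);
>     }
> }
> {code}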
> As the name suggests, sampleLatencyNanos should hold the sampled latency, not a statically configured value. My proposal (a sketch follows the list):
> a. Introduce an option, -Dcassandra.overload.threshold, to allow customizing the overload threshold. The default would be DatabaseDescriptor.getRangeRpcTimeout().
> b. Assign the sampled P99 latency to cfs.sampleLatencyNanos. For overload detection, simply compare cfs.sampleLatencyNanos against the customizable threshold above.
> c. Use retryDelayNanos (instead of cfs.sampleLatencyNanos) as the wait time before retrying (see line 282 of AbstractReadExecutor); this is the value from the table setting (PERCENTILE or CUSTOM).
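> A minimal sketch of the proposed flow. Names like OVERLOAD_THRESHOLD_NANOS and sendExtraReadRequest, and the property wiring, are hypothetical; this is illustrative, not the final patch:
> {code}
> import java.util.concurrent.TimeUnit;
>
> public class SpeculationSketch
> {
>     // (a) Hypothetical -Dcassandra.overload.threshold in milliseconds; the 10000ms
>     // fallback stands in for DatabaseDescriptor.getRangeRpcTimeout().
>     static final long OVERLOAD_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(
>             Long.getLong("cassandra.overload.threshold", 10000L));
>
>     // (b) Refreshed periodically from the latency sampler (a true P99), not
>     // copied from the table's retry policy.
>     volatile long sampleLatencyNanos;
>
>     // (c) The table's PERCENTILE/CUSTOM value, used only as the wait before speculating.
>     volatile long retryDelayNanos;
>
>     void maybeTryAdditionalReplicas() throws InterruptedException
>     {
>         // Overload detection: sampled latency vs. the customizable threshold.
>         if (sampleLatencyNanos > OVERLOAD_THRESHOLD_NANOS)
>             return; // overloaded, or no latency information yet; skip speculation
>
>         // Wait per the table setting, then send the speculative read.
>         TimeUnit.NANOSECONDS.sleep(retryDelayNanos);
>         sendExtraReadRequest();
>     }
>
>     void sendExtraReadRequest() {} // hypothetical placeholder for the extra read
> }
> {code}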



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org