Posted to commits@cassandra.apache.org by "Simon Zhou (JIRA)" <ji...@apache.org> on 2017/02/23 23:26:44 UTC
[jira] [Created] (CASSANDRA-13261) Improve speculative retry to avoid being overloaded
Simon Zhou created CASSANDRA-13261:
--------------------------------------
Summary: Improve speculative retry to avoid being overloaded
Key: CASSANDRA-13261
URL: https://issues.apache.org/jira/browse/CASSANDRA-13261
Project: Cassandra
Issue Type: Improvement
Reporter: Simon Zhou
Assignee: Simon Zhou
In CASSANDRA-13009, it was suggested that I split out the second part of my patch as a separate improvement.
The goal is to keep Cassandra from being overloaded when a CUSTOM speculative retry parameter is used. Steps to reason about and reproduce this with 3.0.10:
1. Set a custom speculative retry threshold like this:
cqlsh> alter TABLE to_repair1.users0 with speculative_retry='10ms';
2. SpeculatingReadExecutor will be used, according to this piece of code in AbstractReadExecutor:
{code}
if (retry.equals(SpeculativeRetryParam.ALWAYS))
    return new AlwaysSpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
else // PERCENTILE or CUSTOM.
    return new SpeculatingReadExecutor(keyspace, cfs, command, consistencyLevel, targetReplicas);
{code}
3. With RF=3 and LOCAL_QUORUM, the code below (from SpeculatingReadExecutor#maybeTryAdditionalReplicas) does not protect Cassandra from being overloaded, even though the inline comment suggests that intent:
{code}
// no latency information, or we're overloaded
if (cfs.sampleLatencyNanos > TimeUnit.MILLISECONDS.toNanos(command.getTimeout()))
    return;
{code}
The reason is that cfs.sampleLatencyNanos is assigned retryPolicy.threshold() (at line 405 of ColumnFamilyStore), which is the 10ms configured in step 1 above. The timeout, however, is usually the default of 5000ms, so the condition is never true and the early return is never taken.
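To make the arithmetic concrete, here is a minimal standalone sketch of that comparison using the values from the steps above. It is not Cassandra code: the class OverloadGuardDemo and the method isOverloaded are hypothetical names that only mirror the shape of the check.

```java
import java.util.concurrent.TimeUnit;

// Standalone illustration only. OverloadGuardDemo and isOverloaded are
// hypothetical names and do not exist in Cassandra.
class OverloadGuardDemo
{
    // Same shape as the check in SpeculatingReadExecutor#maybeTryAdditionalReplicas:
    // a latency in nanoseconds compared against a timeout given in milliseconds.
    static boolean isOverloaded(long sampleLatencyNanos, long timeoutMillis)
    {
        return sampleLatencyNanos > TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    public static void main(String[] args)
    {
        // speculative_retry='10ms' ends up in cfs.sampleLatencyNanos.
        long sampleLatencyNanos = TimeUnit.MILLISECONDS.toNanos(10);
        // Default timeout mentioned above.
        long timeoutMillis = 5000;

        // 10,000,000 ns is never greater than 5,000,000,000 ns, so this prints false.
        System.out.println(isOverloaded(sampleLatencyNanos, timeoutMillis));
    }
}
```

With a statically configured 10ms value, the check could only fire if the timeout were set below 10ms.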
As the name suggests, sampleLatencyNanos should hold a sampled latency, not a statically configured value. My proposal:
a. Introduce an option, -Dcassandra.overload.threshold, to make the overload threshold customizable. The default threshold would be DatabaseDescriptor.getRangeRpcTimeout().
b. Assign the sampled P99 latency to cfs.sampleLatencyNanos. For overload detection, simply compare cfs.sampleLatencyNanos with the customizable threshold above.
c. Use retryDelayNanos (instead of cfs.sampleLatencyNanos) as the wait time before a retry (see line 282 of AbstractReadExecutor). This value comes from the table setting (PERCENTILE or CUSTOM).
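A rough sketch of how points a to c might fit together is shown below. This is an illustration under assumptions, not the actual patch: the class SpeculationSketch and its methods are hypothetical, the property value is assumed to be in milliseconds, and the default passed in merely stands in for DatabaseDescriptor.getRangeRpcTimeout().

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of proposals (a) to (c). SpeculationSketch and its
// methods are illustrative names and are not part of Cassandra.
class SpeculationSketch
{
    // (a) Overload threshold, overridable through the proposed
    // -Dcassandra.overload.threshold option (assumed here to be in milliseconds).
    // defaultTimeoutMillis stands in for DatabaseDescriptor.getRangeRpcTimeout().
    static long overloadThresholdNanos(long defaultTimeoutMillis)
    {
        long millis = Long.getLong("cassandra.overload.threshold", defaultTimeoutMillis);
        return TimeUnit.MILLISECONDS.toNanos(millis);
    }

    // (b) Overload detection compares the sampled P99 latency with the threshold.
    static boolean isOverloaded(long sampledP99Nanos, long thresholdNanos)
    {
        return sampledP99Nanos > thresholdNanos;
    }

    // (c) The wait before a speculative retry comes from the table setting
    // (PERCENTILE or CUSTOM) and is kept separate from the overload check.
    static boolean shouldSpeculate(long sampledP99Nanos, long thresholdNanos,
                                   long elapsedNanos, long retryDelayNanos)
    {
        if (isOverloaded(sampledP99Nanos, thresholdNanos))
            return false;
        return elapsedNanos >= retryDelayNanos;
    }

    public static void main(String[] args)
    {
        long threshold = overloadThresholdNanos(10000);
        long retryDelay = TimeUnit.MILLISECONDS.toNanos(10); // speculative_retry='10ms'
        long p99 = TimeUnit.MILLISECONDS.toNanos(20);        // sampled, not configured
        System.out.println(shouldSpeculate(p99, threshold, retryDelay, retryDelay));
    }
}
```

Keeping the sampled latency and the configured retry delay in separate variables is what lets the overload check react to real load while the table setting still controls when a retry is sent.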