You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Aleksandr Sorokoumov (Jira)" <ji...@apache.org> on 2021/10/06 17:31:00 UTC
[jira] [Commented] (CASSANDRA-16334) Replica failure causes timeout on multi-DC write

    [ https://issues.apache.org/jira/browse/CASSANDRA-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425115#comment-17425115 ] 

Aleksandr Sorokoumov commented on CASSANDRA-16334:
--------------------------------------------------

This bug happens in the [AbstractWriteResponseHandler#onFailure|https://github.com/apache/cassandra/blob/2e2db4dc40c4935305b9a2d5d271580e96dabe42/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L252-L265]:

{code}
    @Override
    public void onFailure(InetAddressAndPort from, RequestFailureReason failureReason)
    {
        logger.trace("Got failure from {}", from);

        int n = waitingFor(from)
                ? failuresUpdater.incrementAndGet(this)
                : failures;

        failureReasonByEndpoint.put(from, failureReason);

        if (blockFor() + n > candidateReplicaCount())
            signal();
    }
{code}

In the reproduction steps, {{INSERT INTO TEST}} uses CL {{LOCAL_ONE}}. Accordingly, [DatacenterWriteResponseHandler#waitingFor|https://github.com/apache/cassandra/blob/2e2db4dc40c4935305b9a2d5d271580e96dabe42/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java#L59-L63] only waits for the local nodes:

{code}
private final Predicate<InetAddressAndPort> waitingFor = InOurDcTester.endpoints();
....
@Override
protected boolean waitingFor(InetAddressAndPort from)
{
    return waitingFor.test(from);
}
{code}

[AbstractWriteResponseHandler#candidateReplicaCount()|https://github.com/apache/cassandra/blob/2e2db4dc40c4935305b9a2d5d271580e96dabe42/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L205-L213] in the condition above, however, counts live and down replicas in ALL DCs as valid candidates:

{code}
protected int candidateReplicaCount()
{
    return replicaPlan.liveAndDown().size();
}
{code}


As a result, even after all local nodes respond with {{FAILURE_RSP}}, the coordinator waits for responses from nodes in other DCs... but never counts them in.


There is more! Following the timeout or request failure, the coordinator creates hints for the nodes in other DCs which it will try to deliver forever.

> Replica failure causes timeout on multi-DC write
> ------------------------------------------------
>
>                 Key: CASSANDRA-16334
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16334
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Messaging/Internode
>            Reporter: Paulo Motta
>            Assignee: Aleksandr Sorokoumov
>            Priority: Normal
>
> Inserting a mutation larger than {{max_mutation_size_in_kb}} correctly throws a write error on a single DC keyspace with RF=3:
> {noformat}
> cassandra.WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed - received 0 responses and 3 failures: UNKNOWN from /127.0.0.3:7000, UNKNOWN from /127.0.0.2:7000, UNKNOWN from /127.0.0.1:7000" info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 3}
> {noformat}
> The same insert wrongly causes a timeout on a keyspace with 2 dcs (RF=3 each):
> {noformat}
> cassandra.WriteTimeout: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
> {noformat}
> Reproduction steps:
> {noformat}
> # Setup cluster
> ccm create -n 3:3 test
> for i in {1..6}; do echo 'max_mutation_size_in_kb: 1000' >> ~/.ccm/test/node$i/conf/cassandra.yaml; done
> ccm start
> # Create schema
> ccm node1 cqlsh
> CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};
> CREATE TABLE test.test (key int PRIMARY KEY, val blob);
> exit;
> # Insert data
> python
> from cassandra.cluster import Cluster
> session = cluster.connect('test')
> blob = f = open("2mbBlob", "rb").read().hex()
> session.execute("INSERT INTO test (key, val) VALUES (1, textAsBlob('" + blob + "'))")
> {noformat}
> Reproduced in 3.11, trunk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org