Posted to commits@cassandra.apache.org by "Tom van der Woerdt (JIRA)" <ji...@apache.org> on 2019/07/22 18:58:00 UTC

[jira] [Created] (CASSANDRA-15243) removenode can cause QUORUM write queries to fail

Tom van der Woerdt created CASSANDRA-15243:
----------------------------------------------

             Summary: removenode can cause QUORUM write queries to fail
                 Key: CASSANDRA-15243
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15243
             Project: Cassandra
          Issue Type: Bug
          Components: Consistency/Coordination
            Reporter: Tom van der Woerdt


It looks like nobody has found this yet ([google|https://www.google.com/search?q="cassandra"+"removenode"+"quorum"+site%3Aissues.apache.org]), so this may be a ticking time bomb for some... :(

This happened to me earlier today. On a Cassandra 3.11.4 cluster with three DCs, one DC had three servers fail due to unexpected external circumstances. The keyspace used NetworkTopologyStrategy with a replication factor of 2 in each DC (2:2:2).

Cassandra dealt with the failures just fine - great! However, the servers failed in a way that made bringing them back impossible, so I tried to remove them using 'removenode'.

Suddenly, the application started experiencing a large number of QUORUM write timeouts. My first reflex was to lower the streaming throughput and compaction throughput, since timeouts indicated some overload was happening. No luck, though.

I tried a bunch of other things to reroute queries away from the affected datacenter, like changing the Severity field on the dynamic snitch. Still, no luck.

After a while I noticed one strange thing: the WriteTimeoutException said that five replicas were required, instead of the four you would expect for a QUORUM write at 2:2:2 (6/2 + 1 = 4). I shrugged it off as some weird inconsistency, probably caused by the use of batches.

Skipping ahead a bit: since nothing I did was working, I decided to let the streams run again and just wait the issue out, hoping that letting them finish would resolve the apparent overload. Magically, as soon as the streams finished, the errors stopped.

----

There are two issues here, both in AbstractWriteResponseHandler.java.

h3. Cassandra sometimes waits for too many replicas on writes

In [totalBlockFor|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L124] Cassandra *always* includes pending nodes in `blockFor`. For a QUORUM write at 2:2:2 replication, a quorum of 4 plus one pending replica gives a `blockFor` of 5. If the pending replica is down as well (as can happen when removenode is used and not all destination hosts are up) and two natural replicas in one DC are also down, only 4 of the 5 required hosts are available, and QUORUM writes can never succeed.
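
To make the arithmetic concrete, here is a minimal sketch of that accounting. This is *not* the real Cassandra code; the names and method bodies are simplified illustrations of the linked totalBlockFor logic:

{code:java}
// Minimal sketch, not the actual Cassandra implementation: quorumFor()
// mirrors the usual QUORUM math, and the pending addition mirrors the
// behavior of totalBlockFor() described above.
public class BlockForSketch
{
    // QUORUM blocks for floor(totalRf / 2) + 1 responses
    static int quorumFor(int totalRf)
    {
        return (totalRf / 2) + 1;
    }

    public static void main(String[] args)
    {
        int totalRf = 2 + 2 + 2;   // NTS 2:2:2 across three DCs
        int pending = 1;           // one pending replica from removenode

        // pending endpoints are always added on top of the quorum
        int blockFor = quorumFor(totalRf) + pending;   // 4 + 1 = 5

        int liveNaturalReplicas = 4;   // two natural replicas in one DC are down
        int livePendingReplicas = 0;   // the pending replica is down too

        System.out.println("blockFor = " + blockFor);
        System.out.println("replicas that can ever respond = "
                           + (liveNaturalReplicas + livePendingReplicas));
        // 5 required, 4 available: every QUORUM write times out.
    }
}
{code}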

h3. UnavailableException not thrown

While debugging this, I spent all my time treating the issue as a timeout. However, Cassandra was executing queries that could never succeed, because insufficient hosts were available; throwing an UnavailableException would have been much more helpful. The issue is caused by [assureSufficientLiveNodes|https://github.com/apache/cassandra/blob/71cb0616b7710366a8cd364348c864d656dc5542/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java#L155], which merely concatenates the lists of live nodes and doesn't consider the special-case behavior of a pending node that's down.
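
For illustration, a rough sketch of why the check can pass anyway. It assumes, per my reading of the linked code, that the liveness check compares the concatenated live endpoints against the plain quorum count, while the response handler waits for quorum plus pending; the class and method bodies here are simplified stand-ins, not the actual implementation:

{code:java}
import java.util.List;

// Simplified illustration of the mismatch; not the actual Cassandra classes.
public class AvailabilityCheckSketch
{
    static final int QUORUM = 4;   // QUORUM over total RF 6

    // The check: live natural and pending endpoints are simply concatenated,
    // so a live natural replica can "cover for" a dead pending one.
    static void assureSufficientLiveNodes(List<String> liveNatural,
                                          List<String> livePending)
    {
        int live = liveNatural.size() + livePending.size();
        if (live < QUORUM)   // compares against 4, not the 5 we will wait for
            throw new IllegalStateException(
                "UnavailableException: " + live + " live, " + QUORUM + " required");
    }

    public static void main(String[] args)
    {
        List<String> liveNatural = List.of("n1", "n2", "n3", "n4"); // 4 of 6 up
        List<String> livePending = List.of();                       // pending node down

        assureSufficientLiveNodes(liveNatural, livePending); // passes: 4 >= 4

        int blockFor = QUORUM + 1; // ...but the handler blocks for quorum + pending
        System.out.println("check passed, yet blockFor = " + blockFor
                           + " and only 4 replicas can ever respond");
    }
}
{code}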


