Posted to user@cassandra.apache.org by Steven Levitt <st...@knewton.com> on 2016/06/20 17:00:11 UTC

Counter update write timeouts with Datastax Driver/Native protocol, not with Astyanax/Thrift

I've posted the following to the Datastax Java Driver user forum, but no
one has responded, so I thought I'd try here, too.

We have a service that writes to a few legacy (pre-CQL) counter column
families in a Cassandra 2.1.11 cluster. We've been trying to migrate this
service from Astyanax to the Datastax Java Driver (version 2.1.10.1). We've
been testing the new version in a "shadow" deployment in a production
environment, using the same Cassandra cluster as the production version,
but writing to a testing-only keyspace.

Occasionally, unlogged batches of counter updates in the same partition
will fail with the following error from the coordinator:

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra
timeout during write query at consistency ONE (1 replica were required but
only 0 acknowledged the write)

We've only observed these errors in the service version that uses the
Datastax Driver, not the version that uses Astyanax.

These batches are written with CL=LOCAL_QUORUM, so the CL in the error message
doesn't match. This resembles the symptoms of the issue described in
CASSANDRA-10041: "timeout during write query at consistency ONE" when
updating counter at consistency QUORUM and 2 of 3 nodes alive.
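
To make the mismatch concrete, here is a minimal sketch of the arithmetic.
It assumes a replication factor of 3 in the local datacenter, which is only
an illustrative value; the class name QuorumCheck and the variable
assumedLocalRf are made up for this example and are not part of our service
or of any driver API. Under that assumption a LOCAL_QUORUM write should need
2 acknowledgements, whereas the error reports that only 1 was required.

```java
// Illustrative sketch only: the replication factor below is an assumed value.
final class QuorumCheck {

    // Quorum size for a replication factor: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int assumedLocalRf = 3;      // assumption, not taken from our keyspace definition
        int requiredInError = 1;     // "1 replica were required" in the error message

        int requiredByLocalQuorum = quorum(assumedLocalRf);

        System.out.println("Acks expected for LOCAL_QUORUM: " + requiredByLocalQuorum);
        System.out.println("Acks required per error message: " + requiredInError);
        System.out.println("Mismatch: " + (requiredByLocalQuorum != requiredInError));
    }
}
```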

In that issue, the error occurs when a node is abruptly terminated.
However, we've also seen the error occur when all Cassandra nodes appeared
to be healthy.

There are a few possible explanations for why the errors only occur with
the Datastax driver, but I'm not sure which is correct:
a) There is a problem with how we're using the Datastax Driver to compose
batches of counter updates.
b) There is a difference between the implementation of counter updates in
the Native protocol and in the Thrift protocol, such that the error is
reported to native clients, but not to Thrift clients.
c) There is a difference between the keyspace/column family definition of
the production and testing keyspaces.
d) The Astyanax/Thrift version is getting the error but is ignoring it for
some reason.

I doubt (c) is the reason; we've made an effort to ensure that the keyspace
and CF configurations are the same. Also, (d) seems unlikely because we've
seen other errors (such as unavailable exceptions) reported correctly. So,
I'm betting that either (a) or (b) is the reason.

Would someone please suggest which of these explanations is likely to be
correct, and what we might do to avoid the problem?

-- 
- Steven