Posted to commits@cassandra.apache.org by "Joel Knighton (JIRA)" <ji...@apache.org> on 2015/09/04 01:01:46 UTC

[jira] [Commented] (CASSANDRA-10143) Apparent counter overcount during certain network partitions

    [ https://issues.apache.org/jira/browse/CASSANDRA-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729988#comment-14729988 ] 

Joel Knighton commented on CASSANDRA-10143:
-------------------------------------------

Looks like we've got some external reproduction here: [Cassandra 2.1 Counters: Testing Consistency During nodes Failures|http://datastrophic.io/evaluating-cassandra-2-1-counters-consistency/]

> Apparent counter overcount during certain network partitions
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-10143
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10143
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Joel Knighton
>            Assignee: Aleksey Yeschenko
>             Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> This issue is reproducible in this [Jepsen Test|https://github.com/riptano/jepsen/blob/f45f5320db608d48de2c02c871aecc4910f4d963/cassandra/test/cassandra/counter_test.clj#L16].
> The test starts a five-node cluster and issues increments by one against a single counter. It then checks that the counter is in the range [OKed increments, OKed increments + Write Timeouts] at each read. Increments are issued at CL.ONE and reads at CL.ALL.  Throughout the test, network failures are induced that create halved network partitions. A halved network partition splits the cluster into three connected nodes and two connected nodes, randomly.
> This test started failing; bisects showed that a test change actually caused the failure. When the network partitions are induced in a cycle of 15s healthy/45s partitioned or 20s healthy/45s partitioned, the test fails. When network partitions are induced in a cycle of 15s healthy/60s partitioned, 20s healthy/45s partitioned, or 20s healthy/60s partitioned, the test passes.
> There is nothing unusual in the logs of the nodes for the failed tests. The results are very reproducible.
> One noticeable trend is that more reads seem to get serviced during the failed tests.
> Most testing has been done on 2.1.8; the same issue appears to be present in 2.2/3.0/trunk, but I haven't spent as much time reproducing it there.
> Ideas?
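A minimal sketch of the invariant the Jepsen test checks, in Python rather than the actual Clojure test code (the function and variable names here are illustrative, not taken from the linked test): after a run of CL.ONE increments, every CL.ALL read of the counter should fall within [acknowledged increments, acknowledged increments + write timeouts], because a timed-out increment may or may not have been applied.

```python
def counter_in_bounds(read_value, acked, timeouts):
    """Check the Jepsen-style counter invariant.

    An acknowledged increment must be counted; a timed-out increment is
    indeterminate, so it widens only the upper bound. A read above
    acked + timeouts is an overcount (the symptom in this ticket); a read
    below acked is an undercount.
    """
    return acked <= read_value <= acked + timeouts


# Example: 100 acknowledged increments-by-one, 5 write timeouts.
assert counter_in_bounds(100, 100, 5)      # no timed-out write applied
assert counter_in_bounds(105, 100, 5)      # every timed-out write applied
assert not counter_in_bounds(106, 100, 5)  # overcount -> test failure
assert not counter_in_bounds(99, 100, 5)   # undercount -> test failure
```

The observed failures here are of the overcount form: the read value exceeds the upper bound even after accounting for every write timeout.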



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)