Posted to commits@cassandra.apache.org by "Jon Meredith (JIRA)" <ji...@apache.org> on 2019/05/23 13:58:00 UTC
[jira] [Commented] (CASSANDRA-15138) A cluster (RF=3) not recovering after two nodes are stopped
[ https://issues.apache.org/jira/browse/CASSANDRA-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846734#comment-16846734 ]
Jon Meredith commented on CASSANDRA-15138:
------------------------------------------
Thanks for the detailed steps in the report. Which versions of Cassandra have you reproduced the issue with?
> A cluster (RF=3) not recovering after two nodes are stopped
> -----------------------------------------------------------
>
> Key: CASSANDRA-15138
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15138
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Membership
> Reporter: Hiroyuki Yamada
> Priority: Normal
>
> I faced a weird issue when recovering a cluster after two nodes are stopped.
> It is easily reproducible and looks like a bug or an issue to fix.
> The following are the steps to reproduce it.
> === STEPS TO REPRODUCE ===
> * Create a 3-node cluster with RF=3
> - node1(seed), node2, node3
> * Start requests to the cluster with cassandra-stress (it keeps running
> for the duration of the steps below)
> - what we did: cassandra-stress mixed cl=QUORUM duration=10m -errors ignore -node node1,node2,node3 -rate threads>=16 threads<=256
> - (It doesn't have to be this many threads. It can be 1.)
> * Stop node3 normally (with systemctl stop or kill (without -9))
> - the system is still available as expected because the quorum of nodes is
> still available
> * Stop node2 normally (with systemctl stop or kill (without -9))
> - the system is NOT available as expected after it's stopped.
> - the client gets `UnavailableException: Not enough replicas
> available for query at consistency QUORUM`
> - the client gets errors right away (within a few ms)
> - so far it's all expected
> * Wait for 1 minute
> * Bring node2 back up
> - The issue happens here.
> - the client gets `ReadTimeoutException` or `WriteTimeoutException`,
> depending on whether the request is a read or a write, even after node2 is
> up
> - the client gets errors after about 5000ms or 2000ms, which are the
> request timeouts for read and write requests, respectively
> - what node1 reports with `nodetool status` and what node2 reports
> are not consistent. (node2 thinks node1 is down)
> - It takes a very long time to recover from this state
> === STEPS TO REPRODUCE ===
> Some additional important information to note:
> * If we don't start cassandra-stress, the issue does not occur.
> * Restarting node1 resolves the issue; the state recovers right after the restart.
> * Setting a lower value for dynamic_snitch_reset_interval_in_ms (60000
> or so) fixes the issue.
> * If we `kill -9` the nodes, the issue does not occur.
> * Hints seem unrelated. I tested with hints disabled and it made no difference.
>
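For reference, the availability behavior that the report calls "expected" (requests still served with node3 down, UnavailableException once node2 is also down) follows from the usual quorum arithmetic, quorum = floor(RF / 2) + 1. Below is a minimal sketch of that arithmetic only. The function names are hypothetical and this is not Cassandra code; it does not model the timeout behavior that the issue is actually about.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are alive to attempt a QUORUM request."""
    return live_replicas >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # RF=3 on a 3-node cluster, as in the report
    for live in (3, 2, 1):
        status = "available" if quorum_available(rf, live) else "unavailable"
        print(f"RF={rf}, live replicas={live}: QUORUM {status}")
```

With RF=3 the quorum is 2, so one stopped node is tolerated and two are not. After node2 comes back there are again two live replicas, which is why the continued timeouts in the report are unexpected.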