Posted to commits@cassandra.apache.org by "Jeffrey F. Lukman (JIRA)" <ji...@apache.org> on 2018/09/26 21:27:00 UTC

[jira] [Commented] (CASSANDRA-12438) Data inconsistencies with lightweight transactions, serial reads, and rejoining node

    [ https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629434#comment-16629434 ] 

Jeffrey F. Lukman commented on CASSANDRA-12438:
-----------------------------------------------

Hi all,

Our team from UCARE at the University of Chicago has been able to reproduce this bug consistently with our model checker. Here are the workload and scenario that trigger the bug (a cqlsh sketch follows the scenario steps):

Workload: a 3-node cluster (call the nodes X, Y, and Z), 1 crash event, 1 reboot event, and 3 client requests (node X is the coordinator for all client requests).

Scenario:
 # Start the 3 nodes and set CONSISTENCY to ONE.
 # Inject client request 1 as described in the bug description:
Insert <id, V_a1, V_b1, ... , version=1> (along with many others)
 # Before node X has sent any PREPARE messages, node Z crashes.
 # Client request 1 is successfully committed on nodes X and Y.
 # Reboot node Z.
 # Inject client requests 2 and 3 as described in the bug description:
Update <id, V_a2, V_b1, ... , version=2> (along with others for which A=V_a1)
Update <id, V_a3, V_b1, ... , version=3> (along with many others for which A=V_a2)
(Update 3 can be omitted if we want to simplify the bug scenario.)
 # If client request 2 finishes first, without interference from client request 3, we expect to see:
<id, V_a2, V_b1, ... , version=2>
If client request 3 interferes with client request 2, or is executed before it for any reason, we expect to see:
<id, V_a3, V_b1, ... , version=3>
 # But our model checker shows that a read request to node Z returns <id, V_a2, version=2> (some fields are null), while a read request to node X or Y returns the complete row. In other words, the nodes end up with inconsistent views of the data.
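
Here is a minimal cqlsh sketch of the workload above. The keyspace, table, and column names are our own illustration, not the actual schema:

{noformat}
-- cqlsh, connected to node X (all names illustrative)
CONSISTENCY ONE;

-- Client request 1; node Z crashes before any PREPARE message is sent:
INSERT INTO ks.items (id, a, b, version) VALUES (1, 'V_a1', 'V_b1', 1) IF NOT EXISTS;

-- Reboot node Z, then inject client requests 2 and 3:
UPDATE ks.items SET a = 'V_a2', version = 2 WHERE id = 1 IF a = 'V_a1';
UPDATE ks.items SET a = 'V_a3', version = 3 WHERE id = 1 IF a = 'V_a2';
{noformat}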

If we run this scenario with CONSISTENCY = ALL, the bug does not occur.

We are happy to assist you in debugging this issue.

> Data inconsistencies with lightweight transactions, serial reads, and rejoining node
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Steven Schaefer
>            Priority: Major
>
> I've run into some issues with data inconsistency in a situation where a single node is rejoining a 3-node cluster with RF=3. I'm running 3.7.
> I have a client system which inserts data into a table with around 7 columns, named, let's say, A-F, id, and version. LWTs are used to make the inserts and updates.
> Typically what happens is there's an insert of values <id, V_a1, V_b1, ... , version=1>, then another process will pick up rows with, for example, A=V_a1 and subsequently update A to V_a2 and version to 2. Yet another process will watch for A=V_a2 and then make a second update to the same column, setting version to 3, with the end result being <id, V_a3, V_b1, ... , V_f1, version=3>. There's a secondary index on this A column (there are only a few possible values for A, so I'm not worried about the cardinality issue), though I've reproduced with the new SASI index too.
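> A rough CQL sketch of the schema and update flow just described (all names are illustrative, not the real schema; non-updated columns omitted from the insert for brevity):
> {noformat}
> CREATE TABLE ks.items (
>     id int PRIMARY KEY,
>     a text, b text, c text, d text, e text, f text,
>     version int
> );
> CREATE INDEX items_a_idx ON ks.items (a);  -- secondary index on the A column
>
> INSERT INTO ks.items (id, a, b, version) VALUES (1, 'V_a1', 'V_b1', 1) IF NOT EXISTS;
> UPDATE ks.items SET a = 'V_a2', version = 2 WHERE id = 1 IF a = 'V_a1';  -- process watching A=V_a1
> UPDATE ks.items SET a = 'V_a3', version = 3 WHERE id = 1 IF a = 'V_a2';  -- process watching A=V_a2
> {noformat}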
> If one of the nodes is down, there are still 2 alive for quorum, so inserts can still happen. When I bring up the downed node, sometimes I get really weird state back which ultimately crashes the client system that's talking to Cassandra.
> When reading I always select all the columns, but there is a conditional where clause that A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read is for processing any rows with V_a2, and ultimately updating them to V_a3 when complete. While periodically polling for A=V_a2 it is of course possible for the poller to observe the old V_a2 value while the other parts of the client system process and make the update to V_a3, and that's generally OK because of the LWTs used for updates; an occasionally wasted reprocessing run isn't a big deal. But when reading at serial I always expect to get the original values for the columns that were never updated too. If a Paxos update is in progress, I expect it to be completed before its value(s) are returned. Instead, the read seems to be seeing a partial commit of the LWT, returning the old V_a2 value for the changed column but no values whatsoever for the other columns. From the example above, instead of getting <id, V_a3, V_b1, ... , version=3>, or even the older <id, V_a2, V_b1, ..., version=2> (either of which I expect and is OK), I get only <id, V_a2, version=2>, so the rest of the columns end up null, which I never expect. However, this isn't persistent; Cassandra does end up consistent, which I see via sstabledump and cqlsh after the fact.
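> For reference, a single-partition serial read in cqlsh looks like the following (illustrative names; reading one partition at SERIAL runs a Paxos round that should complete any in-progress LWT before returning):
> {noformat}
> CONSISTENCY SERIAL;
> SELECT * FROM ks.items WHERE id = 1;
> {noformat}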
> In my client system logs I record the inserts/updates, and this inconsistency happens around the same time as the update from V_a2 to V_a3, hence my comment about Cassandra seeing a partial commit. That leads me to suspect that, due to the where clause in my read query for A=V_a2, one of the original good nodes may already have the new V_a3 value, so it doesn't return this row for the select query, but the other good node and the one that was down still have the old value V_a2, so those 2 nodes return what they have. The one that was down doesn't yet have the original insert, just the update from V_a1 -> V_a2 (again, I suspect; it's not been easy to verify), which would explain where <id, V_a2, version=2> comes from: that's all it knows about. However, since it's a serial quorum read, I'd expect some sort of exception, as neither of the remaining 2 nodes with A=V_a2 would be able to come to a quorum on the values for all the columns; I'd expect the other good node to return <id, V_a2, V_b1, ..., version=2>.
> I know at some point nodetool repair should be run on this node, but I'm concerned about the window of time between when the node comes back up and when repair starts/completes. It almost seems like, if a node goes down, the safest bet is to remove it from the cluster and rebuild it, rather than simply restarting the node? However, I haven't tested that to see if it runs into a similar situation.
> It is of course possible to work around the inconsistency for now by detecting and ignoring it in the client system, but if there is indeed a bug I hope we can identify it and ultimately resolve it.
> I'm also curious if this relates to CASSANDRA-12126, and also CASSANDRA-11219 may be relevant.
> I've been reproducing with a combination of manually stopping one node, running a test script I have to trigger the client system to insert data, then manually restarting the node and waiting. It's consistently inconsistent, reproducing on most attempts.
> Summary timeline:
> {noformat}
> 1.  Shut down third node of three.
> 2.  Insert <id, V_a1, V_b1, ... , version=1> (along with many others)
> 3.  Start the third node. (start occurs concurrently with 4 & 5)
> 4.  Update <id, V_a2, V_b1, ... , version=2> (along with others for which A=V_a1)
> 5.  Update <id, V_a3, V_b1, ... , version=3> (along with many others for which A=V_a2)
> 6.  Read
>     a.	Expected: <id, V_a2, V_b1, ... , version=2> OR <id, V_a3, V_b1, ... , version=3>
>     b.	Actual: <id, V_a2, version=2> // some fields are null
> {noformat}


