Posted to issues@zookeeper.apache.org by "anaud (Jira)" <ji...@apache.org> on 2020/09/21 14:50:00 UTC
[jira] [Comment Edited] (ZOOKEEPER-2832) Data Inconsistency occurs if follower has uncommitted transaction in the log while synchronizing with the leader that has the lower last processed zxid

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199423#comment-17199423 ] 

anaud edited comment on ZOOKEEPER-2832 at 9/21/20, 2:49 PM:
------------------------------------------------------------

I cannot reproduce the bug with the given test patch on the master branch in my environment, but I can reproduce it on 3.4.9.

I think that in 3.4.9 the root cause is a missing truncation in the syncWithLeader method of Learner.java, in the else-if block that starts with the line "else if (qp.getType() == Leader.SNAP) {".

Following the original reproduction steps, node B executes these lines of code at step 3 and still has transaction 21 in its log even though it has synced with C at zxid 12.

I think the fix for this bug in 3.4.9 may be as simple as invoking "zk.getZKDatabase().truncateLog(lastLeaderZxid)" in that else-if block, only if "lastZxid > lastLeaderZxid" is true.
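
To make the proposed guard concrete, here is a minimal sketch against a toy in-memory log rather than the real ZooKeeper classes. ToyFollowerLog, truncateTo and onSnapFromLeader are hypothetical names made up for illustration; only the condition "lastZxid > lastLeaderZxid" and the idea of truncating to lastLeaderZxid come from the comment above.

```java
import java.util.TreeMap;

/** Toy model of a follower's transaction log; this is not ZooKeeper code. */
final class ToyFollowerLog {
    // zxid -> human-readable description of the logged transaction
    private final TreeMap<Long, String> log = new TreeMap<>();

    void append(long zxid, String txn) {
        log.put(zxid, txn);
    }

    /** Highest zxid in the log, or 0 if the log is empty. */
    long lastZxid() {
        return log.isEmpty() ? 0L : log.lastKey();
    }

    /** Drops every entry whose zxid is strictly greater than the given zxid. */
    void truncateTo(long zxid) {
        log.tailMap(zxid, false).clear();
    }

    /**
     * Hypothetical SNAP handling: truncate only when the local log is ahead
     * of the leader. Returns true if any entries were dropped.
     */
    boolean onSnapFromLeader(long lastLeaderZxid) {
        if (lastZxid() > lastLeaderZxid) {
            truncateTo(lastLeaderZxid);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyFollowerLog b = new ToyFollowerLog();
        b.append(11, "create key0=0");
        b.append(12, "create key1=1");
        b.append(21, "setData key0=1000"); // never committed by a quorum
        boolean truncated = b.onSnapFromLeader(12);
        System.out.println(truncated + " " + b.lastZxid()); // prints "true 12"
    }
}
```

With the guard in place, the toy log for B ends at zxid 12 after the SNAP, so a later restore has no transaction 21 left to replay.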

It seems the else-if block in the master branch has changed quite a bit since 3.4.9, so I am not sure whether this observation applies to the master branch as well (and I am not sure whether what I pointed out is indeed the root cause and the right way to fix it).

Can you confirm?


> Data Inconsistency occurs if follower has uncommitted transaction in the log while synchronizing with the leader that has the lower last processed zxid
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2832
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.9
>            Reporter: Beom Heyn Kim
>            Priority: Major
>             Fix For: 3.4.10
>
>         Attachments: zookeeper-2832.patch
>
>
> Synchronization code may fail to truncate an uncommitted transaction in the follower’s transaction log. Here is a scenario:
>  
> Initial condition:
> Start the ensemble with three nodes A, B and C with C being the leader
> The current epoch is 1
> For simplicity of the example, let’s say zxid is a two digit number, with epoch being the first digit
> Create two znodes ‘key0’ and ‘key1’ whose values are ‘0’ and ‘1’, respectively
> The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For simplicity of the example, the zxid gets increased only by transactions directly changing the data of znodes.)
> All the nodes have seen the change 12 and have persistently logged it
> Shut down all
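
For readers following the two-digit zxids, the sketch below puts the simplified model next to the 64-bit layout, in which the epoch occupies the high 32 bits and the counter the low 32 bits. The class and method names (ZxidSketch, toyZxid, makeZxid, epochOf, counterOf) are made up for this example and are not taken from the ZooKeeper code base.

```java
/** Illustrative helpers only; the names are hypothetical. */
final class ZxidSketch {
    /** Two-digit model from the description: the epoch is the tens digit. */
    static int toyZxid(int epoch, int counter) {
        return epoch * 10 + counter;
    }

    /** 64-bit layout: epoch in the high 32 bits, counter in the low 32 bits. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        System.out.println(toyZxid(2, 1));                        // prints 21
        System.out.println(Long.toHexString(makeZxid(2, 1)));     // prints 200000001
        // A new epoch with no transactions still outranks an older epoch,
        // which is why 20 > 12 and 30 > 21 in the steps of this scenario.
        System.out.println(makeZxid(3, 0) > makeZxid(2, 1));      // prints true
    }
}
```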
>  
> Step 1
> Start Nodes A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), with zxid 21 is issued. The leader B writes it to its log, but Node A is shut down before writing it to its log. Then, the leader B is also shut down. The change 21 is applied only to B, not to A or C.
>  
> Step 2
> Start Nodes A and C. Epoch becomes 3. Node A has a higher zxid than Node C (i.e. 20 > 12). So, Node A becomes the leader. Yet, the last processed zxid is 12 for both Node A and Node C. So, they are in sync already. Node A sends an empty DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then, A and C are shut down. Now, Node C has a higher zxid than Node B.
>  
> Step 3
> Start Nodes B and C. Epoch becomes 4. Node C has a higher zxid than Node B (i.e. 30 > 21). So, Node C becomes the leader. Nodes B and C have different last processed zxids (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and creates snapshot.12, as zxid 12 is the last processed zxid of the leader C. (Note that the newly created snapshot on B is assigned a lower zxid than the change 21 in the log.) Then, the request, setData(key1, 1001), with zxid 41 is issued. Both B and C write the change 41 to their logs. (Note that B and C now have the same last processed zxid.) Then, B and C are shut down.
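
The leader's choice of sync packet, as far as this description states it, can be modeled roughly as follows. This is a simplified sketch based only on the text of this issue, not the actual LearnerHandler code: SyncChoice, Packet and choose are hypothetical names, and the case with a non-empty proposal list is deliberately left unmodeled.

```java
import java.util.Collections;
import java.util.List;

/** Toy model of the leader's decision as described in this issue. */
final class SyncChoice {
    enum Packet { DIFF, SNAP, NOT_MODELED }

    static Packet choose(long peerLastZxid,
                         long leaderLastProcessedZxid,
                         List<Long> proposals) {
        if (peerLastZxid == leaderLastProcessedZxid) {
            return Packet.DIFF; // considered in sync: empty DIFF
        }
        if (proposals.isEmpty()) {
            return Packet.SNAP; // nothing to diff against: full state transfer
        }
        return Packet.NOT_MODELED; // outside the scope of this sketch
    }

    public static void main(String[] args) {
        List<Long> empty = Collections.emptyList();
        System.out.println(choose(12, 12, empty)); // Step 2: prints DIFF
        System.out.println(choose(21, 12, empty)); // Step 3: prints SNAP
        System.out.println(choose(41, 41, empty)); // Step 4: prints DIFF
    }
}
```

In this model the follower's log being ahead of the leader (21 vs 12) leads to SNAP, and nothing in that path tells the follower to drop the entries above 12.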
>  
> Step 4
> Start Nodes B and C. Epoch becomes 5. Nodes B and C use their local log and snapshot files to restore their in-memory data trees. Node B has 1000 as the value of key0, because its latest valid snapshot is snapshot.12 and its log contains a later transaction with zxid 21. Yet, Node C has 0 as the value of key0, because the change 21 was never written on C. Node C is the leader. Nodes B and C have the same last processed zxid, i.e. 41. So, they are considered to be in sync already, and Node C sends an empty DIFF to Node B. So, the synchronization completes with the initially restored in-memory data trees on B and C.
>  
> Problem
> The value of key0 on Node B is 1000, while the value of key0 on Node C is 0. At Step 3, LearnerHandler.run on C never sends TRUNC, only SNAP. So, the change 21 was never truncated on B. Also, at Step 4, since B uses a snapshot with a lower zxid to restore its in-memory data tree, the change 21 gets into the data tree. Then, the leader C at Step 4 does not send SNAP, because the change 41 made to both B and C makes the leader C think that B and C are already in sync. Thus, data inconsistency occurs.
>  
> The attached test case can deterministically reproduce the bug.
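
The divergence in Steps 3 and 4 can be replayed with a toy model of restore-from-disk. It assumes, as the description implies, that a restore loads the latest snapshot and then replays every logged transaction with a zxid above the snapshot's zxid. ToyNode, Txn, installSnapshot, restore and replayScenario are hypothetical names for illustration and do not correspond to ZooKeeper classes; this is separate from the attached test case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one node's on-disk state; this is not ZooKeeper code. */
final class ToyNode {
    /** A logged setData-style transaction. */
    static final class Txn {
        final String key;
        final String value;
        Txn(String key, String value) { this.key = key; this.value = value; }
    }

    private long snapshotZxid;
    private final Map<String, String> snapshot = new HashMap<>();
    private final TreeMap<Long, Txn> log = new TreeMap<>();

    void logTxn(long zxid, String key, String value) {
        log.put(zxid, new Txn(key, value));
    }

    /** Writes a snapshot at the given zxid; the log is left untouched (no truncation). */
    void installSnapshot(long zxid, Map<String, String> data) {
        snapshotZxid = zxid;
        snapshot.clear();
        snapshot.putAll(data);
    }

    /** Loads the snapshot, then replays log entries with zxid > snapshotZxid. */
    Map<String, String> restore() {
        Map<String, String> tree = new HashMap<>(snapshot);
        for (Txn t : log.tailMap(snapshotZxid, false).values()) {
            tree.put(t.key, t.value);
        }
        return tree;
    }

    /** Returns key0 as restored on B and on C at Step 4. */
    static String[] replayScenario() {
        Map<String, String> stateAt12 = new HashMap<>();
        stateAt12.put("key0", "0");
        stateAt12.put("key1", "1");

        ToyNode b = new ToyNode();
        ToyNode c = new ToyNode();
        b.logTxn(21, "key0", "1000");     // Step 1: logged on B only
        c.installSnapshot(12, stateAt12); // Step 2: C creates snapshot.12
        b.installSnapshot(12, stateAt12); // Step 3: SNAP from C, log of B not truncated
        b.logTxn(41, "key1", "1001");     // Step 3: change 41 on both nodes
        c.logTxn(41, "key1", "1001");
        return new String[] { b.restore().get("key0"), c.restore().get("key0") };
    }

    public static void main(String[] args) {
        String[] r = replayScenario();
        System.out.println("B key0=" + r[0] + ", C key0=" + r[1]); // prints "B key0=1000, C key0=0"
    }
}
```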



--
This message was sent by Atlassian Jira
(v8.3.4#803005)