Posted to dev@zookeeper.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2018/01/31 15:40:00 UTC

[jira] [Comment Edited] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347026#comment-16347026 ] 

Robert Joseph Evans edited comment on ZOOKEEPER-2845 at 1/31/18 3:39 PM:
-------------------------------------------------------------------------

I have a fix that I will be posting shortly.  I need to clean up the patch and make sure that I get pull requests ready for all of the branches that ZOOKEEPER-2926 went into.

 

The following table describes the situation that allows a node to get into an inconsistent state.

 
|| ||N1||N2||N3||
|Start with cluster in sync, N1 is leader|0x0 0x5|0x0 0x5|0x0 0x5|
|N2 and N3 go down|0x0 0x5| | |
|Proposal to N1 (fails with no quorum)|0x0 0x6| | |
|N2 and N3 return, but N1 is restarting.  N2 elected leader| |0x1 0x0|0x1 0x0|
|A proposal is accepted| |0x1 0x1|0x1 0x1|
|N1 returns and is trying to sync with the new leader N2|0x0 0x6|0x1 0x1|0x1 0x1|
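
The two-part entries in the table above are (epoch, counter) pairs; ZooKeeper packs them into a single 64-bit zxid with the epoch in the high 32 bits and the counter in the low 32 bits. A minimal illustrative sketch of that packing (the class and method names here are made up, not ZooKeeper's own utility class):

{code:java}
// Illustrative only: how the (epoch, counter) pairs in the table map onto
// single 64-bit zxids (epoch in the high 32 bits, counter in the low 32 bits).
public final class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    public static void main(String[] args) {
        long n1 = makeZxid(0x0, 0x6); // N1 after the un-quorumed proposal
        long n2 = makeZxid(0x1, 0x1); // N2/N3 after the new epoch and one commit
        System.out.printf("N1=0x%x N2=0x%x epochs=%d,%d%n",
                n1, n2, epochOf(n1), epochOf(n2));
    }
}
{code}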

 

At this point the code in {{LearnerHandler.syncFollower}} takes over to bring N1 into sync with N2, the new leader.

That code checks the following, in order (a condensed sketch follows the list):
 # Is there a {{forceSync}}? Not in this case.
 # Are the two zxids already in sync? No, {{0x0 0x6 != 0x1 0x1}}.
 # Is the peer zxid > the local zxid (and the peer didn't just rotate to a new epoch)? No, {{0x0 0x6 < 0x1 0x1}}.
 # Is the peer zxid between the max committed log and the min committed log? In this case yes, but it shouldn't be. The max committed log is {{0x1 0x1}}. The min committed log is {{0x0 0x5}}, or likely something below it, because it is based on distance back in the edit log. The issue is that once the epoch changes from {{0x0}} to {{0x1}}, the leader has no way to know whether the peer's edits are in its edit log without explicitly checking for them.
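
Roughly, in code form (this is a condensed, illustrative paraphrase of the checks above, not the actual {{LearnerHandler.syncFollower}} body; the parameter names are mine):

{code:java}
// Condensed paraphrase of the check order above; returns true when a full
// SNAP is needed, false when the leader thinks a DIFF/TRUNC is safe.
boolean needSnap(boolean forceSnapSync, long peerLastZxid, long lastProcessedZxid,
                 long minCommittedLog, long maxCommittedLog) {
    if (forceSnapSync) {
        return true;                              // 1. forced snapshot sync
    }
    if (peerLastZxid == lastProcessedZxid) {
        return false;                             // 2. already in sync, empty DIFF
    }
    if (peerLastZxid > maxCommittedLog) {         // 3. peer is ahead (modulo the
        return false;                             //    new-epoch special case)
    }
    if (minCommittedLog <= peerLastZxid && peerLastZxid <= maxCommittedLog) {
        return false;                             // 4. "in range" -> DIFF; this is
                                                  //    where 0x0 0x6 slips through
    }
    return true;                                  // otherwise fall back to SNAP
}
{code}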

 

The reason ZOOKEEPER-2926 exposed this is that previously, when a leader was elected, the in-memory DB was dropped and everything was reread from disk. When that happened, the {{0x0 0x6}} proposal was lost. But it is not guaranteed to be lost in all cases: in theory a snapshot could be triggered by that proposal, either on the leader or on a follower that also received the proposal but does not join the new quorum in time. As such, ZOOKEEPER-2926 really just extended the window of an already existing race, but it extended it almost indefinitely, so the race is now much more likely to happen.

 

My fix is to update {{LearnerHandler.syncFollower}} to only send a {{DIFF}} if the epochs are the same. If they are not the same, the leader cannot know whether the peer has transactions from the old epoch that the leader knows nothing about.
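
In sketch form, the extra guard I'm describing looks something like this (illustrative only, not the actual patch):

{code:java}
// Illustrative guard for the DIFF path: only treat the "in range" case as safe
// when the peer's epoch matches the epoch of the leader's committed log.
boolean safeToSendDiff(long peerLastZxid, long maxCommittedLog) {
    long peerEpoch = peerLastZxid >>> 32;      // 0x0 for N1 in the table above
    long leaderEpoch = maxCommittedLog >>> 32; // 0x1 for the new leader N2
    return peerEpoch == leaderEpoch;           // epochs differ -> no bare DIFF
}
{code}

With N1 at {{0x0 0x6}} and N2's committed log topping out at {{0x1 0x1}}, the epochs differ, so the in-range check no longer results in a plain {{DIFF}}.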

 



> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, then since the SyncRequestProcessor queue will be drained during shutdown, the snapshot and txn file will remain consistent before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble will have a data inconsistency issue. Here is a simplified scenario that shows the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers, C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create Node N, and it is converted to txn T1
> 2. Txn T1 was synced to disk on C, but just before the proposal reached the followers, A and B restarted, so T1 does not exist on A and B
> 3. A and B formed a new quorum after the restart; let's say B is the leader
> 4. C changed to LOOKING state because it did not have enough followers; it will sync with leader B with last zxid T0, which will result in an empty diff sync
> 5. Before C took a snapshot it restarted; it replayed the txns on disk, which include T1, so now it has Node N, but A and B do not
> Also, I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which avoids this issue by reconciling the snapshot and txn files before leader election; we will submit it for review.


