You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@zookeeper.apache.org by "Enrico Olivelli (Jira)" <ji...@apache.org> on 2020/11/23 11:54:00 UTC

[jira] [Resolved] (ZOOKEEPER-3642) Data inconsistency when the leader crashes right after sending SNAP sync

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrico Olivelli resolved ZOOKEEPER-3642.
----------------------------------------
    Fix Version/s: 3.7.0
       Resolution: Fixed

Issue resolved by pull request 1515
[https://github.com/apache/zookeeper/pull/1515]

> Data inconsistency when the leader crashes right after sending SNAP sync
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3642
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0, 3.5.5, 3.5.6, 3.7.0
>         Environment: Linux 4.19.29 x86_64
>            Reporter: Alex Mirgorodskiy
>            Assignee: Fangmin Lv
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.7.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> If the leader crashes after sending a SNAP sync to a learner, but before sending the NEWLEADER message, the learner will not save the snapshot to disk. But it will advance its lastProcessedZxid to that from the snapshot (call it Zxid X)
> A new leader will get elected, and it will resync our learner again immediately. But this time, it will use the incremental DIFF method, starting from Zxid X. A DIFF-based resync does not trigger snapshots, so the learner is still holding the original snapshot purely in memory. If the learner restarts after that, it will silently lose all the data up to Zxid X.
> An easy way to reproduce is to insert System.exit into LearnerHandler.java right before sending the NEWLEADER message (on the one instance that is currently running the leader, but not the others):
> {noformat}
>              LOG.debug("Sending NEWLEADER message to " + sid);
> +            if (leader.self.getId() == 1 && sid == 3) {
> +               LOG.debug("Bail when server.1 resyncs server.3");
> +               System.exit(0);
> +            }
> {noformat}
> If I remember right, the repro steps are as follows. Run with that patch in a 4-instance ensemble where server.3 is an Observer, the rest are voting members, and server.1 is the current Leader. Start server.3 after the other instances are up. It will get the initial snapshot from server.1 and server.1 will stop immediately because of the patch. Say, server.2 takes over as the new Leader. Server.3 will receive a Diff resync from server.2, but will skip persisting the snapshot. A subsequent restart of server.3 will make that instance come up with a blank data tree.
> The above steps assumed that server.3 is an Observer, but it can presumably happen for voting members too. Just need a 5-instance ensemble.
> Our workaround is to take the snapshot unconditionally on receiving NEWLEADER:
> {noformat}
> -                   if (snapshotNeeded) {
> +                   // Take the snapshot unconditionally. The first leader may have crashed
> +                   // after sending us a SNAP, but before sending NEWLEADER. The second leader will
> +                   // send us a DIFF, and we'd still like to take a snapshot, even though
> +                   // the upstream code used to skip it.
> +                   if (true || snapshotNeeded) {
>                         zk.takeSnapshot();
>                     }
> {noformat}
> This is what 3.4.x series used to do. But I assume it is not the ideal fix, since it essentially disables the "snapshotNeeded" optimization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)