Posted to dev@zookeeper.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/10 18:42:00 UTC
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122076#comment-16122076 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

GitHub user enixon opened a pull request:

    https://github.com/apache/zookeeper/pull/333

    ZOOKEEPER-2872: Interrupted snapshot sync causes data loss

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/enixon/zookeeper snap-sync

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #333
    
----
commit 39bd1a3eb9171a014845fff97648341cbfb40a11
Author: Brian Nixon <ni...@fb.com>
Date:   2017-08-01T20:25:51Z

    ZOOKEEPER-2872: Interrupted snapshot sync causes data loss

----


> Interrupted snapshot sync causes data loss
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2872
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Brian Nixon
>
> An observer can permanently lose data from its local data tree while remaining a member in good standing of the ensemble and continuing to serve client traffic. This happens when the following chain of events occurs:
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to catch up.
> 3. The machine powers off after some txns have been logged but before the snapshot is synced to disk (depending on the OS, this can happen!).
> 4. The observer comes back up a second time and replays its most recent snapshot (epoch <= N) as well as the txn logs (epoch N+1).
> 5. A diff sync is requested from the leader and the observer broadcasts availability.
> In this scenario, any commits from epoch N that the observer did not receive before it died the first time will never reach the observer, and no part of the ensemble will complain.
> This situation is not unique to observers and can happen to any learner. As a simple fix, fsync-ing the snapshots received from the leader avoids the case where a missing snapshot causes data loss.
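
For illustration only, the fsync step proposed in the description could be sketched in Java roughly as follows. This is not the code from the pull request: the class name DurableSnapshotWriter and the method writeDurably are hypothetical, and the sketch assumes the snapshot is already available as a byte array. The only point it makes is that FileChannel.force(true) is called before the learner treats the snapshot as persisted, so that a power loss cannot leave newer txn logs on disk without the snapshot they depend on.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of ZooKeeper: writes a snapshot received
// from the leader and forces it to stable storage before returning.
final class DurableSnapshotWriter {

    static void writeDurably(Path target, byte[] snapshotBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(snapshotBytes);
            // write() may write fewer bytes than requested, so loop until done.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Ask the OS to flush file content and metadata to the storage
            // device. Without this, the data may sit in the page cache and
            // be lost if the machine powers off.
            channel.force(true);
        }
    }
}
```

A caller would invoke writeDurably(snapshotPath, bytes) before logging any txns that build on the snapshot. On some filesystems the containing directory may also need to be synced for a newly created file to survive a crash; that is outside the scope of this sketch.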



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)