Posted to dev@zookeeper.apache.org by "Hongchao Deng (JIRA)" <ji...@apache.org> on 2015/01/30 02:05:34 UTC

[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298005#comment-14298005 ] 

Hongchao Deng commented on ZOOKEEPER-2099:
------------------------------------------

In Scenario Step 7, do you mean the following?
{code}
Host H2 recovers and connects to H1
{code}

> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2099
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.6.0
>            Reporter: Santeri (Santtu) Voutilainen
>         Attachments: ZOOKEEPER-2099-repro.patch
>
>
> When a learner syncs with the leader, it is possible for the leader to send the learner a DIFF that does NOT contain all the transactions between the learner's zxid and the leader's zxid, thus resulting in a corrupted datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a SNAP and the zxid requested by the learner must still exist in the current leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the necessary records in the in-mem committed log, AND b) the size of the required txnlog to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's txnlogs have records up to and including Z1 as well as Z2 and up.  It does NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader.
> # Host H2 recovers and connects to H2 to sync and sends Z1 in its FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because although it does not have the necessary records in its in-memory commit log, it does have Z1 in its txnlog and the size of the log is less than the limit.  H1 ends up with a different size calculation than H3 because H1 is missing all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on the type of transactions that occurred between Z1 and Z2 it may not hit any errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory contents from the affected hosts and have them resync with the leader at which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync from the txnlog by changing the database size limit to 0.  This is a code change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be persisted on-disk so that the txnlogs with the gap cannot be used to provide a DIFF even after restart.  A couple of ways in which the state could be persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that the host was sync'd with a SNAP and thus txnlogs might be missing. Presence of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" marker. Readers of the txnlog would then check for presence of this record when iterating through it and act appropriately.
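
To make the gap concrete, here is a minimal Java sketch of the scenario and of the "loggap" marker idea from the potential fixes. It is a toy model under stated assumptions, not ZooKeeper code: the class TxnLogGapSketch, both method names, and the zxid values are hypothetical, the txnlog is reduced to a sorted set of zxids, and the set of marker zxids stands in for the proposed loggap.<zxid> files.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of the scenario above; none of these names exist in ZooKeeper. */
final class TxnLogGapSketch {

    /**
     * Rule described in the scenario: a DIFF from the txnlog is considered
     * possible as soon as the learner's last zxid is found in the log.
     */
    static boolean naiveCanSendDiff(TreeSet<Long> txnLogZxids, long peerLastZxid) {
        return txnLogZxids.contains(peerLastZxid);
    }

    /**
     * Gap-aware variant (an assumption, modelled on the loggap.<zxid> idea):
     * refuse the DIFF if this host was synced by SNAP at a zxid greater than
     * the learner's last zxid, because the log may be missing records there.
     */
    static boolean gapAwareCanSendDiff(TreeSet<Long> txnLogZxids,
                                       TreeSet<Long> snapSyncZxids,
                                       long peerLastZxid) {
        if (!txnLogZxids.contains(peerLastZxid)) {
            return false;
        }
        // higher(x) returns the smallest element strictly greater than x, or null.
        return snapSyncZxids.higher(peerLastZxid) == null;
    }

    public static void main(String[] args) {
        long z1 = 100L; // last zxid on H1 and H2 before they disconnected
        long z2 = 200L; // zxid covered by the SNAP that H1 received from H3

        // H1's txnlog: records up to Z1, nothing for the range covered by the SNAP,
        // then records written after the SNAP.
        TreeSet<Long> h1Log = new TreeSet<>(Arrays.asList(98L, 99L, z1, 201L, 202L));
        // Marker recorded when H1 was synced by SNAP.
        TreeSet<Long> h1SnapMarkers = new TreeSet<>(Arrays.asList(z2));

        // H2 asks to sync from Z1.
        System.out.println(naiveCanSendDiff(h1Log, z1));                   // true
        System.out.println(gapAwareCanSendDiff(h1Log, h1SnapMarkers, z1)); // false
    }
}
```

In this model the naive check lets H1 send H2 a DIFF that silently skips everything between Z1 and Z2, while the marker forces a fallback to SNAP. A learner whose last zxid is past the marker (for example 201 here) could still be served from the txnlog.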



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)