Posted to dev@zookeeper.apache.org by "Camille Fournier (Commented) (JIRA)" <ji...@apache.org> on 2011/11/01 16:25:32 UTC

[jira] [Commented] (ZOOKEEPER-1136) NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141234#comment-13141234 ] 

Camille Fournier commented on ZOOKEEPER-1136:
---------------------------------------------

This change causes a concurrency bug. Specifically:
1. Follower rejoins, gets snap from leader
2. Follower gets NEWLEADER message and takes a snapshot
3. Follower gets some additional transactions forwarded from leader, applies these directly to data tree
4. Follower gets an UPTODATE message, does not take a snapshot
5. Follower starts following, writes some new transactions to its log, and is killed before it takes another snapshot
6. Follower restarts and gets a DIFF from the leader

The transactions that came in between NEWLEADER and UPTODATE are lost because they are applied only to the internal data tree and never go anywhere else. If that tree isn't snapshotted afterward and the follower restarts with only a DIFF, the follower never gets these transactions back.

I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to explain that clearly. If one of you could check on https://issues.apache.org/jira/browse/ZOOKEEPER-1264 and let me know the reasoning, that would be helpful.
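
To make the sequence above concrete, here is a rough toy model of it. This is only a sketch: ToyFollower, its fields, and the zxid values are hypothetical and are not ZooKeeper's actual classes. It assumes a restart rebuilds state from the last snapshot plus the transaction log, and that a DIFF only carries transactions newer than the last zxid the follower has seen.

```python
class ToyFollower:
    """Hypothetical toy model of a follower; not ZooKeeper's real code."""

    def __init__(self, snapshot_after):
        # Assumed knob: "NEWLEADER" or "UPTODATE".
        self.snapshot_after = snapshot_after
        self.tree = set()           # internal data tree, as a set of applied zxids
        self.disk_snapshot = set()  # last snapshot written to disk
        self.disk_log = []          # transaction log written to disk

    def take_snapshot(self):
        self.disk_snapshot = set(self.tree)

    def sync(self, leader_snap, in_flight):
        self.tree = set(leader_snap)             # step 1: snap from leader
        if self.snapshot_after == "NEWLEADER":
            self.take_snapshot()                 # step 2: snapshot on NEWLEADER
        self.tree.update(in_flight)              # step 3: applied to tree only
        if self.snapshot_after == "UPTODATE":
            self.take_snapshot()                 # proposed: snapshot on UPTODATE

    def follow(self, txns):
        for zxid in txns:                        # step 5: normal following
            self.disk_log.append(zxid)
            self.tree.add(zxid)

    def crash_and_restart(self, leader_history):
        # Killed before another snapshot: only the disk state survives.
        self.tree = set(self.disk_snapshot) | set(self.disk_log)
        last_seen = max(self.tree)
        # step 6: DIFF only covers transactions newer than last_seen.
        self.tree.update(z for z in leader_history if z > last_seen)
        return self.tree


def run(snapshot_after):
    """Return the zxids the follower is missing after the restart."""
    leader_history = [1, 2, 3, 4, 5, 6, 7]
    follower = ToyFollower(snapshot_after)
    follower.sync(leader_snap=[1, 2, 3], in_flight=[4, 5])
    follower.follow([6, 7])
    recovered = follower.crash_and_restart(leader_history)
    return sorted(set(leader_history) - recovered)


print(run("NEWLEADER"))  # [4, 5]
print(run("UPTODATE"))   # []
```

In this model, snapshotting on NEWLEADER leaves zxids 4 and 5 only in memory, and the log entries 6 and 7 push the follower's last seen zxid past them, so the DIFF never resends them. Snapshotting on UPTODATE captures them before normal following begins.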
                
> NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1136
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1136
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Benjamin Reed
>            Assignee: Benjamin Reed
>            Priority: Blocker
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch
>
>
> the NEW_LEADER message was sent at the beginning of the sync phase in Zab pre1.0, but it must be at the end in Zab 1.0. if the protocol is 1.0 or greater we need to queue rather than send the packet.
