Posted to dev@zookeeper.apache.org by "Flavio Junqueira (Commented) (JIRA)" <ji...@apache.org> on 2012/03/14 22:38:39 UTC

[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229656#comment-13229656 ] 

Flavio Junqueira commented on ZOOKEEPER-1277:
---------------------------------------------

Ok, I only looked at propose() as you suggested, Pat. That method looks right: it forces a leader election when we reach the limit. However, I'm not sure how we guarantee that Zab will work correctly when this exception is thrown. It is an invariant of the protocol that a follower won't go back to a previous epoch; if we roll over, then followers will have to go back to a previous epoch, no? How do we make sure that it doesn't break the protocol implementation?
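
For readers following along, here is a minimal sketch of the kind of check being discussed: refuse to issue another zxid once the lower 32 bits are exhausted, so that the caller can force a new leader election instead of letting the counter spill into the epoch bits. It is an illustration under stated assumptions, not the code from the attached patch: RolloverGuard, RolloverException, and nextZxid are hypothetical names, and the sketch does not address the question of whether followers ever see an epoch go backwards.

```java
// Hypothetical sketch, not the ZOOKEEPER-1277 patch. All names are invented for illustration.
final class RolloverGuard {
    // Mask selecting the lower 32 bits (the per-epoch counter) of a zxid.
    static final long COUNTER_MASK = 0xffffffffL;

    // Hypothetical signal that the counter is exhausted and a new election is needed.
    static final class RolloverException extends RuntimeException {
        RolloverException(String msg) {
            super(msg);
        }
    }

    private long lastProposed;

    RolloverGuard(long lastProposed) {
        this.lastProposed = lastProposed;
    }

    // Returns the next zxid, or throws if incrementing would carry into the epoch bits.
    long nextZxid() {
        if ((lastProposed & COUNTER_MASK) == COUNTER_MASK) {
            throw new RolloverException("lower 32 bits of zxid exhausted; a new election is required");
        }
        return ++lastProposed;
    }

    public static void main(String[] args) {
        // Epoch 5 with the counter one step away from its limit.
        RolloverGuard guard = new RolloverGuard((5L << 32) | 0xfffffffeL);
        System.out.println(Long.toHexString(guard.nextZxid()));
        try {
            guard.nextZxid();
        } catch (RolloverException e) {
            System.out.println("would force re-election: " + e.getMessage());
        }
    }
}
```

In this sketch the exception is unchecked only to keep the example short; a real server would also have to decide what happens to in-flight proposals when it fires.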
                
> servers stop serving when lower 32bits of zxid roll over
> --------------------------------------------------------
>
>                 Key: ZOOKEEPER-1277
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.3.6
>
>         Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch
>
>
> When the lower 32 bits of a zxid "roll over" (a zxid is a 64-bit number, of which the upper 32 bits are treated as the epoch number), the epoch number (upper 32 bits) is incremented and the lower 32 bits start at 0 again.
> This should work fine, however in the current 3.3 branch the followers see this as a NEWLEADER message, which it's not, and effectively stop serving clients. Attached clients seem to eventually time out given that heartbeats (or any operation) are no longer processed. The follower doesn't recover from this.
> I've tested this out on 3.3 branch and confirmed this problem, however I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335, however there is certainly an issue with updating the "acceptedEpoch" files contained in the datadir. (I'll enter a separate jira for that)
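
The zxid layout in the description above can be sketched as follows. This is only an illustration of the bit arithmetic, assuming the 32/32 split stated in the issue; ZxidLayout, makeZxid, epochOf, and counterOf are hypothetical helper names chosen for this example. It shows why a plain increment at the counter limit silently bumps the epoch field.

```java
// Hypothetical helpers illustrating the zxid layout described in the issue.
final class ZxidLayout {
    // Mask selecting the lower 32 bits (the per-epoch counter).
    static final long COUNTER_MASK = 0xffffffffL;

    // Packs an epoch (upper 32 bits) and a counter (lower 32 bits) into one zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // Extracts the upper 32 bits; the unsigned shift keeps the result non-negative.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Extracts the lower 32 bits.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    public static void main(String[] args) {
        long last = makeZxid(5L, COUNTER_MASK); // epoch 5, counter at its maximum
        long next = last + 1;                   // plain increment carries into the epoch bits
        System.out.printf("last: epoch=%d counter=%d%n", epochOf(last), counterOf(last));
        System.out.printf("next: epoch=%d counter=%d%n", epochOf(next), counterOf(next));
    }
}
```

Under these assumptions, the increment alone changes the apparent epoch from 5 to 6 without any election having taken place, which is the condition the followers in the 3.3 branch misinterpret.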
