Posted to dev@zookeeper.apache.org by "Abraham Fine (JIRA)" <ji...@apache.org> on 2017/05/25 23:28:04 UTC

[jira] [Commented] (ZOOKEEPER-2791) Quorum doesn't recover after zxid rollover

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025553#comment-16025553 ] 

Abraham Fine commented on ZOOKEEPER-2791:
-----------------------------------------

Hi [~mheffner]-

Thanks for reporting this issue and uploading logs.

I have been trying to reproduce the issue with both 3.3.6 and 3.4.8 and have been unsuccessful. To trigger the rollover path early, I changed `if ((request.zxid & 0xffffffffL) == 0xffffffffL) {` to `if ((request.zxid & 0xffffffffL) == SOME_SMALLER_VALUE) {` to force a leader election, and in my testing ZooKeeper has handled it properly.
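
For anyone following along, here is a minimal standalone sketch of what that check tests. It assumes the usual zxid layout (epoch in the high 32 bits, per-epoch counter in the low 32 bits). The class name `ZxidRollover` and the helper `counterReached` are made up for illustration and are not ZooKeeper APIs; the `threshold` parameter stands in for `SOME_SMALLER_VALUE`.

```java
public class ZxidRollover {
    // Low 32 bits of a zxid: the per-epoch transaction counter.
    static final long COUNTER_MASK = 0xffffffffL;

    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    static long counter(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // Hypothetical helper mirroring the condition quoted above.
    // Passing COUNTER_MASK gives the original check; a smaller
    // threshold makes the condition fire much earlier.
    static boolean counterReached(long zxid, long threshold) {
        return counter(zxid) == threshold;
    }

    public static void main(String[] args) {
        long zxid = (5L << 32) | 0xfffffffeL; // epoch 5, counter one below the maximum

        System.out.println(counterReached(zxid, COUNTER_MASK));     // false
        System.out.println(counterReached(zxid + 1, COUNTER_MASK)); // true

        // A plain increment past the maximum counter carries into the epoch bits.
        System.out.println(Long.toHexString(zxid + 2));             // 600000000
    }
}
```

Lowering the threshold only changes when the condition fires, so it may not reproduce everything a real rollover does after billions of transactions.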

I was wondering if you have additional logs that show what was happening while the cluster was down. As far as I can tell, the uploaded logs cover only a second and come from only 2 machines. Would it be possible to get logs for the first few minutes after the rollover from all the machines in the cluster? It would be great to see all of the leader election messages that are being exchanged.

Thanks,
Abe

> Quorum doesn't recover after zxid rollover
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2791
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2791
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.3.6, 3.4.8
>         Environment: Ubuntu 14.04.4 LTS, AWS EC2, 5 node ensembles
>            Reporter: Mike Heffner
>            Assignee: Abraham Fine
>
> When the zxid rolls over, the ensemble is unable to recover without manually restarting the cluster. The leader enters shutdown() state when the zxid rolls over, but the remaining four nodes in the ensemble are not able to elect a new leader. This state persisted for at least 15 minutes before an operator manually restarted the cluster and the ensemble recovered.
> Config:
> --------
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/raid0/zookeeper
> clientPort=2181
> maxClientCnxns=100
> autopurge.snapRetainCount=14
> autopurge.purgeInterval=24
> leaderServes: True
> server.7=172.26.134.88:2888:3888
> server.6=172.26.136.143:2888:3888
> server.5=172.26.135.103:2888:3888
> server.4=172.26.134.16:2888:3888
> server.9=172.26.135.19:2888:3888
> Logs:
> https://gist.github.com/mheffner/d615d358d4a360ae56a0d0a280040640



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)