Posted to dev@zookeeper.apache.org by "Mike Lundy (JIRA)" <ji...@apache.org> on 2015/04/15 02:03:02 UTC

[jira] [Created] (ZOOKEEPER-2167) Restarting current leader node sometimes results in a permanent loss of quorum

Mike Lundy created ZOOKEEPER-2167:
-------------------------------------

             Summary: Restarting current leader node sometimes results in a permanent loss of quorum
                 Key: ZOOKEEPER-2167
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2167
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.6
            Reporter: Mike Lundy
         Attachments: fails-to-rejoin-quorum.gz

I'm seeing an issue where a restart of the current leader node results in a long-term / permanent loss of quorum (I've only waited 30 minutes, but it doesn't look like it's making any progress). Restarting the same instance _again_ seems to resolve the problem.

To me, this looks a lot like the issue described in https://issues.apache.org/jira/browse/ZOOKEEPER-1026, but I'm filing this separately for the moment in case I am wrong.

Notes on the attached log:
1) If you search for XXX in the log, you'll find my annotations marking where the process was told to terminate, where it is reported to have finished terminating, and then the same pair of markers for the start.
2) To save you the trouble of figuring it out, here's the zid <=> ip mapping (for reference, a sketch of the corresponding zoo.cfg follows this list):
zid=1, ip=10.20.0.19
zid=2, ip=10.20.0.18
zid=3, ip=10.20.0.20
zid=4, ip=10.20.0.21
zid=5, ip=10.20.0.22
3) It's important to note that this log was captured during a rolling service restart to remove an instance; in this case, zid #2 / 10.20.0.18 is the one being removed, so if you see a conspicuous silence from that service, that's why.
4) I've been unable to reproduce this problem _except_ during cluster size changes, so I suspect that may be related. It's also important to note that this test goes from 5 -> 4; since we remove one node and then do the rolling restart, we temporarily drop to 3 live servers, which leaves no failure margin (a 4-member ensemble needs 3 votes for quorum). I know this is not a recommended procedure (it's more of a stress test). We have seen the same problem on larger cluster sizes; it just seems easier to reproduce on smaller ones. A sketch of how one might poll each node's role between restarts also follows below.
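
For reference, the static configuration for an ensemble like this would look roughly as follows. This is a minimal sketch only; the ports and paths shown are the ZooKeeper defaults and are assumptions, not copied from the actual deployment:

    # Hypothetical zoo.cfg for the 5-node ensemble above; clientPort
    # and the 2888/3888 quorum/election ports are ZooKeeper defaults.
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.<zid>=<ip>:<quorum port>:<leader election port>
    server.1=10.20.0.19:2888:3888
    server.2=10.20.0.18:2888:3888
    server.3=10.20.0.20:2888:3888
    server.4=10.20.0.21:2888:3888
    server.5=10.20.0.22:2888:3888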
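
To see which node is leader (and whether the ensemble is serving at all) between restarts, something like the following works against the stock four-letter-word interface. A minimal sketch, assuming clientPort=2181; the host list is the zid <=> ip mapping above:

    #!/usr/bin/env python3
    # Poll each ensemble member's four-letter-word interface ('stat')
    # and report its mode: leader, follower, or unreachable.
    import socket

    HOSTS = {1: "10.20.0.19", 2: "10.20.0.18", 3: "10.20.0.20",
             4: "10.20.0.21", 5: "10.20.0.22"}

    def mode(host, port=2181, timeout=2.0):
        """Send 'stat' and return the value of the 'Mode:' line."""
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"stat")
                data = b""
                while True:
                    chunk = s.recv(4096)
                    if not chunk:
                        break
                    data += chunk
        except OSError:
            return "unreachable"
        for line in data.decode("ascii", "replace").splitlines():
            if line.startswith("Mode:"):
                return line.split(":", 1)[1].strip()
        # A node that has lost quorum answers 'stat' with an error
        # string ("This ZooKeeper instance is not currently serving
        # requests") instead of the usual stats dump.
        return "not serving (no quorum?)"

    if __name__ == "__main__":
        for zid, ip in sorted(HOSTS.items()):
            print("zid=%d ip=%s mode=%s" % (zid, ip, mode(ip)))

During the failure window described above, every surviving node reports "not serving" rather than one of them reporting "leader".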



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)