Posted to issues@zookeeper.apache.org by "Saswati (Jira)" <ji...@apache.org> on 2020/08/07 17:02:00 UTC

[jira] [Created] (ZOOKEEPER-3909) Zookeeper Unable to Join the Cluster after it is Restarted

Saswati created ZOOKEEPER-3909:
----------------------------------

             Summary: Zookeeper Unable to Join the Cluster after it is Restarted 
                 Key: ZOOKEEPER-3909
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.7
         Environment: All Environments 
            Reporter: Saswati


When we restart a ZooKeeper server, it does not successfully rejoin the cluster and start serving clients. The ZooKeeper service starts successfully, but it stays idle and returns the message: "This ZooKeeper instance is not currently serving requests"
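
For anyone trying to reproduce this, the state can be probed with a four-letter-word command. The following is a minimal Python sketch, assuming the "srvr" command is allowed by 4lw.commands.whitelist on the server; the helper names and the host in the usage note are illustrative and not part of ZooKeeper.

```python
import socket

# Text the server returns while it is not part of a serving quorum.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host, port, command="srvr", timeout=5.0):
    """Send a four-letter-word command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(reply):
    """Return False if the reply carries the 'not serving' message."""
    return NOT_SERVING not in reply
```

For example, is_serving(four_letter_word("zk1.example.com", 2181)) would return False for a server in the state described above (the hostname is a placeholder).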

The ZooKeeper cluster has 5 members. Whenever we need to restart the ZooKeeper servers, we restart them one at a time. We restart a server in one of two ways:
 # stop the service and start it back up again.
 # stop the service, replace the host, and start it back up again.

We see the same issue in both cases.

-----------

When we investigated the ZooKeeper logs, we saw the errors/warnings below:

"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted epoch, xy"

-------------------------

However, whenever we check, the current epoch of the leader is always the same as the accepted epoch.
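
A sketch of how the epochs can be compared on disk, in Python. It assumes the usual layout in which each server keeps its values as decimal text in files named acceptedEpoch and currentEpoch under the version-2 directory of its dataDir. The function names are illustrative, and can_follow is only a simplified model of the comparison the log message implies, not ZooKeeper's actual code.

```python
from pathlib import Path


def read_epoch(version_dir, name):
    """Read an epoch file such as acceptedEpoch (a decimal number as text)."""
    return int(Path(version_dir, name).read_text().strip())


def can_follow(leader_epoch, accepted_epoch):
    """Simplified model of the follower-side check behind the log message:
    a leader epoch lower than the locally accepted epoch is rejected."""
    return leader_epoch >= accepted_epoch
```

Running read_epoch against the restarted server's own version-2 directory (rather than the leader's) shows which accepted epoch that server compares the leader's epoch with.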

------------------------

Also, when we read the zxid of every quorum member, they all have the same first byte and differ only in the last two digits, so we assume the members are in sync.
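
To make the zxid comparison concrete: a zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a counter within that epoch. The Python sketch below splits a zxid accordingly; the sample values are made up for illustration and are not taken from our cluster.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


# Hypothetical zxid values in the hex form that server status output uses.
for text in ("0x500000a3f", "0x500000a41"):
    epoch, counter = split_zxid(int(text, 16))
    print(f"{text}: epoch={epoch} counter={counter}")
```

Members whose zxids share the same epoch and differ only slightly in the counter are at nearly the same point in the transaction history.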

Somehow the ZooKeeper server that we are restarting sees the epoch as having advanced and shuts down as a follower.

--------------

Our current workaround for this issue is:

stop the ZooKeeper service --> rename the current ZooKeeper data directory (version-2) --> start it back up again.

The server then immediately joins the cluster as a follower, since it no longer has any record of the epoch, and starts serving clients.
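
The rename step of this workaround can be sketched in Python as follows. It assumes version-2 sits directly under the server's dataDir and that the service is already stopped; the function name and the backup naming scheme are illustrative. Stopping and starting the service is left out because it depends on how ZooKeeper is installed.

```python
import time
from pathlib import Path


def set_aside_version_dir(data_dir, suffix=None):
    """Rename <data_dir>/version-2 so the server starts without local state.

    Call only while the ZooKeeper service on this host is stopped.
    Returns the path of the renamed directory.
    """
    source = Path(data_dir) / "version-2"
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"version-2.bak-{suffix}")
    source.rename(target)
    return target
```

Renaming rather than deleting keeps the old snapshots, transaction logs, and epoch files available for later inspection.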

----------


