Posted to issues@zookeeper.apache.org by "Saswati (Jira)" <ji...@apache.org> on 2020/08/07 17:02:00 UTC
[jira] [Created] (ZOOKEEPER-3909) Zookeeper Unable to Join the
Cluster after it is Restarted
Saswati created ZOOKEEPER-3909:
----------------------------------
Summary: Zookeeper Unable to Join the Cluster after it is Restarted
Key: ZOOKEEPER-3909
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.5.7
Environment: All Environments
Reporter: Saswati
When we restart a ZooKeeper server, it does not successfully rejoin the cluster and start serving clients. The ZooKeeper service starts successfully, but it stays idle and reports the message: "This ZooKeeper instance is not currently serving requests"
The ZooKeeper cluster size is 5. Whenever we need to restart the ZooKeeper servers, we do it one at a time. There are two ways we restart them:
# just stop the service and start it back up again.
# stop the service, replace the host, and start it back up again.
In both cases we see the same issue.
-----------
When we investigated the ZooKeeper logs, we saw the following errors/warnings:
"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted epoch, xy"
-------------------------
But when we check, the current epoch of the leader is always the same as the accepted epoch.
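For anyone who wants to repeat this comparison, here is a minimal sketch in Java. It assumes the epochs are stored as plain decimal text in the files acceptedEpoch and currentEpoch inside the version-2 directory, and that the warning above comes from a simple "leader epoch < accepted epoch" comparison. The class and method names (EpochFiles, parseEpoch, readEpoch, followerWouldReject) are hypothetical helpers, not ZooKeeper APIs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for inspecting epoch files; not part of ZooKeeper. */
final class EpochFiles {

    /** Parses the text content of an epoch file (assumed: one decimal long). */
    static long parseEpoch(String fileContent) {
        return Long.parseLong(fileContent.trim());
    }

    /** Reads an epoch file such as <dataDir>/version-2/acceptedEpoch. */
    static long readEpoch(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        return parseEpoch(new String(raw, StandardCharsets.UTF_8));
    }

    /**
     * Assumed form of the check behind the warning: the follower refuses
     * a leader whose epoch is lower than the follower's accepted epoch.
     */
    static boolean followerWouldReject(long leaderEpoch, long followerAcceptedEpoch) {
        return leaderEpoch < followerAcceptedEpoch;
    }
}
```

Reading acceptedEpoch on the restarted node and comparing it with the leader's epoch via followerWouldReject should show whether the stored value on the restarted node is the one that is ahead.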
------------------------
Also, when we get the zxid of every quorum member, they all have the same first byte and only the last two numbers differ, so I guess we can safely assume they are in sync.
Somehow the ZooKeeper instance that we are restarting sees the epoch advancing and shuts down as a follower.
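To compare zxids more precisely than by eye, the value can be split into its two parts. ZooKeeper documents the zxid as a 64-bit number with the epoch in the high 32 bits and a counter in the low 32 bits. The sketch below relies only on that layout; the class ZxidParts and its methods are hypothetical and do not use the ZooKeeper library.

```java
/** Hypothetical helper that splits a zxid into epoch and counter. */
final class ZxidParts {

    /** Epoch: the high 32 bits of the zxid. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Counter: the low 32 bits of the zxid. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Parses a hex zxid string, with or without a leading "0x". */
    static long parseHex(String text) {
        String digits = text.trim();
        if (digits.startsWith("0x") || digits.startsWith("0X")) {
            digits = digits.substring(2);
        }
        return Long.parseUnsignedLong(digits, 16);
    }
}
```

If every member reports the same value for epoch(...) and only counter(...) differs slightly, the members agree on the epoch, which is the property the comparison above is trying to establish.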
--------------
The workaround we have at the moment for this issue is:
stop the ZooKeeper service --> rename the current ZooKeeper data directory (version-2) --> start it back up again.
It immediately joins the cluster as a follower, since it no longer has any record of the epoch, and starts serving clients.
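The rename step of this workaround can be sketched as follows. This is only an illustration: the class DataDirBackup, the method moveAside, and the backup suffix are hypothetical, and the service is assumed to be stopped before the call and started again afterwards by whatever means the deployment uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for the rename step; not part of ZooKeeper. */
final class DataDirBackup {

    /**
     * Renames <dataDir>/version-2 to <dataDir>/version-2.<suffix> and
     * returns the new path. Fails if the source directory is missing
     * or the backup name already exists.
     */
    static Path moveAside(Path dataDir, String suffix) throws IOException {
        Path current = dataDir.resolve("version-2");
        Path backup = dataDir.resolve("version-2." + suffix);
        return Files.move(current, backup);
    }
}
```

Renaming rather than deleting keeps the old snapshots, transaction logs, and epoch files available for later inspection.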
----------
--
This message was sent by Atlassian Jira
(v8.3.4#803005)