You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Rajan Dhabalia (Jira)" <ji...@apache.org> on 2020/07/28 19:02:00 UTC

[jira] [Created] (ZOOKEEPER-3902) ZooKeeper cluster did not perform leader reelection when leader host crashed

Rajan Dhabalia created ZOOKEEPER-3902:
-----------------------------------------

             Summary: ZooKeeper cluster did not perform leader reelection when leader host crashed
                 Key: ZOOKEEPER-3902
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3902
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
            Reporter: Rajan Dhabalia


I saw this issue in one of the zookeeper cluster where LEADER host crashed due to h/w issue and all the follower hosts immediately closed client connection with leader host but zk-cluster did not trigger leader reelection until we manually restarted few zookeeper host after some time.

Snapshot size: 200MB
ZooKeeper version: 3.4.13


Below logs are from Leader and one of the follower host, when Leader host was crashed.

1. Leader host: zookeeper-4.us-west.com crashed at 16:21:09 and no application logs.
2. Follower host: zookeeper-2.us-west.com immediately closed client-connection with zookeeper-4.us-west.com at 16:21:09
3. None of the Follower hosts triggered LEADER reelection and don't see any such logs
4. All the zk-clients started getting zookeeper session timeouts
5. after 27 minutes at 16:48:30, Follower host found out the issue in quorum and logged
*"Have smaller server identifier, so dropping the connection"*
6. But it still did not trigger leader reelection until we manually restarted zk-process in few zk hosts after 5 mins
7. and finally after restarting process in few zk hosts manually: zk cluster did leader reelection and created the quorum

*Logs:*

*Leader zookeeper: zookeeper-4.us-west.com*
{noformat}
Leader zookeeper: zookeeper-4.us-west.com
Host crashed at : 16:21:10,095 and last log was `16:21:10,095`. and then it came back after an hour: `17:25:27` and joined as FOLLOWER.

16:21:09,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.1:45186
16:21:09,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing mntr command from /0.0.0.1:45186
16:21:10,095 - INFO [Thread-194783:NIOServerCnxn@1056] - Closed socket connection for client /0.0.0.1:45186 (no session established for client)
17:25:27,262 - INFO [main:QuorumPeerConfig@136] - Reading configuration from: 
{noformat}

 

*Follower zookeeper: zookeeper-2.us-west.com*
{noformat}
16:21:09,743 - INFO [Thread-78288:NIOServerCnxn@1056] - Closed socket connection for client /4.4.4.4:52216 (zookeeper-4.us-west.com-IP) (no session established for client)
:
# ZK host didn't trigger the LEADER-REELECTION even despite zookeeper-4.us-west.com crashed and lost connection with that host at : `16:21:09,743`
# Client was keep connecting and keep disconnecting due to zk session timeout
16:48:29,961 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.2:43938
16:48:29,965 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /0.0.0.2:43938
16:48:29,977 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /0.0.0.3:54630
16:48:29,981 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /0.0.0.3:54630
:
# After 27 Minutes: Saw first log from QuorumCnxManager and zk host was figuring out if needs leaer-relection
16:48:30,475 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (3, 2)
16:48:30,475 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - Notification: 1 (message format version), 2 (n.leader), 0x6d00328f73 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x6d (n.peerEpoch) LOOKING (my state)
16:48:30,476 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@908] - Processing mntr command from /2.2.2.2:55136
16:48:30,476 - INFO [WorkerReceiver[myid=2]:FastLeaderElection@595] - Notification: 1 (message format version), 4 (n.leader), 0x6c00b36a5a (n.zxid), 0x35e (n.round), LOOKING (n.state), 1 (n.sid), 0x6c (n.peerEpoch) LOOKING (my state)
16:48:30,476 - INFO [prod1-zk2.messaging.gq1.yahoo.com/2.2.2.2:2183:QuorumCnxManager$Listener@743] - Received connection request /3.3.3.3:34092
16:48:30,476 - INFO [WorkerSender[myid=2]:QuorumCnxManager@347] - Have smaller server identifier, so dropping the connection: (4, 2)
:
# After manually restarting follower host, finally follower hosts 
16:50:53,986 - INFO [main:QuorumPeerConfig@398] - Defaulting to majority quorums
16:50:53,991 - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
16:50:53,991 - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
16:50:53,992 - INFO [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
16:50:54,003 - INFO [main:QuorumPeerMain@130] - Starting quorum peer
16:50:54,026 - INFO [main:FileSnap@86] - Reading snapshot
:
16:55:04,377 - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@65] - FOLLOWING - LEADER ELECTION TOOK - 233617
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)