You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Andrew January <aj...@evertz.com> on 2018/09/14 15:57:35 UTC

Issue electing leader, suspect delay in processing notifications

I have a 3 node cluster on 3.4.12. One of the nodes (node 2) was taken down for maintenance.
The remaining two nodes seem to have issues gaining quorum.

I think the issues is that node 1 is taking 5 seconds to process notifications.

From the logs, node 1 and node 3 both agree that node 3 should be leader.
Node 1 sends out it's notification that it's following node 3, then immediately tries to connect to it to follow.
Because node 3 takes 5 seconds to pick up the notification that node 1 accepts it as leader, it isn't accepting connections to be followed.
This means node 1 times out trying to connect to node 3 before node 3 becomes leader.
Node 1 starts a new election. Node 3 times out waiting for acknowledgement that it's the leader. The process repeats.

My evidence that node 1 is taking 5 seconds to process notifications is:
1) Node 3 starts a new election, which broadcasts a notification with a new round to itself and node 1
2) Node 1 gets the message almost immediately
3) Node 3 doesn't get its own message until 5 seconds later


node3: 2018-09-14 09:28:50,534 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:28:50,535 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e0 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:28:55,539 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e0 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

node3: 2018-09-14 09:29:25,544 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:29:25,546 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e1 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:29:30,547 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e1 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

node3: 2018-09-14 09:30:00,772 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:30:00,774 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e2 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:30:05,777 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e2 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

Similarly, when node 1 tries to start an election:
1) Node 1 starts a new election, which has a new round
2) Node 1 gets the message almost immediately
3) Node 3 doesn't get the message until 5 seconds later

node1: 2018-09-14 09:31:24,853 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  1, proposed zxid=0x600001ee4
node1: 2018-09-14 09:31:25,653 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e5 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:31:30,654 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e5 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LEADING (my state)

node1: 2018-09-14 09:32:00,072 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  1, proposed zxid=0x600001ee4
node1: 2018-09-14 09:32:00,680 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e6 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:32:05,685 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e6 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LEADING (my state)

I've looked through the code but I can't see any reason it would have a consistent 5 second delay when processing notifications. Any ideas?
This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender (as shown above). Kindly do not reproduce, print or forward any material received in error, please delete it immediately. Evertz UK Limited (Company No. 3458137) is incorporated in England and Wales and has its registered office at 100 Berkshire Place, Wharfedale Road, Winnersh, Wokingham, Berkshire, United Kingdom, RG41 5RD. Evertz Singapore Pte Limited (Company No. 200817005N) is incorporated in Singapore and has its registered office at One Marina Boulevard, #28-00. Singapore 018989.

RE: Issue electing leader, suspect delay in processing notifications

Posted by Andrew January <aj...@evertz.com>.
Interestingly, the delay went away at the same time as node2 was brought back up.

2018-09-14 09:33:30,944 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
2018-09-14 09:33:30,945 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e8 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
2018-09-14 09:33:30,946 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e8 (n.round), LEADING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
2018-09-14 09:33:30,947 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e8 (n.round), LOOKING (n.state), 2 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

-----Original Message-----
From: Andrew January [mailto:ajanuary@evertz.com]
Sent: 14 September 2018 16:58
To: user@zookeeper.apache.org
Subject: Issue electing leader, suspect delay in processing notifications

I have a 3 node cluster on 3.4.12. One of the nodes (node 2) was taken down for maintenance.
The remaining two nodes seem to have issues gaining quorum.

I think the issues is that node 1 is taking 5 seconds to process notifications.

From the logs, node 1 and node 3 both agree that node 3 should be leader.
Node 1 sends out it's notification that it's following node 3, then immediately tries to connect to it to follow.
Because node 3 takes 5 seconds to pick up the notification that node 1 accepts it as leader, it isn't accepting connections to be followed.
This means node 1 times out trying to connect to node 3 before node 3 becomes leader.
Node 1 starts a new election. Node 3 times out waiting for acknowledgement that it's the leader. The process repeats.

My evidence that node 1 is taking 5 seconds to process notifications is:
1) Node 3 starts a new election, which broadcasts a notification with a new round to itself and node 1
2) Node 1 gets the message almost immediately
3) Node 3 doesn't get its own message until 5 seconds later


node3: 2018-09-14 09:28:50,534 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:28:50,535 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e0 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:28:55,539 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e0 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

node3: 2018-09-14 09:29:25,544 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:29:25,546 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e1 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:29:30,547 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e1 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

node3: 2018-09-14 09:30:00,772 - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  3, proposed zxid=0x600001f52
node1: 2018-09-14 09:30:00,774 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e2 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:30:05,777 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 3 (n.leader), 0x600001f52 (n.zxid), 0x4e2 (n.round), LOOKING (n.state), 3 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)

Similarly, when node 1 tries to start an election:
1) Node 1 starts a new election, which has a new round
2) Node 1 gets the message almost immediately
3) Node 3 doesn't get the message until 5 seconds later

node1: 2018-09-14 09:31:24,853 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  1, proposed zxid=0x600001ee4
node1: 2018-09-14 09:31:25,653 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e5 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:31:30,654 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e5 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LEADING (my state)

node1: 2018-09-14 09:32:00,072 - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@813] - New election. My id =  1, proposed zxid=0x600001ee4
node1: 2018-09-14 09:32:00,680 - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e6 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LOOKING (my state)
node3: 2018-09-14 09:32:05,685 - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message format version), 1 (n.leader), 0x600001ee4 (n.zxid), 0x4e6 (n.round), LOOKING (n.state), 1 (n.sid), 0x6 (n.peerEpoch) LEADING (my state)

I've looked through the code but I can't see any reason it would have a consistent 5 second delay when processing notifications. Any ideas?
This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender (as shown above). Kindly do not reproduce, print or forward any material received in error, please delete it immediately. Evertz UK Limited (Company No. 3458137) is incorporated in England and Wales and has its registered office at 100 Berkshire Place, Wharfedale Road, Winnersh, Wokingham, Berkshire, United Kingdom, RG41 5RD. Evertz Singapore Pte Limited (Company No. 200817005N) is incorporated in Singapore and has its registered office at One Marina Boulevard, #28-00. Singapore 018989.
This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender (as shown above). Kindly do not reproduce, print or forward any material received in error, please delete it immediately. Evertz UK Limited (Company No. 3458137) is incorporated in England and Wales and has its registered office at 100 Berkshire Place, Wharfedale Road, Winnersh, Wokingham, Berkshire, United Kingdom, RG41 5RD. Evertz Singapore Pte Limited (Company No. 200817005N) is incorporated in Singapore and has its registered office at One Marina Boulevard, #28-00. Singapore 018989.