You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Ivo Vrdoljak (Jira)" <ji...@apache.org> on 2023/11/22 19:26:00 UTC

[jira] [Created] (ZOOKEEPER-4771) Fast leader election taking too long

Ivo Vrdoljak created ZOOKEEPER-4771:
---------------------------------------

             Summary: Fast leader election taking too long
                 Key: ZOOKEEPER-4771
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4771
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.4.10
            Reporter: Ivo Vrdoljak
         Attachments: zookeeper10.log, zookeeper11.log, zookeeper12.log, zookeeper20.log, zookeeper21.log

Hello ZooKeeper Community,
 
Background:
We are using ZooKeeper version 3.4.10. in our system and we have 5 Zookeeper servers running, that are distributed across 2 clusters of servers.
In the first cluster, we have 3 Zookeeper servers, each deployed on its own machine, and in the second cluster we have 2 Zookeeper servers, also each on its own machine. Zookeeper servers that are distributed on the same cluster communicate through the local network, and with the servers on the remote cluster through an external network.
The situation is the following:

 
{code:java}
Cluster 1
Zookeeper server 10
Zookeeper server 11
Zookeeper server 12 -> Leader
Cluster 2
Zookeeper server 20
Zookeeper server 21
{code}
 

Problem:
We have an issue with Fast Leader Election when we kill the ZooKeeper leader process.
After the leader (server 12) is killed and leader election starts, we can see in Zookeeper logs that voting notifications are exchanged from each Zookeeper server that remained alive towards all the others. Notification on Zookeeper servers located in the same cluster (communicating over the local network) are successfully exchanged. The problem seems to be for Zookeeper server sending votes over external network as according to the logs they are only sent in one direction.
 
Logs from zookeeper server 10:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Initializing leader election protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Updating proposal: 10 (newleader), 0xe00009c97 (newzxid), 12 (oldleader), 0xd00001380 (oldzxid)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: INFO  New election. My id =  10, proposed zxid=0xe00009c97
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Sending Notification: 10 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 10 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=10, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:15 sc_2_1 BC[myid: 10]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
Logs from zookeeper server 20:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: INFO  LOOKING
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Initializing leader election protocol...
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 20 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=11, proposed leader=11, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=20, proposed leader=20, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 20 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 21 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 11 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 12 (recipient), 20 (myid), 0xe (n.peerEpoch)
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=21, proposed leader=21, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11
Nov 22 10:31:14 sc_2_1 BC[myid: 20]: DEBUG Adding vote: from=10, proposed leader=10, proposed zxid=0xe00009c97, proposed election epoch=0x11{code}
These are just some of the lines from the log relevant to the issue and complete logs from all the servers are attached as well as the server configuration.

As it can be seen from the logs the server with ID 20 is sending notifications to all cluster members but in the logs from server 10 there is no indication that a notification from server 20 is received. However, it is obvious from the logs that server 20 received notifications from server 10 at the same time. This part is a bit strange since the same socket is used for communication in both directions.
The first suspect was the network, but after investigation by following tcpdump we could not find any packages leaving from the cluster where the Zk server 20 is located which is sending notifications.

About almost half a minute later, the server realizes that there is no connection to Zk servers 10 and 11 and then reopens channels towards them. After that, the notifications are exchanged and the quorum is formed quickly after:
{code:java}
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 11, my id = 20, error = #012java.net.SocketException: Connection timed out (Read failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) ~[?:1.8.0_261]#012#011at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
Nov 22 10:31:40 sc_2_1 BC[myid: 20]: WARN  Connection broken for id 10, my id = 20, error = #012java.net.SocketException: Connection timed out (Read failed)#012#011at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.socketRead(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.net.SocketInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.fill(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.BufferedInputStream.read(Unknown Source) ~[?:1.8.0_261]#012#011at java.io.DataInputStream.readInt(Unknown Source) ~[?:1.8.0_261]#012#011at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) [zookeeper-3.4.10.jar:3.4.10-5f5438876cc9050ce4380a2780ce07d063aed8d7]
.....
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  Created server with tickTime 1500 minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/version-2 snapdir /local/version-2
Nov 22 10:31:41 sc_2_1 BC[myid: 20]: INFO  FOLLOWING - LEADER ELECTION TOOK - 27333{code}
We have two main questions related to this issue if you could help:

1) Is there any way to know for sure that the vote notification was actually sent other than the following log:
{code:java}
Nov 22 10:31:13 sc_2_1 BC[myid: 20]: DEBUG Sending Notification: 21 (n.leader), 0xe00009c97 (n.zxid), 0x11 (n.round), 10 (recipient), 20 (myid), 0xe (n.peerEpoch){code}
Or is there any chance that the receiving server would discard the notification for some reason and not even log the event, since we cannot see anything in the logs even with debug enabled?

2) Are there any options to reduce this timeout since half a minute for leader election is too much for our highly available product with only 5 zookeeper servers? Is there any configuration parameter that could be tuned?

Also, do you know if the same or similar issue has been already reported or fixed in some of the later versions of the product? We could not find something similar in Zookeeper Jira.

 
Thanks and regards,
Ivo
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)