Posted to dev@zookeeper.apache.org by "Luke Chen (Jira)" <ji...@apache.org> on 2023/07/26 12:11:00 UTC

[jira] [Created] (ZOOKEEPER-4724) follower can't connect to the right leader and quorum failed to form

Luke Chen created ZOOKEEPER-4724:
------------------------------------

             Summary: follower can't connect to the right leader and quorum failed to form
                 Key: ZOOKEEPER-4724
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.4
            Reporter: Luke Chen


On entering the "following - discovery" state, a follower connects to the leader node so that a quorum can form. Recently, however, a user hit a case where the follower could not connect to the correct leader and the quorum failed to form. The log shows the follower trying to connect to itself (0.0.0.0:2888) instead of the leader. After 5 retries, a new election starts and the whole sequence repeats: the node becomes a follower, tries to connect to itself, fails, and so on indefinitely.

 

The log looks like this:
{code:java}
2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS (org.apache.zookeeper.server.quorum.Learner) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init limit=10000, connecting to /0.0.0.0:2888 (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
    at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
    at java.base/java.net.Socket.connect(Socket.java:633)
    at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
    at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
    at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
    at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366){code}
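
To help spot this symptom in other logs, here is a minimal sketch (not ZooKeeper code) that extracts the connect target from the WARN line above and checks whether it is the wildcard address. The helper name `connect_target` and the regular expression are assumptions made for illustration; the pattern only handles IPv4 targets printed in the `/ip:port` form seen in this log.

```python
import ipaddress
import re

# WARN line copied from the log above.
WARN_LINE = (
    "2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, "
    "remaining init limit=10000, connecting to /0.0.0.0:2888 "
    "(org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]"
)

# Matches "connecting to [hostname]/ip:port"; the hostname part may be empty.
CONNECT_RE = re.compile(
    r"connecting to (?P<host>[^/\s]*)/(?P<ip>[0-9.]+):(?P<port>\d+)"
)


def connect_target(line):
    """Return (ip, port) from a 'connecting to' log line, or None if absent."""
    match = CONNECT_RE.search(line)
    if match is None:
        return None
    return match["ip"], int(match["port"])


target = connect_target(WARN_LINE)
# True when the follower is dialing the unspecified (wildcard) address
# rather than a concrete peer address.
targets_wildcard = (
    target is not None and ipaddress.ip_address(target[0]).is_unspecified
)
print(target, targets_wildcard)
```

A check like this only flags the symptom; it says nothing about why the follower resolved the leader to that address.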
 

One thing I noticed: this issue started after "Restarting leader election" occurred on the follower node. I am not sure whether it is related.

 

*The configuration and setup:*
 # 2 ZooKeeper nodes
 # On each ZooKeeper node, we set the node's own IP to 0.0.0.0 to work around a slow DNS issue in k8s.
For node 1, we have:

{code:java}
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181{code}
For node 2, we have:

{code:java}
server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181 {code}
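
To make the setup concrete, the following sketch parses node 1's `server.N` lines from above and reports which entries use the wildcard address. It is an illustration only and does not reproduce ZooKeeper's own config parsing or leader-address resolution; `parse_servers` and `is_wildcard` are hypothetical helper names, and the simple `:` split assumes hostnames or IPv4 literals, as in this configuration.

```python
import ipaddress

# Node 1's server entries, copied from the configuration above.
NODE1_CONFIG = """\
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
"""


def parse_servers(config_text):
    """Return {server_id: (host, quorum_port, election_port)}."""
    servers = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("server."):
            continue
        key, value = line.split("=", 1)
        server_id = int(key[len("server."):])
        # Drop the client address after ';' and keep host:quorum:election.
        quorum_part = value.split(";", 1)[0]
        host, quorum_port, election_port = quorum_part.split(":")[:3]
        servers[server_id] = (host, int(quorum_port), int(election_port))
    return servers


def is_wildcard(host):
    """True if host is a literal unspecified address such as 0.0.0.0."""
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        return False  # a hostname, not an IP literal


servers = parse_servers(NODE1_CONFIG)
wildcard_ids = [sid for sid, (host, _, _) in servers.items() if is_wildcard(host)]
print(wildcard_ids)
```

With this setup, each node's own entry is the only one that carries 0.0.0.0, which matches the address the follower is seen dialing in the log.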

Logs:

[zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
[zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)