Posted to issues@zookeeper.apache.org by "Arun Subramanian R (Jira)" <ji...@apache.org> on 2021/06/11 10:50:00 UTC
[jira] [Created] (ZOOKEEPER-4316) Leader election fails due to SocketTimeoutException in QuorumCnxManager
Arun Subramanian R created ZOOKEEPER-4316:
---------------------------------------------
Summary: Leader election fails due to SocketTimeoutException in QuorumCnxManager
Key: ZOOKEEPER-4316
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4316
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.5.7, 3.4.12
Environment: cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"
docker version
Client:
Version: 20.10.6-ce
API version: 1.41
Go version: go1.13.15
Git commit: 8728dd246c3a
Built: Tue Apr 27 09:45:18 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.6-ce
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 8728dd246c3a
Built: Fri Apr 9 22:06:18 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.4
GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
runc:
Version: 1.0.0-rc93
GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
docker-init:
Version: 0.1.3_catatonit
GitCommit:
zookeeper version - 3.5.7
Reporter: Arun Subramanian R
Attachments: docker-entrypoint.sh, zoo.cfg, zoo_3.5.7.yml
I have a 3-node ZooKeeper cluster deployed as a stack using Docker Swarm.
Deploying this stack causes ZooKeeper to fail with a SocketTimeoutException during leader election, with the following log:
{noformat}
2021-06-11 03:59:34,607 [myid:2] - WARN [QuorumPeer[myid=2]/0.0.0.0:2181:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo3/10.0.11.5:3888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)
{noformat}
The Docker overlay network itself appears to be sound. A netstat on one of the nodes outputs:
{noformat}
bash-4.4# netstat -tuln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:3888 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:42941 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.11:35453 0.0.0.0:* LISTEN
udp 0 0 127.0.0.11:55009 0.0.0.0:*
{noformat}
This shows that port 3888 is open, but a tcpdump shows only sent packets and retransmissions, with no responses on port 3888.
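For anyone trying to reproduce the check, here is a minimal sketch of a TCP probe that could be run from inside one container against the election port of its peers. It is not part of the attached files. The hostnames zoo1 and zoo2 are assumptions (only zoo3 appears in the log above), and probe_tcp is a hypothetical helper name. As a rough guide, a "timeout" result would be consistent with packets being silently dropped somewhere on the path, as the tcpdump suggests, while "refused" would usually mean the peer is reachable but nothing is listening on that port.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No SYN-ACK arrived within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but no listener.
        return "refused"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "unresolved"
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    # Hypothetical peer names; adjust to the service names in the stack file.
    for peer in ("zoo1", "zoo2", "zoo3"):
        print(peer, 3888, probe_tcp(peer, 3888))
```

Running the same probe from each of the three containers may help narrow down whether the failure affects one direction, one host, or the whole overlay network.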
Suspecting that the issue may be due to a short timeout or a small number of retries, I tried increasing cnxTimeout to 300000 and setting electionPortBindRetry to 0 (infinite), but even after 13 hours of continuous running and retrying the election, the same error persists.
I have attached the stack.yml, the custom docker-entrypoint.sh that we override on top of the official container to enable running from a root host user, and the zoo.cfg file from inside the container.
Any help in identifying the underlying issue or misconfiguration, or any configuration parameter that may help solve the issue, would be greatly appreciated.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)