Posted to issues@zookeeper.apache.org by "Arun Subramanian R (Jira)" <ji...@apache.org> on 2021/06/11 10:50:00 UTC

[jira] [Created] (ZOOKEEPER-4316) Leader election fails due to SocketTimeoutException in QuorumCnxManager

Arun Subramanian R created ZOOKEEPER-4316:
---------------------------------------------

             Summary: Leader election fails due to SocketTimeoutException in QuorumCnxManager
                 Key: ZOOKEEPER-4316
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4316
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.5.7, 3.4.12
         Environment: cat /etc/os-release
NAME="SLES"
VERSION="12-SP5"
VERSION_ID="12.5"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP5"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp5"

docker version
Client:
 Version: 20.10.6-ce
 API version: 1.41
 Go version: go1.13.15
 Git commit: 8728dd246c3a
 Built: Tue Apr 27 09:45:18 2021
 OS/Arch: linux/amd64
 Context: default
 Experimental: true

Server:
 Engine:
 Version: 20.10.6-ce
 API version: 1.41 (minimum version 1.12)
 Go version: go1.13.15
 Git commit: 8728dd246c3a
 Built: Fri Apr 9 22:06:18 2021
 OS/Arch: linux/amd64
 Experimental: false
 containerd:
 Version: v1.4.4
 GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
 Version: 1.0.0-rc93
 GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
 Version: 0.1.3_catatonit
 GitCommit:

zookeeper version - 3.5.7
            Reporter: Arun Subramanian R
         Attachments: docker-entrypoint.sh, zoo.cfg, zoo_3.5.7.yml

I have a 3-node ZooKeeper cluster deployed as a stack using Docker Swarm.
Deploying this stack causes ZooKeeper to fail with a SocketTimeoutException during leader election, with the following log:

{noformat}
2021-06-11 03:59:34,607 [myid:2] - WARN  [QuorumPeer[myid=2]/0.0.0.0:2181:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo3/10.0.11.5:3888
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)
{noformat}
The docker overlay network itself appears to be sound. A netstat on one of the nodes outputs
{noformat}
bash-4.4# netstat -tuln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3888            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:42941           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.11:35453        0.0.0.0:*               LISTEN
udp        0      0 127.0.0.11:55009        0.0.0.0:*
{noformat}
This shows that port 3888 is open, but a tcpdump shows only outgoing packets and retransmissions; no responses arrive on port 3888.
Suspecting that the issue may be due to a short timeout or a small number of retries, I tried increasing cnxTimeout to 300000 and setting electionPortBindRetry to 0 (infinite), but even after 13 hours of continuous running and election retries the same error persists.
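
The failing step in the stack trace above is a plain TCP connect with a timeout (Socket.connect called from QuorumCnxManager.connectOne), so it can be reproduced outside ZooKeeper. The following is a minimal sketch, not part of ZooKeeper: the function name probe_tcp and the 5-second default are assumptions made for illustration, and it presumes a Python interpreter is available in the container or on a host attached to the same overlay network.

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Attempt a single TCP connect and classify the outcome.

    Hypothetical helper for illustration; it only mirrors the
    connect-with-timeout step seen in the stack trace.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except socket.timeout:
        # No reply to the SYN within timeout_s.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Calling probe_tcp("zoo3", 3888) from the myid=2 container would be expected to return "timeout" if the behaviour matches the log. A "timeout" result typically means the packets are being dropped somewhere on the path, whereas "refused" would mean the peer is reachable but nothing is listening on that port.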

I have attached the stack.yml, the custom docker-entrypoint.sh that we override on top of the official container to enable running from a root host user, and the zoo.cfg file from inside the container.
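
Since the attached zoo.cfg is not reproduced in this message, the sketch below only illustrates how the election addresses could be listed from such a file so they can be compared with the address in the log (zoo3/10.0.11.5:3888). It assumes the usual server.<id>=<host>:<quorumPort>:<electionPort> entries, optionally followed by a role and a ";clientPort" suffix as in 3.5.x; the function name parse_servers is hypothetical, and bracketed IPv6 addresses are not handled.

```python
def parse_servers(cfg_text: str) -> dict:
    """Map server id -> (host, quorum_port, election_port) from zoo.cfg text.

    Hypothetical helper for illustration; lines that are not
    'server.<id>=...' entries are ignored.
    """
    servers = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if not key.startswith("server."):
            continue
        server_id = int(key[len("server."):])
        # Drop the optional ';clientPort' suffix used by 3.5.x dynamic config.
        address = value.split(";", 1)[0]
        fields = address.split(":")
        if len(fields) < 3:
            continue  # not in host:quorumPort:electionPort form
        servers[server_id] = (fields[0], int(fields[1]), int(fields[2]))
    return servers
```

For example, passing the contents of zoo.cfg to parse_servers gives the host and election port that each peer will dial, which can then be checked against the names resolved inside each container.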

Any help in identifying the underlying issue or mis-configuration, or any configuration parameter that may help solve the issue is deeply appreciated.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)