Posted to dev@zookeeper.apache.org by "ko christ (Jira)" <ji...@apache.org> on 2020/06/25 12:48:00 UTC

[jira] [Created] (ZOOKEEPER-3871) Dockerized Zookeeper clients fail on Zookeeper leader changes

ko christ created ZOOKEEPER-3871:
------------------------------------

             Summary: Dockerized Zookeeper clients fail on Zookeeper leader changes
                 Key: ZOOKEEPER-3871
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3871
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.8, 3.6.1, 3.5.5
            Reporter: ko christ


h2. Description

In a nutshell, my dockerized Zookeeper installation stops working on cluster leader changes.

The cluster responds to 4-letter commands, but when I force a leader change, clients time out indefinitely. A workaround is to run follow-up restarts, which usually resolve the issue once the leader returns to its previous node. This affects the high availability of the cluster.
h2. Example

For example, assume that a 3-node ZK cluster has the following initial state (*State A*); all Zookeeper clients work fine in this state.
||ZK 1||ZK 2||ZK 3||
|follower|follower|*leader*|

 

and that a restart occurs after which Zookeeper ends up in this state (*State B*)
||ZK 1||ZK 2||ZK 3||
|follower|*leader*|follower|

In State B, all client connection attempts fail and time out indefinitely. Follow-up leader restarts may resolve the issue, usually (but not always) because the cluster *returns to the previous State A*.
h2. Affected versions

I have verified this bug using
 * *{{3.5.5}}*
 * *{{3.5.8}}*
 * *{{3.6.1}}*

h2. Reproduce

{color:#de350b}Note: In all the examples below, replace tortoise with your hostname.{color}

Deploy a 3-node Zookeeper cluster (could be 5-node) using the official 3.5.8 image.
{code:bash}
docker run -d --name=zkcl01 -p 1493:1493 -p 1494:1494 -p 1495:1495 -h tortoise-zkcl01 -e HOSTNAME=tortoise -e ZOO_PORT=1493 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=0.0.0.0:1495:1494;1493 server.2=tortoise:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e ZOO_MY_ID=1 zookeeper:3.5.8
docker run -d --name=zkcl02 -p 1496:1496 -p 1497:1497 -p 1498:1498 -h tortoise-zkcl02 -e HOSTNAME=tortoise -e ZOO_PORT=1496 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=0.0.0.0:1498:1497;1496 server.3=tortoise:1501:1500;1499" -e ZOO_MY_ID=2 zookeeper:3.5.8
docker run -d --name=zkcl03 -p 1499:1499 -p 1500:1500 -p 1501:1501 -h tortoise-zkcl03 -e HOSTNAME=tortoise -e ZOO_PORT=1499 -e ZOO_LOG4J_PROP="INFO,CONSOLE,ROLLINGFILE" -e ZOO_4LW_COMMANDS_WHITELIST=srvr,ruok,mntr,stat -e ZOO_STANDALONE_ENABLED=False -e ZOO_SERVERS="server.1=tortoise:1495:1494;1493 server.2=tortoise:1498:1497;1496 server.3=0.0.0.0:1501:1500;1499" -e ZOO_MY_ID=3 zookeeper:3.5.8
{code}
 

Monitor the cluster's state with the 4-letter {{srvr}} command
{code:bash}
watch -n 1 'for i in 1493 1496 1499; do echo $i; echo srvr | nc tortoise $i ; echo; done'{code}
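As a small sketch (not part of the original report), the {{srvr}} output can also be parsed programmatically to find which port currently hosts the leader. This assumes the same hostname ({{tortoise}}) and client ports (1493, 1496, 1499) as above; {{parse_mode}} and {{find_leader}} are hypothetical helper names.

```shell
# Hypothetical helpers, assuming the hostname and client ports used above.

# parse_mode extracts the value of the "Mode:" line from `srvr` output on stdin.
parse_mode() {
  awk '/^Mode:/ {print $2}'
}

# find_leader polls each client port and prints the first one reporting
# "Mode: leader"; returns non-zero if no leader answered.
find_leader() {
  local port mode
  for port in 1493 1496 1499; do
    mode=$(echo srvr | nc -w 2 tortoise "$port" | parse_mode)
    if [ "$mode" = "leader" ]; then
      echo "$port"
      return 0
    fi
  done
  return 1
}
```

This avoids eyeballing the {{watch}} output when scripting the reproduction steps.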
 

Verify that you can connect to the cluster successfully using any client ({{zkCli.sh}} in this case)
{code:bash}
docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
WatchedEvent state:SyncConnected type:None path:null
[zookeeper]{code}
 

Stop/start the leader node (based on the {{srvr}} output from the previous step) in order to force a leader change.
{code:bash}
docker stop zkcl03; sleep 15; docker start zkcl03{code}
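The stop/sleep/start step can be generalized so that whichever node is the current leader gets bounced. The following is a sketch, assuming the container names ({{zkcl01}}–{{zkcl03}}), hostname, and port-to-container mapping from the deployment above; {{bounce_leader}} is a hypothetical helper name.

```shell
# Hypothetical sketch: map each client port to its container name, assuming
# the deployment commands above.
declare -A by_port=([1493]=zkcl01 [1496]=zkcl02 [1499]=zkcl03)

# bounce_leader finds the port reporting "Mode: leader" via `srvr` and
# stop/sleep/starts the corresponding container to force a leader change.
bounce_leader() {
  local port mode
  for port in "${!by_port[@]}"; do
    mode=$(echo srvr | nc -w 2 tortoise "$port" | awk '/^Mode:/ {print $2}')
    if [ "$mode" = "leader" ]; then
      docker stop "${by_port[$port]}"; sleep 15; docker start "${by_port[$port]}"
      return 0
    fi
  done
  echo "no leader found" >&2
  return 1
}
```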
 

Verify that the client now fails to connect and times out.
{code:bash}
docker exec -ti zkcl01 bin/zkCli.sh -server tortoise:1493,tortoise:1496,tortoise:1499 ls /
...
...
closing socket connection and attempting reconnect
KeeperErrorCode = ConnectionLoss for /{code}
 

Finally, -restart- stop/sleep/start the leader a few more times to verify that the client usually succeeds once the leader returns to its initial node.
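The repeated restarts can be automated as a sketch like the following, which bounces a given container until the client connects again and reports how many extra restarts were needed. It assumes the container names and connect string used earlier; {{bounce_until_recovered}} is a hypothetical helper name, and the leader is assumed to stay on the bounced node between rounds, which may not always hold.

```shell
# Hypothetical reproduction loop, assuming the containers and connect string
# from the steps above. Bounces the given container until a client connect
# succeeds, counting how many restarts it took.
bounce_until_recovered() {
  local leader_container=$1 attempts=0
  until docker exec zkcl01 bin/zkCli.sh \
      -server tortoise:1493,tortoise:1496,tortoise:1499 ls / >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    docker stop "$leader_container"; sleep 15; docker start "$leader_container"
    sleep 10  # let the quorum settle before the next client attempt
  done
  echo "client recovered after $attempts leader restart(s)"
}

# Usage (assuming zkcl03 currently holds leadership):
#   bounce_until_recovered zkcl03
```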

 

This must be a bug unless there is a misconfiguration that I am missing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)