You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Lara (JIRA)" <ji...@apache.org> on 2019/07/30 11:57:00 UTC

[jira] [Created] (ZOOKEEPER-3478) Leader restart shuts down all the followers

Lara created ZOOKEEPER-3478:
-------------------------------

             Summary: Leader restart shuts down all the followers
                 Key: ZOOKEEPER-3478
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3478
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.4.10
            Reporter: Lara


Hello ZooKeeper Community,

Could you please help me with at least clarifying a few doubts related to ZooKeeper 3.4.10?
 We have 2 servers in our system, one with 2 Zookeeper servers and the one with 3 - meaning that in case of failure of the server with 3 Zookeeper servers, the quorum cannot be achieved.

*Server 11*
 Zookeeper server 10
 Zookeeper server 11
 Zookeeper server 12

*Server 12*
 Zookeeper server 20
 Zookeeper server 21 -> Leader at the beginning of the procedure

As we were changing something in the configuration, it was needed to restart our servers, and to keep the quorum up, we restarted servers one by one (first on the one with 3 servers and then the other with 2 servers).
 During the restart of the one with 3 servers, the quorum was not lost - since we restarted one by one.
 Then we tried to restart the servers on the other one where we have 2 Servers deployed, one by one also. 
 The restart was executed in a small amount of time. After we restarted the first server 20 (follower) it joined the quorum with no errors, as expected. 
 *After we restarted the Leader server (21), all followers started to shut down!*

We had the same log on all the followers, but here is the example from the follower 20:
{panel}
Jun 27 14:49:31 [myid: 20]: WARN Connection broken for id 21, my id = 20, error =
 Jun 27 14:49:31 javaOFException
 Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
 Jun 27 14:49:31 [myid: 20]: INFO Accepted socket connection from /192.168.1.116:18532
 Jun 27 14:49:31 [myid: 20]: WARN Exception when following the leader
 Jun 27 14:49:31 OFException
 Jun 27 14:49:31 at java.io.DataInputStream.readInt(Unknown Source)
 Jun 27 14:49:31 at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
 Jun 27 14:49:31 at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
 Jun 27 14:49:31 [myid: 20]: WARN Connection request from old client /192.168.1.116:18532; will be dropped if server is in r-o mode
 Jun 27 14:49:31 [myid: 20]: INFO Notification: 1 (message format version), 12 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 (n.sid), 0x66 (n.peerEpoch) FOLLOWING (my state)
 Jun 27 14:49:31 [myid: 20]: WARN Interrupting SendWorker
 Jun 27 14:49:31 [myid: 20]: INFO Client attempting to renew session 0xa6b9dc92aa60200 at /192.168.1.116:18532
 Jun 27 14:49:31 [myid: 20]: INFO shutdown called
 Jun 27 14:49:31 java.lang.Exception: shutdown Follower
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
 Jun 27 14:49:31 [myid: 20]: INFO Revalidating client: 0xa6b9dc92aa60200
 Jun 27 14:49:31 [myid: 20]: WARN Interrupted while waiting for message on queue
 Jun 27 14:49:31 java.InterruptedException
 Jun 27 14:49:31 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
 Jun 27 14:49:31 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
 Jun 27 14:49:31 at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
 Jun 27 14:49:31 at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
{panel}
*Is it expected that Leader in case of its restart triggers shut down of all its followers?* 
 This seems to me as an unexpected behavior, but maybe I'm wrong.

 

After this step and the servers are up again, 20 tries to become a Leader and server 21 accepts it and tries to follow the new Leader.
 20 received ACK messages from itself and from 21. 
 There are also notifications sent about a new Leader to all other Zookeeper servers:
{panel}
Jun 27 14:49:31 [myid: 20]: INFO LEADING
 Jun 27 14:49:31 [myid: 20]: INFO Created server with tickTime 1500 minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/cudb/BCServer/version-2 snapdir /local/cudb/BCServer/version-2
 Jun 27 14:49:31 [myid: 20]: INFO LEADING - LEADER ELECTION TOOK - 213
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 20 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 20 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 10 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 20 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 11 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x1 (n.round), LOOKING (n.state), 21 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 21 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Follower sid: 21 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@466717f0
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 12 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 11 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 10 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 11 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
 Jun 27 14:49:32 [myid: 20]: INFO Notification: 1 (message format version), 21 (n.leader), 0x66000012c7 (n.zxid), 0x19 (n.round), LOOKING (n.state), 10 (n.sid), 0x66 (n.peerEpoch) LEADING (my state)
{panel}
 

From the servers 10, 11, 12 (located on the server with 3 ZooKeeper servers) it can be seen they all entered the state 'FOLLOWING' and from this step we would expect the Leader to start leading, and followers to start following:
{panel}
Jun 27 14:49:32 [myid: 12]: INFO FOLLOWING
 Jun 27 14:49:32 [myid: 12]: INFO Created server with tickTime 1500 minSessionTimeout 3000 maxSessionTimeout 30000 datadir /local/cudb/BCServer/version-2 snapdir /local/cudb/BCServer/version-2
 Jun 27 14:49:32 [myid: 12]: INFO FOLLOWING - LEADER ELECTION TOOK - 1217
{panel}
 

But, servers from our first system (10,11,12) are not able to connect to new Leader (20), and it seems they are trying to connect to old Leader (21) (assuming this is the case since they are all using port 4512 which corresponds to Server 21). This log can be seen on all servers where we have 3 ZooKeeper servers deployed (10,11,12):
{panel}
Jun 27 14:49:38 [myid: 12]: WARN Unexpected exception, tries=0, connecting to /192.168.1.116:4512
 Jun 27 14:49:38 java.net.SocketTimeoutException: connect timed out
 Jun 27 14:49:38 at java.net.PlainSocketImpl.socketConnect(Native Method)
 Jun 27 14:49:38 at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
 Jun 27 14:49:38 at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
 Jun 27 14:49:38 at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
 Jun 27 14:49:38 at java.net.SocksSocketImpl.connect(Unknown Source)
 Jun 27 14:49:38 at java.net.Socket.connect(Unknown Source)
 Jun 27 14:49:38 at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:231)
 Jun 27 14:49:38 at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
 Jun 27 14:49:38 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
{panel}
*Is my understanding correct, or this port is not indicating they are trying to connect to the wrong server?*

 

Even if the Leader restart provoked the restart of all its followers, seems that in this step all other servers should start to connect to the new leader and form a quorum of followers.
 Instead of that scenario, a few seconds later, timeout occurs while waiting for epoch for quorum (the followers never start following although they all received notifications, and they try to connect to old leader) and the 'new' Leader shuts down again:
{panel}
Jun 27 14:49:39 [myid: 20]: WARN Unexpected exception
 *Jun 27 14:49:39 java.lang.InterruptedException: Timeout while waiting for epoch from quorum*
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:896)
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:389)
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:950)
 Jun 27 14:49:39 [myid: 20]: INFO Shutting down
 Jun 27 14:49:39 [myid: 20]: INFO Shutdown called
 Jun 27 14:49:39 java.lang.Exception: shutdown Leader! reason: Forcing shutdown
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:517)
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:956)
 Jun 27 14:49:39 [myid: 20]: INFO exception while shutting down acceptor: java.net.SocketException: Socket closed
 Jun 27 14:49:39 [myid: 20]: INFO LOOKING
 Jun 27 14:49:39 [myid: 20]: INFO New election. My id = 20, proposed zxid=0x66000012c7
 Jun 27 14:49:39 [myid: 20]: ERROR Unexpected exception causing shutdown
 Jun 27 14:49:39 java.InterruptedException
 Jun 27 14:49:39 at java.lang.Object.wait(Native Method)
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:892)
 Jun 27 14:49:39 at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358)
 Jun 27 14:49:39 [myid: 20]: INFO Notification: 1 (message format version), 20 (n.leader), 0x66000012c7 (n.zxid), 0x1a (n.round), LOOKING (n.state), 20 (n.sid), 0x66 (n.peerEpoch) LOOKING (my state)
 Jun 27 14:49:39 [myid: 20]: WARN ******* GOODBYE /10.22.0.2:55268 ********
{panel}
 

After 3 unsuccessfull retries from servers 10,11,12, since the quorum can not be achieved, connection times out and followers started to shut down again, After they are up, another election is triggered and new LEADER is now located on the first node (Server that becomes a new leader is 12):
{panel}
Jun 27 14:50:07 [myid: 12]: ERROR Unexpected exception
 Jun 27 14:50:07 java.net.SocketTimeoutException: connect timed out
 Jun 27 14:50:07 at java.net.PlainSocketImpl.socketConnect(Native Method)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
 Jun 27 14:50:07 at java.net.SocksSocketImpl.connect(Unknown Source)
 Jun 27 14:50:07 at java.net.Socket.connect(Unknown Source)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:231)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
 Jun 27 14:50:07 [myid: 12]: WARN Exception when following the leader
 Jun 27 14:50:07 java.net.SocketTimeoutException: connect timed out
 Jun 27 14:50:07 at java.net.PlainSocketImpl.socketConnect(Native Method)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
 Jun 27 14:50:07 at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
 Jun 27 14:50:07 at java.net.SocksSocketImpl.connect(Unknown Source)
 Jun 27 14:50:07 at java.net.Socket.connect(Unknown Source)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:231)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
 Jun 27 14:50:07 [myid: 12]: INFO shutdown called
 Jun 27 14:50:07 java.lang.Exception: shutdown Follower
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
 Jun 27 14:50:07 at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
 Jun 27 14:50:07 [myid: 12]: INFO Shutting down
 Jun 27 14:50:07 [myid: 12]: INFO LOOKING
 Jun 27 14:50:07 [myid: 12]: INFO New election. My id = 12, proposed zxid=0x66000012c7
 Jun 27 14:50:08 [myid: 12]: INFO LEADING
{panel}
 

After this, all other Zookeeper servers normally start to follow the new leader and everything starts to work just fine.

 

Could you please help me and answer the following questions:
 - is it expected behavior that Leader shutdowns all other servers (followers) during after its own restart?
 -> if this is expected, could you please explain in which situations we can expect this behavior and why?
 - if there was a notification sent about the new leader (20) to all other servers, why they were still connecting to old leader?
 - do you have any recommendations on how to 'fix' this behavior?

Any help will be highly appreciated.
 Thanks in advance!

Kind regards,
 Lara



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)