You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Patrick Kleindienst (JIRA)" <ji...@apache.org> on 2017/04/25 10:19:04 UTC

[jira] [Created] (ZOOKEEPER-2766) Leader dies with java.io.EOFException

Patrick Kleindienst created ZOOKEEPER-2766:
----------------------------------------------

             Summary: Leader dies with java.io.EOFException
                 Key: ZOOKEEPER-2766
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, quorum
    Affects Versions: 3.5.3
         Environment: CentOS-7, Docker version 17.03.1-ce
            Reporter: Patrick Kleindienst


When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing the following behavior:
Two nodes (let's say node 2 and 3) out of the three start their own quorum, and finally one of them is elected the new leader (node 3), while the other one becomes the follower (node 2). Node 1 seems to be able to establish a connection to node 3 (elected leader) in my case, but this seems to fail for node 2. 
Node 1 shows the following in its logs:

 2017-04-25 09:24:02,806 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055] - LOOKING
2017-04-25 09:24:02,808 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894] - New election. My id =  1, proposed zxid=0x0
2017-04-25 09:24:02,811 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (
n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:24:02,817 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
        at java.lang.Thread.run(Thread.java:745)
2017-04-25 09:24:02,822 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, so dropping the connection: (3, 1)
2017-04-25 09:24:03,025 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888

However, that's not all, since the quorum consisting of node 2 and 3 does not work properly. The nodes' logs tell that leader election between these two works fine.
Here's what node 3 (leader) says:

 2017-04-25 09:09:33,842 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:33,844 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:33,851 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:34,051 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
2017-04-25 09:09:34,052 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143] - LEADING
2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] - TCP NoDelay set to: true
2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] - zookeeper.leader.maxConcurrentSnapshots = 10
2017-04-25 09:09:34,056 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] - zookeeper.leader.maxConcurrentSnapshotTimeout = 5


And here's the output node 2 (follower) provides:

2017-04-25 09:09:31,875 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (
n.peerEPoch), LOOKING (my state)0 (n.config version)
2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
ica.2,name2=LeaderElection]
2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131] - FOLLOWING
2017-04-25 09:09:32,082 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] - TCP NoDelay set to: true

So far, so good. But seconds later the connection between node 2 and 3 seems to get lost, causing leader node 3 to report an EOFExeption. If I understand the logs correctly, node 2 (follower) properly closes the connection (sending "Goodbye"), whilst node 3 says that the socket is still open. 

2017-04-25 09:09:34,190 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] - LEADING - LEADER ELECTION TOOK - 138 MS
2017-04-25 09:09:34,197 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320] - Snapshotting: 0x0 to /data/version-2/snapshot.0
2017-04-25 09:09:35,076 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
2017-04-25 09:09:35,113 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 peerLastZxid=0x
0
2017-04-25 09:09:35,114 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF zxid=0x0 for peer sid: 2
2017-04-25 09:09:35,133 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258] - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and setting last processed zxid: 0x100000000
2017-04-25 09:09:35,169 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255] - Configuring CommitProcessor with 2 worker threads.
2017-04-25 09:09:35,179 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64] - Using checkIntervalMs=60000 maxPerMinute=10000
2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, my id = 3, error = 
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread  id 2 my id = 3
2017-04-25 09:09:35,204 [myid:3] - ERROR [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected exception causing shutdown while sock still open
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
2017-04-25 09:09:35,204 [myid:3] - WARN  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE /9.152.171.98:51328 ********
2017-04-25 09:09:37,181 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] - Shutting down
2017-04-25 09:09:37,182 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ [3] ]
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
        at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)

Unfortunately, node 2 does not provide any additional information on what exactly is going on. After leader election, the only thing it reports is this:

2017-04-25 09:09:32,091 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68] - FOLLOWING - LEADER ELECTION TOOK - 13 MS
2017-04-25 09:09:32,094 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] - Unexpected exception, tries=0, remaining init limit=9999, connecting to /9.152.171.12:288
8
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
        at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
2017-04-25 09:09:33,142 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369] - Getting a diff from the leader 0x0
2017-04-25 09:09:33,146 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516] - Learner received NEWLEADER message
2017-04-25 09:09:33,207 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499] - Learner received UPTODATE message
2017-04-25 09:09:33,220 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446] - Restarting Leader Election
2017-04-25 09:09:33,221 [myid:2] - INFO  [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
2017-04-25 09:09:33,222 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
2017-04-25 09:09:33,222 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, my id = 2, error = 
java.net.SocketException: Socket closed
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.net.SocketInputStream.read(SocketInputStream.java:224)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)

As far as I get that, node 2 wants to start a new leader election, but fails to establish a connection to the other nodes. It tries over and over, finally ending up in a timeout. Unfortunately, this doesn't give me any hint on what exactly breaks up the connection between the follower node (node 2) and the leader node (node 3) and why it can be re-established. 

It might also be relevant that I'm running ZooKeeper in Docker containers, using the host network option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)