You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Patrick Kleindienst (JIRA)" <ji...@apache.org> on 2017/04/25 10:20:04 UTC

[jira] [Updated] (ZOOKEEPER-2766) Quorum fails with java.io.EOFException

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Kleindienst updated ZOOKEEPER-2766:
-------------------------------------------
    Summary: Quorum fails with java.io.EOFException  (was: Leader dies with java.io.EOFException)

> Quorum fails with java.io.EOFException
> --------------------------------------
>
>                 Key: ZOOKEEPER-2766
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.5.3
>         Environment: CentOS-7, Docker version 17.03.1-ce
>            Reporter: Patrick Kleindienst
>
> When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing the following behavior:
> Two nodes (let's say node 2 and 3) out of the three start their own quorum, and finally one of them is elected the new leader (node 3), while the other one becomes the follower (node 2). Node 1 seems to be able to establish a connection to node 3 (elected leader) in my case, but this seems to fail for node 2. 
> Node 1 shows the following in its logs:
>  2017-04-25 09:24:02,806 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055] - LOOKING
> 2017-04-25 09:24:02,808 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894] - New election. My id =  1, proposed zxid=0x0
> 2017-04-25 09:24:02,811 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:24:02,817 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-04-25 09:24:02,822 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-04-25 09:24:03,025 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888
> However, that's not all, since the quorum consisting of node 2 and 3 does not work properly. The nodes' logs tell that leader election between these two works fine.
> Here's what node 3 (leader) says:
>  2017-04-25 09:09:33,842 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:33,844 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:33,851 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:34,051 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
> 2017-04-25 09:09:34,052 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143] - LEADING
> 2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] - TCP NoDelay set to: true
> 2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] - zookeeper.leader.maxConcurrentSnapshots = 10
> 2017-04-25 09:09:34,056 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
> And here's the output node 2 (follower) provides:
> 2017-04-25 09:09:31,875 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
> ica.2,name2=LeaderElection]
> 2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131] - FOLLOWING
> 2017-04-25 09:09:32,082 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] - TCP NoDelay set to: true
> So far, so good. But seconds later the connection between node 2 and 3 seems to get lost, causing leader node 3 to report an EOFExeption. If I understand the logs correctly, node 2 (follower) properly closes the connection (sending "Goodbye"), whilst node 3 says that the socket is still open. 
> 2017-04-25 09:09:34,190 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] - LEADING - LEADER ELECTION TOOK - 138 MS
> 2017-04-25 09:09:34,197 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320] - Snapshotting: 0x0 to /data/version-2/snapshot.0
> 2017-04-25 09:09:35,076 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
> 2017-04-25 09:09:35,113 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 peerLastZxid=0x
> 0
> 2017-04-25 09:09:35,114 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF zxid=0x0 for peer sid: 2
> 2017-04-25 09:09:35,133 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258] - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and setting last processed zxid: 0x100000000
> 2017-04-25 09:09:35,169 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255] - Configuring CommitProcessor with 2 worker threads.
> 2017-04-25 09:09:35,179 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64] - Using checkIntervalMs=60000 maxPerMinute=10000
> 2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, my id = 3, error = 
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> 2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
> 2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread  id 2 my id = 3
> 2017-04-25 09:09:35,204 [myid:3] - ERROR [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected exception causing shutdown while sock still open
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> 2017-04-25 09:09:35,204 [myid:3] - WARN  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE /9.152.171.98:51328 ********
> 2017-04-25 09:09:37,181 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] - Shutting down
> 2017-04-25 09:09:37,182 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ [3] ]
>         at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)
> Unfortunately, node 2 does not provide any additional information on what exactly is going on. After leader election, the only thing it reports is this:
> 2017-04-25 09:09:32,091 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68] - FOLLOWING - LEADER ELECTION TOOK - 13 MS
> 2017-04-25 09:09:32,094 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] - Unexpected exception, tries=0, remaining init limit=9999, connecting to /9.152.171.12:288
> 8
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> 2017-04-25 09:09:33,142 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369] - Getting a diff from the leader 0x0
> 2017-04-25 09:09:33,146 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516] - Learner received NEWLEADER message
> 2017-04-25 09:09:33,207 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499] - Learner received UPTODATE message
> 2017-04-25 09:09:33,220 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446] - Restarting Leader Election
> 2017-04-25 09:09:33,221 [myid:2] - INFO  [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
> 2017-04-25 09:09:33,222 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:33,222 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, my id = 2, error = 
> java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.net.SocketInputStream.read(SocketInputStream.java:224)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> As far as I get that, node 2 wants to start a new leader election, but fails to establish a connection to the other nodes. It tries over and over, finally ending up in a timeout. Unfortunately, this doesn't give me any hint on what exactly breaks up the connection between the follower node (node 2) and the leader node (node 3) and why it can be re-established. 
> It might also be relevant that I'm running ZooKeeper in Docker containers, using the host network option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)