Posted to dev@zookeeper.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2015/01/13 17:54:35 UTC

[jira] [Resolved] (ZOOKEEPER-2106) Error when reading from leader causes JVM to hang

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans resolved ZOOKEEPER-2106.
--------------------------------------------
    Resolution: Invalid

> Error when reading from leader causes JVM to hang
> -------------------------------------------------
>
>                 Key: ZOOKEEPER-2106
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2106
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.5
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>
> I tried searching JIRA for an existing issue like this; the closest I found was ZOOKEEPER-2104.  It looks very similar, but I am not sure it is really the same thing.  Essentially, we had a 5-node ensemble for a large Storm cluster.  A few of the nodes hit an error like the following at the same time:
> {code}
> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 2, my id = 4, error = 
> java.io.EOFException
>       at java.io.DataInputStream.readInt(DataInputStream.java:392)
>       at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
> WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
> WARN  [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>       at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
>       at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
>       at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
>      at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
> WARN  [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
> WARN  [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following the leader
> java.net.SocketException: Connection reset
>      at java.net.SocketInputStream.read(SocketInputStream.java:189)
>      at java.net.SocketInputStream.read(SocketInputStream.java:121)
>      at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>      at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>      at java.io.DataInputStream.readInt(DataInputStream.java:387)
>      at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>      at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>      at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>      at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>      at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>       at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> {code}
> After that, all of the client connections are shut down:
> {code}
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client ...
> {code}
> but the JVM itself never shuts down:
> {code}
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting down
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - Shutting down
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down
> INFO  [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown of request processor complete
> INFO  [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop!
> WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client /... (no session established for client)
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting down
> INFO  [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited!
> INFO  [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
> {code}
> After that, every connection to that node is accepted and then closed with "ZooKeeperServer not running".  The node seems to stay in this state indefinitely until the process is manually restarted, after which it recovers.
> We have seen this happen on multiple servers at the same time, leaving the entire ensemble unusable.
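> For completeness, here is a minimal probe sketch (assumptions: the standard four-letter "stat" command is reachable on the client port, a serving node prints a "Mode:" line while a node in this stuck state does not, and the class name ZkStatProbe is purely illustrative) that could be used to tell a stuck node apart from a serving one without restarting anything:
> {code}
> import java.io.InputStream;
> import java.io.OutputStream;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> import java.nio.charset.StandardCharsets;
>
> // Hypothetical helper, not part of ZooKeeper: sends the four-letter "stat"
> // command to a server's client port and reports whether it claims to be serving.
> public class ZkStatProbe {
>     public static void main(String[] args) throws Exception {
>         String host = args.length > 0 ? args[0] : "localhost";
>         // Client port: 50512 in the logs above, 2181 by default.
>         int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
>
>         try (Socket sock = new Socket()) {
>             sock.connect(new InetSocketAddress(host, port), 3000);
>             sock.setSoTimeout(3000);
>
>             OutputStream out = sock.getOutputStream();
>             out.write("stat".getBytes(StandardCharsets.US_ASCII));
>             out.flush();
>
>             // The server writes its reply and then closes the connection.
>             InputStream in = sock.getInputStream();
>             StringBuilder reply = new StringBuilder();
>             byte[] buf = new byte[1024];
>             int n;
>             while ((n = in.read(buf)) != -1) {
>                 reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
>             }
>
>             // A serving node includes "Mode: leader|follower|standalone" in its
>             // stat output; a node stuck as described above does not.
>             boolean serving = reply.toString().contains("Mode:");
>             System.out.println(host + ":" + port + " -> " + (serving ? "serving" : "NOT serving"));
>         }
>     }
> }
> {code}
> Run against each server in the ensemble, a stuck node like the one above should show up as NOT serving even though its process is still running and accepting TCP connections.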



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)