You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "David Lao (JIRA)" <ji...@apache.org> on 2015/12/15 00:17:46 UTC

[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056947#comment-15056947 ] 

David Lao commented on ZOOKEEPER-2104:
--------------------------------------

I'm seeing the same issue. Additional observations: when the ensemble gets into this state, the leader and followers appears to be replicating massive amount of data although the curator clients were mostly idle. The on disk foot print for the log and snapshot are 65MB and 500MB respectively and there are plenty of memory and disk space on the servers. 

> Sudden crash of all nodes in the cluster
> ----------------------------------------
>
>                 Key: ZOOKEEPER-2104
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Benjamin Jaton
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN  [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,007 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,115 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)