Posted to dev@zookeeper.apache.org by "Aaron Zimmerman (JIRA)" <ji...@apache.org> on 2014/07/05 00:34:33 UTC

[jira] [Updated] (ZOOKEEPER-1955) EOFException on Reading Snapshot

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Zimmerman updated ZOOKEEPER-1955:
---------------------------------------

    Attachment: snapshot

This is an example of the snapshot file that a node was attempting to read before the EOFException.

> EOFException on Reading Snapshot
> --------------------------------
>
>                 Key: ZOOKEEPER-1955
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1955
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.5
>            Reporter: Aaron Zimmerman
>         Attachments: snapshot
>
>
> We have a 5-node ZooKeeper cluster that has been operating normally for several months.  Starting a few days ago, the entire cluster crashes a few times per day, all nodes at the exact same time.  We can't track down the exact issue, but deleting the snapshots and logs and restarting allows the cluster to come back up.
> We are running Exhibitor to monitor the cluster.
> It appears that something bad gets into the logs, causing an EOFException, and this cascades through the entire cluster:
> 2014-07-04 12:55:26,328 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> 2014-07-04 12:55:26,328 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> Then the server dies, Exhibitor tries to restart each node, and they all get stuck trying to replay the bad transaction, logging messages like:
>  
> 2014-07-04 12:58:52,734 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> And the cluster is dead.  The only way we have found to recover is to delete all of the data and restart.
> [~fournc] Appreciate any assistance you can offer.  
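
For readers following the first stack trace: its top frame is java.io.DataInputStream.readInt, which throws EOFException when the stream ends before four bytes are available. In that trace the stream is the follower's connection to the leader (Learner.readPacket), not a file on disk. The sketch below only reproduces that JDK behavior on an in-memory stream. It is not ZooKeeper code, the class and method names are hypothetical, and it says nothing about why the stream was short in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical demo class; not part of ZooKeeper.
public class TruncatedReadDemo {

    // Returns true if reading one 4-byte int from the given bytes
    // ends in an EOFException, false if the read succeeds.
    static boolean readIntHitsEof(byte[] bytes) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readInt();
            return false;
        } catch (EOFException e) {
            // Fewer than four bytes were left in the stream.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = {0, 0, 0, 1};  // exactly one int
        byte[] truncated = {0, 0, 1};    // one byte short

        System.out.println("complete hits EOF:  " + readIntHitsEof(complete));
        System.out.println("truncated hits EOF: " + readIntHitsEof(truncated));
    }
}
```

The same exception type appears in the FileTxnLog$FileTxnIterator debug lines, where the stream is a transaction log file rather than a socket.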



--
This message was sent by Atlassian JIRA
(v6.2#6252)