Posted to issues@zookeeper.apache.org by "Diego Lucas Jiménez (Jira)" <ji...@apache.org> on 2020/10/16 14:54:00 UTC
[jira] [Created] (ZOOKEEPER-3975) Zookeeper crashes: Unable to load
database on disk java.io.IOException: Unreasonable length
Diego Lucas Jiménez created ZOOKEEPER-3975:
----------------------------------------------
Summary: Zookeeper crashes: Unable to load database on disk java.io.IOException: Unreasonable length
Key: ZOOKEEPER-3975
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3975
Project: ZooKeeper
Issue Type: Bug
Components: jute
Affects Versions: 3.6.2
Environment: Debian 10 x64
openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Debian-1deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Debian-1deb10u1, mixed mode, sharing)
Reporter: Diego Lucas Jiménez
After running for a while, the entire cluster (3 ZooKeeper servers) crashes suddenly, with all of them logging:
{code:java}
2020-10-16 10:37:00,459 [myid:2] - WARN [NIOWorkerThread-4:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2020-10-16 10:37:00,475 [myid:2] - ERROR [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1139] - Unable to load database on disk
java.io.IOException: Unreasonable length = 5089607
at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:352)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:258)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:303)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1093)
at org.apache.zookeeper.server.quorum.QuorumPeer.getLastLoggedZxid(QuorumPeer.java:1249)
at org.apache.zookeeper.server.quorum.FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:868)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:941)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1428){code}
The "corrupted" file apparently appears on all the servers, so workarounds such as "removing version-2 on the faulty server and letting it replicate from a healthy one" are not an option :(.
The entire cluster goes down, and we have had downtime every single day since we upgraded from 3.4.9.
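For readers unfamiliar with the error, the following is a minimal, self-contained sketch of how a length-prefixed read with a size limit produces a message like the one in the trace. It is not the ZooKeeper or jute source. The class `LengthCheckSketch` and its helpers are hypothetical, and the limit of 0xfffff bytes is assumed from the documented default of the `jute.maxbuffer` system property; the affected cluster may be configured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class LengthCheckSketch {
    // Assumption: 0xfffff (1048575 bytes) is the documented default of the
    // jute.maxbuffer system property; a real server may use another value.
    static final int ASSUMED_MAX_BUFFER = 0xfffff;

    // Hypothetical helper: builds a record made of a 4-byte big-endian
    // length followed by that many zero bytes of payload.
    static byte[] frame(int payloadLength) {
        ByteBuffer bb = ByteBuffer.allocate(4 + payloadLength);
        bb.putInt(payloadLength);
        return bb.array();
    }

    // Simplified stand-in for a length-checked buffer read: the declared
    // length is validated before any payload bytes are allocated or read.
    static byte[] readBuffer(DataInputStream in, int maxBuffer) throws IOException {
        int len = in.readInt();
        if (len < 0 || len > maxBuffer) {
            throw new IOException("Unreasonable length = " + len);
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }

    // Returns a short status string instead of throwing, for easy comparison.
    static String tryRead(byte[] framed, int maxBuffer) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return "read " + readBuffer(in, maxBuffer).length + " bytes";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A small record passes the check.
        System.out.println(tryRead(frame(16), ASSUMED_MAX_BUFFER));
        // A record as large as the one in the report exceeds the assumed limit.
        System.out.println(tryRead(frame(5089607), ASSUMED_MAX_BUFFER));
    }
}
```

Under these assumptions, a record of 5089607 bytes is rejected simply because it is larger than the limit, whether or not its contents are damaged. The report does not establish whether that is what happened here, or whether raising `jute.maxbuffer` would be an appropriate remedy.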