Posted to issues@zookeeper.apache.org by "Young Xu (Jira)" <ji...@apache.org> on 2022/10/19 04:23:00 UTC
[jira] [Created] (ZOOKEEPER-4624) Zookeeper service cannot be restarted because the IO Inject filesystem fd is used up.
Young Xu created ZOOKEEPER-4624:
-----------------------------------
Summary: Zookeeper service cannot be restarted because the IO Inject filesystem fd is used up.
Key: ZOOKEEPER-4624
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4624
Project: ZooKeeper
Issue Type: Bug
Environment: environment: K8S
deployment: statefulset replicas 3
zookeeper version: 3.8.0
Reporter: Young Xu
We are running a chaos test with the following scenario:
ZooKeeper pods are deployed on three nodes. We use IO injection to use up the file descriptors on one node (testing one pod), so that all filesystem operations return "Too many files". After a period of time, the ZooKeeper service on that node stops running. Then we stop the injection. When I manually start the process again, ZooKeeper reports an error.
{code:java}
2022-10-19 02:03:07,876 [myid:3] - INFO [main:o.a.z.s.q.QuorumPeer@2549] - QuorumPeer communication is not secured! (SASL auth disabled)
2022-10-19 02:03:07,876 [myid:3] - INFO [main:o.a.z.s.q.QuorumPeer@2574] - quorum.cnxn.threads.size set to 20
2022-10-19 02:03:07,877 [myid:3] - INFO [main:o.a.z.s.p.FileSnap@85] - Reading snapshot /home/edge/middleware/zookeeper/data/data/version-2/snapshot.1409ce9ac7
2022-10-19 02:03:07,883 [myid:3] - INFO [main:o.a.z.s.DataTree@1705] - The digest in the snapshot has digest version of 2, with zxid as 0x1409ce9acc, and digest value as 81604125765
2022-10-19 02:03:11,662 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeer@1200] - Unable to load database on disk
java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2022-10-19 02:03:11,663 [myid:3] - INFO [main:o.a.z.m.p.PrometheusMetricsProvider@570] - Shutdown executor service with timeout 1000
2022-10-19 02:03:11,739 [myid:3] - INFO [main:o.e.j.s.AbstractConnector@383] - Stopped ServerConnector@5b03b9fe{HTTP/1.1, (http/1.1)}{zookeeper-default-2.zookeeper.default.svc.cluster.local:8080}
2022-10-19 02:03:11,742 [myid:3] - INFO [main:o.e.j.s.h.ContextHandler@1159] - Stopped o.e.j.s.ServletContextHandler@17bffc17{/,null,STOPPED}
2022-10-19 02:03:11,746 [myid:3] - ERROR [main:o.a.z.s.q.QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1201)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1132)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
Caused by: java.io.EOFException: null
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
    at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:67)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:707)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:725)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:693)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:774)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1146)
    ... 4 common frames omitted
2022-10-19 02:03:11,747 [myid:3] - INFO [main:o.a.z.a.ZKAuditProvider@42] - ZooKeeper audit is disabled.
2022-10-19 02:03:11,749 [myid:3] - ERROR [main:o.a.z.u.ServiceUtils@48] - Exiting JVM with code 1
{code}
I know that deleting the data directory fixes this and gets the service up and running again, but I don't know why the file is corrupted.
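The EOFException in the trace is thrown in FileHeader.deserialize while FileTxnIterator is moving to the next transaction log, which suggests that one of the log.* files ends before its header is complete. A minimal sketch for locating such a file is given here, assuming the header is a 4-byte magic ("ZKLG"), a 4-byte version, and an 8-byte dbid. TxnLogHeaderCheck and its Status enum are hypothetical names for illustration and are not part of ZooKeeper. The sketch only reads files and does not explain how the file came to be truncated.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ZooKeeper: reports txn log files whose
// header cannot be read in full.
public class TxnLogHeaderCheck {
    // "ZKLG" interpreted as a big-endian int (assumed txn log magic).
    static final int TXNLOG_MAGIC = 0x5A4B4C47;

    enum Status { OK, TRUNCATED_HEADER, BAD_MAGIC }

    // Reads the assumed 16-byte header: magic (int), version (int), dbid (long).
    static Status inspect(Path logFile) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(logFile))) {
            int magic = in.readInt();
            in.readInt();   // version, value not checked here
            in.readLong();  // dbid, value not checked here
            return magic == TXNLOG_MAGIC ? Status.OK : Status.BAD_MAGIC;
        } catch (EOFException e) {
            // Same condition as in the reported trace: file ends inside the header.
            return Status.TRUNCATED_HEADER;
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory holding the log.* files, e.g. <dataDir>/version-2.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(dir, "log.*")) {
            for (Path log : logs) {
                System.out.println(inspect(log) + " " + Files.size(log) + " bytes " + log);
            }
        }
    }
}
```

Running it against the version-2 directory from the log above (the path is passed as the first argument) prints one line per log.* file. A file reported as TRUNCATED_HEADER would be a candidate for the one that fails to load.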