You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/08/09 09:29:20 UTC
[GitHub] [bookkeeper] suiyuzeng opened a new issue #2760: Bookkeeper shutdown when zookeeper connection just timeout
suiyuzeng opened a new issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760
***Describe the bug***
Bookkeeper begin to shutdown after a zookeeper connection timeout. Session is not expired and it was attempting reconnect.
In the log, there is a ConnectionLossException when a node was deleted in org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger. It can not be processed in org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and shutdown was called.
`
public void releaseUnderreplicatedLedger(long ledgerId) throws ReplicationException.UnavailableException {
if (LOG.isDebugEnabled()) {
LOG.debug("releaseLedger(ledgerId={})", ledgerId);
}
try {
Lock l = heldLocks.get(ledgerId);
if (l != null) {
zkc.delete(l.getLockZNode(), -1);
}
} catch (KeeperException.NoNodeException nne) {
// this is ok
} catch (KeeperException ke) {
LOG.error("Error deleting underreplicated ledger lock", ke);
throw new ReplicationException.UnavailableException("Error contacting zookeeper", ke);
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
throw new ReplicationException.UnavailableException("Interrupted while connecting zookeeper", ie);
}
heldLocks.remove(ledgerId);
}
`
It shoul be retried as syncCallWithRetries was called. One cause is that the maxRetries of the retry policy was setted to be 0. The zk client is initialized in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the retrying policy is created in org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
`
public synchronized MetadataClientDriver initialize(ClientConfiguration conf,
ScheduledExecutorService scheduler,
StatsLogger statsLogger,
Optional<Object> optionalCtx)
throws MetadataException {
super.initialize(
conf,
statsLogger,
new BoundExponentialBackoffRetryPolicy(
conf.getZkTimeout(),
conf.getZkTimeout(),
0),
optionalCtx);
this.statsLogger = statsLogger;
this.clientConf = conf;
this.scheduler = scheduler;
this.bookieAddressTracking = conf.getEnableBookieAddressTracking();
return this;
}
`
The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE?
***To Reproduce***
Steps to reproduce the behavior:
1. Run bookkeeper
2. release ledger lock when zk timeout
LOG:
06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x37b0a37639d001d, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x47b0a3221890013, likely server has closed socket
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
06:18:45.859 [PendingReplicationTimer] ERROR org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting underreplicated ledger lock
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
06:18:45.867 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
... 4 more
06:18:45.867 [PendingReplicationTimer] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments of ledger 826574
org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1. jar:4.14.1]
... 3 more
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
06:18:45.867 [PendingReplicationTimer] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
06:18:45.867 [main-EventThread] INFO org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
06:18:45.874 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop!
06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
06:18:46.528 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
06:18:46.530 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.Auditor - Shutting down auditor
06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
06:18:46.532 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] StevenLuMT commented on issue #2760: Bookkeeper shutdown when zookeeper connection just timeout
Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760#issuecomment-1055388556
bugfix in pr #2958
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org