You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/08/09 09:29:20 UTC

[GitHub] [bookkeeper] suiyuzeng opened a new issue #2760: Bookkeeper shutdown when zookeeper connection just timeout

suiyuzeng opened a new issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760


   ***Describe the bug***
   Bookkeeper begin to shutdown after a zookeeper connection timeout. Session is not expired and it was attempting reconnect.
   In the log, there is a ConnectionLossException when a node was deleted in org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager#releaseUnderreplicatedLedger. It can not be processed in org.apache.bookkeeper.replication.ReplicationWorker#deferLedgerLockRelease and shutdown was called.
   `
   public void releaseUnderreplicatedLedger(long ledgerId) throws ReplicationException.UnavailableException {
           if (LOG.isDebugEnabled()) {
               LOG.debug("releaseLedger(ledgerId={})", ledgerId);
           }
           try {
               Lock l = heldLocks.get(ledgerId);
               if (l != null) {
                   zkc.delete(l.getLockZNode(), -1);
               }
           } catch (KeeperException.NoNodeException nne) {
               // this is ok
           } catch (KeeperException ke) {
               LOG.error("Error deleting underreplicated ledger lock", ke);
               throw new ReplicationException.UnavailableException("Error contacting zookeeper", ke);
           } catch (InterruptedException ie) {
               Thread.currentThread().interrupt();
               throw new ReplicationException.UnavailableException("Interrupted while connecting zookeeper", ie);
           }
           heldLocks.remove(ledgerId);
       }
   `
   It shoul be retried as syncCallWithRetries was called. One cause is that the maxRetries of the retry policy was setted to be 0. The zk client is initialized in org.apache.bookkeeper.meta.zk.ZKMetadataDriverBase#initialize. And the retrying policy is created in org.apache.bookkeeper.meta.zk.ZKMetadataClientDriver#initialize.
   `
   public synchronized MetadataClientDriver initialize(ClientConfiguration conf,
                                                           ScheduledExecutorService scheduler,
                                                           StatsLogger statsLogger,
   
                                                           Optional<Object> optionalCtx)
               throws MetadataException {
           super.initialize(
               conf,
               statsLogger,
               new BoundExponentialBackoffRetryPolicy(
                   conf.getZkTimeout(),
                   conf.getZkTimeout(),
                   0),
               optionalCtx);
           this.statsLogger = statsLogger;
           this.clientConf = conf;
           this.scheduler = scheduler;
           this.bookieAddressTracking = conf.getEnableBookieAddressTracking();
           return this;
       }
   `
   
   The maxRetries is setted 0. Can we set it to be Integer.MAX_VALUE? 
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Run bookkeeper
   2. release ledger lock when zk timeout
   
   
   LOG:
   06:18:45.732 [main-SendThread(10.168.114.200:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x37b0a37639d001d for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x37b0a37639d001d, likely server has closed socket
       at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
   
   
   06:18:45.757 [main-SendThread(10.168.114.200:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x47b0a3221890013 for sever xxxx/x.x.x.x:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x47b0a3221890013, likely server has closed socket
       at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
   06:18:45.859 [PendingReplicationTimer] ERROR org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Error deleting underreplicated ledger lock
   org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
       at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
   06:18:45.867 [ReplicationWorker] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:604) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/ledgers/0000/0000/0010/7cc4/urL0001080516
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2256) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:833) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:827) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2281) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:854) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:848) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:519) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicateFromHierarchy(ZkLedgerUnderreplicationManager.java:556) ~[org.apache.bookkeeper-bookkeeper-server-4. 14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:597) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       ... 4 more
   06:18:45.867 [PendingReplicationTimer] ERROR org.apache.bookkeeper.replication.ReplicationWorker - UnavailableException while replicating fragments of ledger 826574
   org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:635) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       at org.apache.bookkeeper.replication.ReplicationWorker$4.run(ReplicationWorker.java:600) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at java.util.TimerThread.mainLoop(Timer.java:555) ~[?:1.8.0_131]
       at java.util.TimerThread.run(Timer.java:505) ~[?:1.8.0_131]
   Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /iot-bk/ledgers/underreplication/locks/urL0000826574
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:773) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient$11.call(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.zookeeper.ZooKeeperClient.delete(ZooKeeperClient.java:767) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
       at org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.releaseUnderreplicatedLedger(ZkLedgerUnderreplicationManager.java:629) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.   jar:4.14.1]
       ... 3 more
   06:18:45.867 [PendingReplicationTimer] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
   06:18:45.867 [PendingReplicationTimer] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
   06:18:45.867 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED    event.
   06:18:45.874 [ReplicationWorker] INFO  org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
   06:18:46.516 [AutoRecoveryDeathWatcher-3181] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting    the watch loop!
   06:18:46.516 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in  Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
   java.lang.RuntimeException: AutoRecovery is not running any more
       at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1]
   06:18:46.528 [component-shutdown-thread] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
   06:18:46.530 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
   06:18:46.530 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
   06:18:46.532 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.Auditor - Shutting down auditor
   06:18:46.532 [AuditorElector-10.90.126.13:3181] INFO  org.apache.bookkeeper.replication.AuditorElector - Shutting down AuditorElector
   06:18:46.532 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] StevenLuMT commented on issue #2760: Bookkeeper shutdown when zookeeper connection just timeout

Posted by GitBox <gi...@apache.org>.
StevenLuMT commented on issue #2760:
URL: https://github.com/apache/bookkeeper/issues/2760#issuecomment-1055388556


   bugfix in pr #2958 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org