Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/05/25 14:13:03 UTC

[GitHub] [bookkeeper] GBM-tamerm opened a new issue, #3292: Bookkeeper shutdown when we stop ZK leader node

GBM-tamerm opened a new issue, #3292:
URL: https://github.com/apache/bookkeeper/issues/3292

   **BUG REPORT**
   
   ***Describe the bug***
   
   When we stop the ZK leader node, a new election starts and the ZK clients get disconnected. Any bookie node with auto-recovery running in the background then shuts down with the exception below:
   2022-05-24T02:13:33,263-0400 [AuditorElector-10.119.33.232:3181] ERROR org.apache.bookkeeper.replication.AuditorElector - Exception while performing auditor election
   java.io.IOException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/auditorelection/V_0000000079
                   at org.apache.bookkeeper.meta.ZkLedgerAuditorManager.createMyVote(ZkLedgerAuditorManager.java:204) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
                   at org.apache.bookkeeper.meta.ZkLedgerAuditorManager.tryToBecomeAuditor(ZkLedgerAuditorManager.java:98) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
                   at org.apache.bookkeeper.replication.AuditorElector$3.run(AuditorElector.java:184) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
   
   2022-05-24T02:13:33,362-0400 [AutoRecoveryDeathWatcher-3181] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop!
   2022-05-24T02:13:33,363-0400 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
   java.lang.RuntimeException: AutoRecovery is not running any more
                   at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
   2022-05-24T02:13:33,364-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
   2022-05-24T02:13:34,072-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
   2022-05-24T02:13:34,072-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
   2022-05-24T02:13:34,073-0400 [ReplicationWorker] INFO  org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
   2022-05-24T02:13:34,237-0400 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x500000042f40000
   2022-05-24T02:13:34,238-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.proto.BookieServer - Shutting down BookieServer
   2022-05-24T02:13:34,238-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.proto.BookieNettyServer - Shutting down BookieNettyServer
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Stop the ZK leader node
   2. Stop one BK node (e.g. bookie1) to trigger auto-recovery
   3. The other running BKs that have auto-recovery enabled shut down with the error above
   
   
   ***Expected behavior***
   
   The other running BKs should not shut down.
   
   ***Additional context***
   OS: Ubuntu 18.04
   Java 8
   Pulsar running as systemd service
   6 brokers
   6 bookies
   5 ZK.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137737191

   Thanks merlimat,
   I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
   and the issue is still happening. It looks like auto-recovery is still trying to run:
   use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
   java.lang.RuntimeException: AutoRecovery is not running any more
           at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
   2022-05-25T14:47:53,921-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
   2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
   2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
   2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.meta.ZkLedgerAuditorManager - Shutting down AuditorElector
   




[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137918848

   > But it is causing an issue, as shown in the exception trace above.
   > Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
   > What is the solution?
   
   @GBM-tamerm On the bookies you need to disable auto-recovery by setting this in `bookkeeper.conf`:
   ```
   autoRecoveryDaemonEnabled=false
   ```
   
   Then you can run auto-recovery as a separate stateless service: 
   ```
   bin/bookkeeper autorecovery
   ```
   




[GitHub] [bookkeeper] leizhiyuan commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
leizhiyuan commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1161080803

   The auto-recovery component affects the bookie-server: if the ZK leader goes down, auto-recovery throws a connection-loss exception and the shutdown hook is then executed. Auto-recovery does not handle connection loss correctly.
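
A minimal Java sketch of what handling connection loss could look like, as opposed to letting the exception reach the shutdown hook. It is not BookKeeper's code: `ConnectionLossRetry`, `TransientConnectionLoss`, and `callWithRetry` are hypothetical names, and the exception class is only a stand-in for ZooKeeper's `ConnectionLossException` so that the example has no dependencies. It retries with a doubling delay and rethrows only after the last attempt.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry an operation that fails with a transient connection loss. */
class ConnectionLossRetry {

    /** Stand-in for ZooKeeper's ConnectionLossException (assumption, keeps the sketch dependency-free). */
    static class TransientConnectionLoss extends Exception {
        TransientConnectionLoss(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying up to maxAttempts times when it reports a connection loss.
     * The delay between attempts doubles after every failure.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientConnectionLoss e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: let the caller decide what to do
                }
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: fails twice with a connection loss, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TransientConnectionLoss("ConnectionLoss");
            }
            return "vote-created";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real fix would also have to account for the fact that a ZooKeeper create may have succeeded on the server even though the client saw a connection loss, so blindly retrying the creation of a sequential vote node could leave a duplicate behind.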
   




[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137894644

   The same issue was reported in the BK community:
   https://github.com/apache/bookkeeper/issues/3094
   Any help is highly appreciated, thanks.




[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137930267

   > > But it is causing an issue, as shown in the exception trace above.
   > > Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
   > > What is the solution?
   > 
   > @GBM-tamerm On the bookies you need to disable auto-recovery by setting this in `bookkeeper.conf`:
   > 
   > ```
   > autoRecoveryDaemonEnabled=false
   > ```
   > 
   > Then you can run auto-recovery as a separate stateless service:
   > 
   > ```
   > bin/bookkeeper autorecovery
   > ```
   
   I tried that now, but autorecovery fails with the exception below:
   2022-05-25T19:02:14,298-0400 [main] ERROR org.apache.bookkeeper.common.component.AbstractLifecycleComponent - Calling uncaughtExceptionHandler
   2022-05-25T19:02:14,299-0400 [main] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: autorecovery-server because of Exception in Thread: Thread[main,5,main]
   java.lang.RuntimeException: java.io.IOException: Failed to bind to /0.0.0.0:8000
           at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:114) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.server.service.StatsProviderService.doStart(StatsProviderService.java:51) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:83) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$4(LifecycleComponentStack.java:144) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) [com.google.guava-guava-30.1-jre.jar:?]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:144) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:85) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.replication.AutoRecoveryMain.doMain(AutoRecoveryMain.java:334) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.replication.AutoRecoveryMain.main(AutoRecoveryMain.java:308) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
   Caused by: java.io.IOException: Failed to bind to /0.0.0.0:8000
           at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.Server.doStart(Server.java:401) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:111) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
           ... 8 more
   Caused by: java.net.BindException: Address already in use
           at sun.nio.ch.Net.bind0(Native Method) ~[?:1.8.0_332]
           at sun.nio.ch.Net.bind(Net.java:461) ~[?:1.8.0_332]
           at sun.nio.ch.Net.bind(Net.java:453) ~[?:1.8.0_332]
           at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) ~[?:1.8.0_332]
           at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) ~[?:1.8.0_332]
           at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.server.Server.doStart(Server.java:401) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
           at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:111) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
           ... 8 more
   2022-05-25T19:02:14,299-0400 [component-shutdown-thread] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Closing component autorecovery-server in shutdown hook.
   2022-05-25T19:02:14,301-0400 [main] INFO  org.apache.bookkeeper.common.component.ComponentStarter - Started component autorecovery-server.
   2022-05-25T19:02:14,301-0400 [component-shutdown-thread] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Failed to close component autorecovery-server in shutdown hook gracefully, Exiting anyway
   java.lang.IllegalStateException: Can't move to closed before moving to stopped mode
           at org.apache.bookkeeper.common.component.Lifecycle.moveToClosed(Lifecycle.java:185) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:121) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$close$6(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) ~[com.google.guava-guava-30.1-jre.jar:?]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.close(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.ComponentStarter$ComponentShutdownHook.run(ComponentStarter.java:47) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
   2022-05-25T19:02:14,303-0400 [main] ERROR org.apache.bookkeeper.replication.AutoRecoveryMain - Error in bookie shutdown
   java.lang.IllegalStateException: Can't move to closed before moving to stopped mode
           at org.apache.bookkeeper.common.component.Lifecycle.moveToClosed(Lifecycle.java:185) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:121) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$close$6(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) ~[com.google.guava-guava-30.1-jre.jar:?]
           at org.apache.bookkeeper.common.component.LifecycleComponentStack.close(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at org.apache.bookkeeper.common.component.ComponentStarter$ComponentShutdownHook.run(ComponentStarter.java:47) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
           at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_332]
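
The root cause in the trace above is `java.net.BindException: Address already in use` for port 8000. One plausible explanation, stated here as an assumption, is that the bookie on the same host already serves its Prometheus metrics on that port (in BookKeeper the port is believed to be controlled by `prometheusStatsHttpPort`, which should be checked against the configuration reference for the version in use). A small Java sketch for checking whether a port can be bound before starting the standalone auto-recovery service follows; `PortProbe` and `canBind` are hypothetical names.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper: checks whether a TCP port can be bound on this host right now. */
class PortProbe {

    /** Returns true if a listening socket could be opened on the port, false otherwise. */
    static boolean canBind(int port) {
        // try-with-resources closes the probe socket immediately, releasing the port again.
        try (ServerSocket probe = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Typically "Address already in use", but also e.g. missing permissions.
            return false;
        }
    }

    public static void main(String[] args) {
        // 8000 is the port from the stack trace; pass another port as the first argument.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8000;
        System.out.println("port " + port + (canBind(port) ? " can be bound" : " cannot be bound"));
    }
}
```

If the port turns out to be taken, giving the auto-recovery process its own configuration file with a different metrics port is one way to avoid the clash.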
   




[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137505624

   The restart is caused by the auto-recovery component of the bookies. In general, it is better to run auto-recovery as a separate service (it's completely stateless) rather than as part of the bookies.
   That way the bookies will not restart on ZK session loss.




[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137742428

   > Thanks merlimat,
   > I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
   > and the issue is still happening. It looks like auto-recovery is still trying to run:
   > use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
   
   @GBM-tamerm yes, the auto-recovery process will still restart, though the bookie process won't do that anymore. 
   
   It will not be a problem, since auto-recovery runs in the background and won't cause any disruption to existing clients.




[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2

Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137781814

   > > Thanks merlimat,
   > > I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
   > > and the issue is still happening. It looks like auto-recovery is still trying to run:
   > > use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
   > 
   > @GBM-tamerm yes, the auto-recovery process will still restart, though the bookie process won't do that anymore.
   > 
   > It will not be a problem, since auto-recovery runs in the background and won't cause any disruption to existing clients.
   
   But it is causing an issue, as shown in the exception trace above.
   Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
   What is the solution?

