Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2022/05/25 14:13:03 UTC
[GitHub] [bookkeeper] GBM-tamerm opened a new issue, #3292: Bookkeeper shutdown when we stop ZK leader node
GBM-tamerm opened a new issue, #3292:
URL: https://github.com/apache/bookkeeper/issues/3292
**BUG REPORT**
***Describe the bug***
When we stop the ZK leader node, a new election starts and the ZK clients get disconnected. Any bookie node with auto-recovery running in the background then shuts down with the exception below:
2022-05-24T02:13:33,263-0400 [AuditorElector-10.119.33.232:3181] ERROR org.apache.bookkeeper.replication.AuditorElector - Exception while performing auditor election
java.io.IOException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/auditorelection/V_0000000079
at org.apache.bookkeeper.meta.ZkLedgerAuditorManager.createMyVote(ZkLedgerAuditorManager.java:204) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.meta.ZkLedgerAuditorManager.tryToBecomeAuditor(ZkLedgerAuditorManager.java:98) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.replication.AuditorElector$3.run(AuditorElector.java:184) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
2022-05-24T02:13:33,362-0400 [AutoRecoveryDeathWatcher-3181] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,exiting the watch loop!
2022-05-24T02:13:33,363-0400 [AutoRecoveryDeathWatcher-3181] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: bookie-server because of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
2022-05-24T02:13:33,364-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
2022-05-24T02:13:34,072-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down replication worker
2022-05-24T02:13:34,072-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.ReplicationWorker - Shutting down ReplicationWorker
2022-05-24T02:13:34,073-0400 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - ReplicationWorker exited loop!
2022-05-24T02:13:34,237-0400 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x500000042f40000
2022-05-24T02:13:34,238-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.proto.BookieServer - Shutting down BookieServer
2022-05-24T02:13:34,238-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.proto.BookieNettyServer - Shutting down BookieNettyServer
***To Reproduce***
Steps to reproduce the behavior:
1. Stop ZK leader node
2. Stop one BK node (e.g. bookie1) to trigger auto-recovery
3. The other running BKs that have auto-recovery enabled shut down with the above error
***Expected behavior***
The other running BKs should not shut down.
***Additional context***
OS: Ubuntu 18.04
Java 8
Pulsar running as systemd service
6 brokers
6 bookies
5 ZK.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137737191
Thanks merlimat,
I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
and the issue is still happening; it looks like auto-recovery is still trying to run:
use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
java.lang.RuntimeException: AutoRecovery is not running any more
at org.apache.bookkeeper.replication.AutoRecoveryMain$AutoRecoveryDeathWatcher.run(AutoRecoveryMain.java:237) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
2022-05-25T14:47:53,921-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component bookie-server in shutdown hook.
2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down auto recovery: 0
2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.replication.AutoRecoveryMain - Shutting down AutoRecovery
2022-05-25T14:47:53,923-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.meta.ZkLedgerAuditorManager - Shutting down AuditorElector
[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137918848
> But it is causing an issue, as shown in the exception trace above.
> Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
> What is the solution?
@GBM-tamerm In bookies you need to disable auto-recovery by setting in `bookkeeper.conf`:
```
autoRecoveryDaemonEnabled=false
```
Then you can run auto-recovery as a separate stateless service:
```
bin/bookkeeper autorecovery
```
[GitHub] [bookkeeper] leizhiyuan commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
leizhiyuan commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1161080803
The auto-recovery component affects the bookie-server: if the ZK leader goes down, auto-recovery throws a connection loss exception and then executes the shutdown hook. Auto-recovery does not handle connection loss correctly.
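
As an illustration of what handling connection loss could look like, here is a minimal sketch of retrying a transient failure with exponential backoff instead of failing on the first error. It is not BookKeeper's code and not the actual fix: `ConnectionLoss`, `retry_on_connection_loss` and `make_flaky_vote` are hypothetical names, and the simulated call only stands in for the vote creation that fails in the trace.

```python
import time


class ConnectionLoss(Exception):
    """Hypothetical stand-in for a transient ZooKeeper connection-loss error."""


def retry_on_connection_loss(operation, max_attempts=5, base_delay=0.1,
                             sleep=time.sleep):
    """Run operation(), retrying with exponential backoff on ConnectionLoss.

    The last ConnectionLoss is re-raised if every attempt fails, so the
    caller can still decide to shut down after a sustained outage.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionLoss:
            if attempt == max_attempts:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_vote(failures):
    """Simulate a call that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def create_vote():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionLoss("connection lost during leader election")
        return "vote-created"

    return create_vote, state
```

With two simulated failures, `retry_on_connection_loss(create_vote)` succeeds on the third call. Only an outage that outlasts all attempts would propagate the error to whatever decides to shut the process down.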
[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137894644
The same issue was reported in the BK community:
https://github.com/apache/bookkeeper/issues/3094
Any help is highly appreciated, thanks.
[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137930267
> > But it is causing an issue, as shown in the exception trace above.
> > Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
> > What is the solution?
>
> @GBM-tamerm In bookies you need to disable auto-recovery by setting in `bookkeeper.conf`:
>
> ```
> autoRecoveryDaemonEnabled=false
> ```
>
> Then you can run auto-recovery as a separate stateless service:
>
> ```
> bin/bookkeeper autorecovery
> ```
I tried that now, but autorecovery is failing with the exception below:
2022-05-25T19:02:14,298-0400 [main] ERROR org.apache.bookkeeper.common.component.AbstractLifecycleComponent - Calling uncaughtExceptionHandler
2022-05-25T19:02:14,299-0400 [main] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Triggered exceptionHandler of Component: autorecovery-server because of Exception in Thread: Thread[main,5,main]
java.lang.RuntimeException: java.io.IOException: Failed to bind to /0.0.0.0:8000
at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:114) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.server.service.StatsProviderService.doStart(StatsProviderService.java:51) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:83) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$4(LifecycleComponentStack.java:144) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) [com.google.guava-guava-30.1-jre.jar:?]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:144) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:85) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.replication.AutoRecoveryMain.doMain(AutoRecoveryMain.java:334) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.replication.AutoRecoveryMain.main(AutoRecoveryMain.java:308) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
Caused by: java.io.IOException: Failed to bind to /0.0.0.0:8000
at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:349) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.Server.doStart(Server.java:401) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:111) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
... 8 more
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method) ~[?:1.8.0_332]
at sun.nio.ch.Net.bind(Net.java:461) ~[?:1.8.0_332]
at sun.nio.ch.Net.bind(Net.java:453) ~[?:1.8.0_332]
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222) ~[?:1.8.0_332]
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85) ~[?:1.8.0_332]
at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:344) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:310) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:234) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.server.Server.doStart(Server.java:401) ~[org.eclipse.jetty-jetty-server-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73) ~[org.eclipse.jetty-jetty-util-9.4.43.v20210629.jar:9.4.43.v20210629]
at org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider.start(PrometheusMetricsProvider.java:111) ~[org.apache.bookkeeper.stats-prometheus-metrics-provider-4.14.4.jar:4.14.4]
... 8 more
2022-05-25T19:02:14,299-0400 [component-shutdown-thread] INFO org.apache.bookkeeper.common.component.ComponentStarter - Closing component autorecovery-server in shutdown hook.
2022-05-25T19:02:14,301-0400 [main] INFO org.apache.bookkeeper.common.component.ComponentStarter - Started component autorecovery-server.
2022-05-25T19:02:14,301-0400 [component-shutdown-thread] ERROR org.apache.bookkeeper.common.component.ComponentStarter - Failed to close component autorecovery-server in shutdown hook gracefully, Exiting anyway
java.lang.IllegalStateException: Can't move to closed before moving to stopped mode
at org.apache.bookkeeper.common.component.Lifecycle.moveToClosed(Lifecycle.java:185) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:121) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$close$6(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) ~[com.google.guava-guava-30.1-jre.jar:?]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.close(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.ComponentStarter$ComponentShutdownHook.run(ComponentStarter.java:47) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_332]
2022-05-25T19:02:14,303-0400 [main] ERROR org.apache.bookkeeper.replication.AutoRecoveryMain - Error in bookie shutdown
java.lang.IllegalStateException: Can't move to closed before moving to stopped mode
at org.apache.bookkeeper.common.component.Lifecycle.moveToClosed(Lifecycle.java:185) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:121) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$close$6(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) ~[com.google.guava-guava-30.1-jre.jar:?]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.close(LifecycleComponentStack.java:154) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at org.apache.bookkeeper.common.component.ComponentStarter$ComponentShutdownHook.run(ComponentStarter.java:47) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_332]
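
The trace above bottoms out in `java.net.BindException: Address already in use` for `0.0.0.0:8000`, raised while `PrometheusMetricsProvider` starts its HTTP endpoint. One plausible cause, stated here as an assumption rather than something the log confirms, is that the bookie on the same host already serves its own stats on port 8000, in which case the standalone process would need a different stats port (the relevant setting is believed to be `prometheusStatsHttpPort`). The sketch below is a quick way to check whether a port is currently taken; `port_is_free` is a hypothetical helper name.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": something else holds the port.
            return False
        return True


if __name__ == "__main__":
    # 8000 is the port named in the trace above.
    print("port 8000 free:", port_is_free(8000))
```

The probe does not set `SO_REUSEADDR`, so it can report a port as busy while old connections linger on it. Treat a `False` result as a hint to inspect the host further, not as proof of a conflict.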
[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137505624
The restart is caused by the auto-recovery component of the bookies. In general, it is better to run auto-recovery as a separate service (it's completely stateless) rather than as part of the bookies.
That way, the bookies will not restart on ZK session loss.
[GitHub] [bookkeeper] merlimat commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
merlimat commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137742428
> Thanks merlimat,
> I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
> and the issue is still happening; it looks like auto-recovery is still trying to run:
> use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
@GBM-tamerm yes, the auto-recovery process will still restart, though the bookie process won't do that anymore.
It will not be a problem, since auto-recovery runs in the background and won't cause any disruption to existing clients.
[GitHub] [bookkeeper] GBM-tamerm commented on issue #3292: Bookkeeper shutdown when we stop ZK leader node - Pulsar V2.9.2
Posted by GitBox <gi...@apache.org>.
GBM-tamerm commented on issue #3292:
URL: https://github.com/apache/bookkeeper/issues/3292#issuecomment-1137781814
> > Thanks merlimat,
> > I disabled the auto-recovery component for the bookies by running `bookkeeper shell autorecovery -disable`,
> > and the issue is still happening; it looks like auto-recovery is still trying to run:
> > use of Exception in Thread: Thread[AutoRecoveryDeathWatcher-3181,5,main]
>
> @GBM-tamerm yes, the auto-recovery process will still restart, though the bookie process won't do that anymore.
>
> It will not be a problem since auto-recovery runs in background and won't cause any disruptions to existing clients.
But it is causing an issue, as shown in the exception trace above.
Auto-recovery fails when the ZK leader is stopped and a new election starts, and when it fails it still shuts down the bookie nodes that run auto-recovery, even though I manually stopped auto-recovery before shutting down the ZK leader.
What is the solution?