You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/03/04 12:38:16 UTC
[GitHub] [pulsar] rfetishkevi commented on issue #3044: loadblance connect
zookeeper error, broker down
rfetishkevi commented on issue #3044: loadblance connect zookeeper error, broker down
URL: https://github.com/apache/pulsar/issues/3044#issuecomment-594496210
Are there any updates regarding this issue?
We are experiencing something very similar every 2-3 days. It looks like Zookeeper is disconnected on arbitrary broker and this makes whole pulsar cluster unavailable for a couple of minutes, which is fatal for our system.
We are running single Pulsar 2.4.1 cluster on a bare metal and haven't noticed any problems with network.
Such failures are beginning like described in this issue:
```
01:11:01: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 25 seconds
01:11:03: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 23 seconds
01:11:05: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 21 seconds
01:11:07: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 19 seconds
01:11:09: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 17 seconds
01:11:11: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 15 seconds
01:11:13: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 13 seconds
01:11:15: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 11 seconds
01:11:17: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 9 seconds
01:11:19: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 6 seconds
01:11:21: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 4 seconds
01:11:23: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 2 seconds
01:11:25: [pulsar-zk-session-watcher-5-1] WARN org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 0 seconds
```
Then there are two cases possible:
1. Broker restarts itself and gets a lot of errors like this:
```
08:15:33.649 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN org.apache.pulsar.broker.service.ServerCnx - [/<host>:<port>][persistent://xxx][xxx] Failed to create consumer: Connection was closed while the opening the cursor
java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException: Connection was closed while the opening the cursor
#011at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_181]
#011at org.apache.pulsar.broker.service.BrokerService$2.lambda$openLedgerComplete$1(BrokerService.java:668) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) ~[?:1.8.0_181]
#011at org.apache.pulsar.broker.service.BrokerService$2.openLedgerComplete(BrokerService.java:661) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$asyncOpen$7(ManagedLedgerFactoryImpl.java:328) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_181]
#011at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$2.initializeComplete(ManagedLedgerFactoryImpl.java:316) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$3$1.operationComplete(ManagedLedgerImpl.java:467) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.lambda$null$2(ManagedCursorImpl.java:362) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.client.LedgerHandle$6.onSuccess(LedgerHandle.java:802) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.client.LedgerHandle$6.onSuccess(LedgerHandle.java:799) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:42) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
#011at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_181]
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
#011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
#011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: org.apache.pulsar.broker.service.BrokerServiceException: Connection was closed while the opening the cursor
#011at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$subscribe$9(PersistentTopic.java:528) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:669) ~[?:1.8.0_181]
#011at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:1997) ~[?:1.8.0_181]
#011at org.apache.pulsar.broker.service.persistent.PersistentTopic.subscribe(PersistentTopic.java:514) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at org.apache.pulsar.broker.service.ServerCnx.lambda$null$10(ServerCnx.java:697) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952) ~[?:1.8.0_181]
#011... 27 more
08:15:33.651 [pulsar-io-20-4] WARN org.apache.pulsar.broker.service.ServerCnx - [/5.9.116.248:50190] Got exception NativeIoException : syscall:read(..) failed: Connection reset by peer
```
The same error appears at other brokers at the same time.
2. Broker remains up and running but there are lots of zookeeper related errors:
```
01:11:26.224 [pulsar-ordered-OrderedExecutor-1-0-SendThread(pulsar-prod-fsn-03.onde.in:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 20000ms for sessionid 0x301a288dd2d000a
01:11:26.325 [bookkeeper-ml-workers-OrderedExecutor-6-0] WARN org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://xxx
org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.ManagedLedgerException$MetaStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
Caused by: org.apache.bookkeeper.mledger.ManagedLedgerException$MetaStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
#011at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$0(MetaStoreImplZookeeper.java:154) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$18(MetaStoreImplZookeeper.java:380) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
#011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
#011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
01:11:26.326 [pulsar-load-manager-4-1] WARN org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Error writing broker data on ZooKeeper: {}
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /loadbalance/brokers/<host>:<port>
#011at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
#011at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
#011at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1336) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
#011at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$3101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.zookeeper.ZooKeeperClient$21.call(ZooKeeperClient.java:1065) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.zookeeper.ZooKeeperClient$21.call(ZooKeeperClient.java:1059) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.bookkeeper.zookeeper.ZooKeeperClient.setData(ZooKeeperClient.java:1059) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
#011at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.writeBrokerDataOnZooKeeper(ModularLoadManagerImpl.java:863) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.writeLoadReportOnZookeeper(ModularLoadManagerWrapper.java:120) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at org.apache.pulsar.broker.loadbalance.LoadReportUpdaterTask.run(LoadReportUpdaterTask.java:41) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
#011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_191]
#011at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_191]
#011at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_191]
#011at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_191]
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
#011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
#011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
01:11:26.327 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [xxx-xxx][xxx] Failed to delete ledger 13237042: Error while using ZooKeeper
```
Any ideas what can cause this issue? It's very important for us to make pulsar cluster more stable to avoid crashes every few days.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services