You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/03/04 12:38:16 UTC

[GitHub] [pulsar] rfetishkevi commented on issue #3044: loadblance connect zookeeper error, broker down

rfetishkevi commented on issue #3044: loadblance connect zookeeper error, broker down
URL: https://github.com/apache/pulsar/issues/3044#issuecomment-594496210
 
 
   Are there any updates regarding this issue? 
   We are experiencing something very similar every 2-3 days. It looks like Zookeeper is disconnected on arbitrary broker and this makes whole pulsar cluster unavailable for a couple of minutes, which is fatal for our system. 
   We are running single Pulsar 2.4.1 cluster on a bare metal and haven't noticed any problems with network.
   
   Such failures are beginning like described in this issue:
   ```
   01:11:01: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 25 seconds
   01:11:03: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 23 seconds
   01:11:05: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 21 seconds
   01:11:07: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 19 seconds
   01:11:09: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 17 seconds
   01:11:11: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 15 seconds
   01:11:13: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 13 seconds
   01:11:15: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 11 seconds
   01:11:17: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 9 seconds
   01:11:19: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 6 seconds
   01:11:21: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 4 seconds
   01:11:23: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 2 seconds
   01:11:25: [pulsar-zk-session-watcher-5-1] WARN  org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - zoo keeper disconnected, waiting to reconnect, time remaining = 0 seconds
   ```
   
   Then there are two cases possible:
   1. Broker restarts itself and gets a lot of errors like this:
   ```
    08:15:33.649 [BookKeeperClientWorker-OrderedExecutor-10-0] WARN  org.apache.pulsar.broker.service.ServerCnx - [/<host>:<port>][persistent://xxx][xxx] Failed to create consumer: Connection was closed while the opening the cursor
     java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.BrokerServiceException: Connection was closed while the opening the cursor
     #011at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_181]
     #011at org.apache.pulsar.broker.service.BrokerService$2.lambda$openLedgerComplete$1(BrokerService.java:668) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
     #011at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) ~[?:1.8.0_181]
     #011at org.apache.pulsar.broker.service.BrokerService$2.openLedgerComplete(BrokerService.java:661) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
     #011at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$asyncOpen$7(ManagedLedgerFactoryImpl.java:328) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
     #011at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_181]
     #011at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$2.initializeComplete(ManagedLedgerFactoryImpl.java:316) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
     #011at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$3$1.operationComplete(ManagedLedgerImpl.java:467) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
     #011at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl.lambda$null$2(ManagedCursorImpl.java:362) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
     #011at org.apache.bookkeeper.client.LedgerHandle$6.onSuccess(LedgerHandle.java:802) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
     #011at org.apache.bookkeeper.client.LedgerHandle$6.onSuccess(LedgerHandle.java:799) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
     #011at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:42) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
     #011at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
     #011at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_181]
     #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
     #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
     #011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
     #011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
     Caused by: org.apache.pulsar.broker.service.BrokerServiceException: Connection was closed while the opening the cursor
     #011at org.apache.pulsar.broker.service.persistent.PersistentTopic.lambda$subscribe$9(PersistentTopic.java:528) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
     #011at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:669) ~[?:1.8.0_181]
     #011at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:1997) ~[?:1.8.0_181]
     #011at org.apache.pulsar.broker.service.persistent.PersistentTopic.subscribe(PersistentTopic.java:514) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
     #011at org.apache.pulsar.broker.service.ServerCnx.lambda$null$10(ServerCnx.java:697) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
     #011at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952) ~[?:1.8.0_181]
     #011... 27 more
    
   08:15:33.651 [pulsar-io-20-4] WARN  org.apache.pulsar.broker.service.ServerCnx - [/5.9.116.248:50190] Got exception NativeIoException : syscall:read(..) failed: Connection reset by peer
   ```
   The same error appears at other brokers at the same time.
   
   2. Broker remains up and running but there are lots of zookeeper related errors:
   ```
   01:11:26.224 [pulsar-ordered-OrderedExecutor-1-0-SendThread(pulsar-prod-fsn-03.onde.in:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 20000ms for sessionid 0x301a288dd2d000a
    
   01:11:26.325 [bookkeeper-ml-workers-OrderedExecutor-6-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://xxx
    org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.ManagedLedgerException$MetaStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    Caused by: org.apache.bookkeeper.mledger.ManagedLedgerException$MetaStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
    #011at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$0(MetaStoreImplZookeeper.java:154) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
    #011at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$18(MetaStoreImplZookeeper.java:380) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
    #011at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-original-2.4.1.jar:2.4.1]
    #011at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
    #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
    #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
    #011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
    #011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
    
   01:11:26.326 [pulsar-load-manager-4-1] WARN  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Error writing broker data on ZooKeeper: {}
    org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /loadbalance/brokers/<host>:<port>
    #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
    #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
    #011at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1336) ~[org.apache.pulsar-pulsar-zookeeper-2.4.1.jar:2.4.1]
    #011at org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$3101(ZooKeeperClient.java:70) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
    #011at org.apache.bookkeeper.zookeeper.ZooKeeperClient$21.call(ZooKeeperClient.java:1065) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
    #011at org.apache.bookkeeper.zookeeper.ZooKeeperClient$21.call(ZooKeeperClient.java:1059) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
    #011at org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
    #011at org.apache.bookkeeper.zookeeper.ZooKeeperClient.setData(ZooKeeperClient.java:1059) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2]
    #011at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.writeBrokerDataOnZooKeeper(ModularLoadManagerImpl.java:863) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
    #011at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.writeLoadReportOnZookeeper(ModularLoadManagerWrapper.java:120) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
    #011at org.apache.pulsar.broker.loadbalance.LoadReportUpdaterTask.run(LoadReportUpdaterTask.java:41) ~[org.apache.pulsar-pulsar-broker-2.4.1.jar:2.4.1]
    #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_191]
    #011at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_191]
    #011at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_191]
    #011at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_191]
    #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_191]
    #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_191]
    #011at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
    #011at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
   
   01:11:26.327 [BookKeeperClientWorker-OrderedExecutor-2-0] WARN  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [xxx-xxx][xxx] Failed to delete ledger 13237042: Error while using ZooKeeper
   ```
   
   Any ideas what can cause this issue? It's very important for us to make pulsar cluster more stable to avoid crashes every few days.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services