Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/10/30 10:04:53 UTC

[GitHub] [pulsar] lhotari commented on issue #6259: Retry when zookeeper session expired at broker side.

lhotari commented on issue #6259:
URL: https://github.com/apache/pulsar/issues/6259#issuecomment-719460964


   I was running into issues where the Pulsar client application logs showed timeouts while the Pulsar brokers were restarting. This happened during a load test, and the broker restarts seemed to make the problem worse.
   
   To mitigate the issue, the brokers were recently configured with the `zookeeperSessionExpiredPolicy=reconnect` setting. The broker's ZooKeeper timeout settings remain at the default values (`zooKeeperSessionTimeoutMillis=30000`, `zooKeeperOperationTimeoutSeconds=30`), since raising them seemed unjustified: CPU consumption on the ZooKeeper nodes was very low during the load test, so the ZooKeeper interactions themselves shouldn't be taking longer.
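   
   For reference, this is the relevant slice of the broker configuration described above, as a sketch (the keys come straight from the text; they go in the broker's `broker.conf`):
   ```
   # reconnect to ZooKeeper on session expiry instead of shutting the broker down
   zookeeperSessionExpiredPolicy=reconnect
   # session/operation timeouts left at their defaults
   zooKeeperSessionTimeoutMillis=30000
   zooKeeperOperationTimeoutSeconds=30
   ```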
   
   Now, when we reran the load test, the brokers became unavailable. The logs were filled with errors of this type:
   ```
   07:37:09.740 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
   07:37:11.740 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
   07:37:13.741 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
   07:37:15.741 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
   ```
   This is with `zookeeperSessionExpiredPolicy=reconnect`. Previously, with the default setting (`zookeeperSessionExpiredPolicy=shutdown`), a broker would resume operations after restarting. Now all brokers in the 3-node cluster become unavailable.
   
   The benefit of using `zookeeperSessionExpiredPolicy=reconnect` is that it was possible to take a thread dump from the system while it was unavailable (a sketch of capturing one follows below), although the broker never resumed operations.
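   
   A thread dump like the one linked below can be captured with standard JDK tooling; a minimal sketch, assuming shell access to the broker host (`<broker-pid>` is a placeholder for the broker's process id):
   ```
   # find the broker's PID (look for org.apache.pulsar.PulsarBrokerStarter)
   jps -l
   # dump all thread stacks, including lock information
   jstack -l <broker-pid> > broker_threaddump.txt
   ```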
   
   In the thread dump there are quite a few hung threads. Some of the stack traces are similar to the ones that #8406 fixes, for example:
   ```
   "ForkJoinPool.commonPool-worker-7" #1258 daemon prio=5 os_prio=0 tid=0x00007f67a09f9000 nid=0x50a waiting on condition [0x00007f676c258000]
      java.lang.Thread.State: WAITING (parking)
   	at sun.misc.Unsafe.park(Native Method)
   	- parking to wait for  <0x000000044ca60338> (a java.util.concurrent.CompletableFuture$Signaller)
   	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
   	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
   	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
   	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
   	at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache$LoadingCacheView.get(LocalAsyncLoadingCache.java:400)
   	at org.apache.pulsar.common.naming.NamespaceBundleFactory.getBundles(NamespaceBundleFactory.java:160)
   	at org.apache.pulsar.broker.namespace.NamespaceService.getBundle(NamespaceService.java:192)
   	at org.apache.pulsar.broker.namespace.NamespaceService.isTopicOwned(NamespaceService.java:913)
   	at org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:877)
   	at org.apache.pulsar.broker.service.BrokerService.checkTopicNsOwnership(BrokerService.java:1359)
   	at org.apache.pulsar.broker.service.persistent.PersistentTopic.subscribe(PersistentTopic.java:521)
   ```
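   
   The pattern in that trace is a blocking `CompletableFuture.get()` running on a common ForkJoinPool worker thread. A minimal, hypothetical Java sketch of the hazard (not Pulsar's actual code; `loadBundlesFromZooKeeper` is a made-up stand-in for the metadata read):
   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutionException;
   
   public class BlockingGetHazard {
   
       public static void main(String[] args) throws ExecutionException, InterruptedException {
           // Async load scheduled on the common ForkJoinPool, analogous to a
           // Caffeine AsyncLoadingCache load triggered by getBundles().
           CompletableFuture<String> bundles =
                   CompletableFuture.supplyAsync(BlockingGetHazard::loadBundlesFromZooKeeper);
   
           // The stack trace above parks inside exactly this kind of call:
           // get() without a timeout blocks forever if the load never
           // completes, e.g. because the ZooKeeper session is gone.
           String result = bundles.get();
           System.out.println(result);
       }
   
       // Hypothetical stand-in for the metadata read that stalls when the
       // ZooKeeper session has expired.
       private static String loadBundlesFromZooKeeper() {
           return "namespace-bundle-data";
       }
   }
   ```
   A bounded `get(timeout, unit)`, or composing the future asynchronously instead of blocking, would avoid parking the worker thread indefinitely.
   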
   Full thread dump of broker:
   https://gist.github.com/lhotari/ba9c45cf4111a66f5e8960df4955bbf8
   Online thread dump report: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMzAvLS1icm9rZXJfdGhyZWFkZHVtcC50eHQtLTEwLTAtMzc=&
   
   The reason I'm bringing this up in this issue is that a ZooKeeper timeout exception in the broker logs can actually be caused by threads hanging in the broker itself, rather than by slow ZooKeeper servers. Hopefully #8406 also gets fixed in the upcoming Pulsar 2.6.2 release.

