You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/02/16 02:50:05 UTC

[GitHub] [pulsar] fu-turer opened a new issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

fu-turer opened a new issue #14311:
URL: https://github.com/apache/pulsar/issues/14311


   **Describe the bug**
   we found many timeout logs when creating producer or sending message.like this,we creat a new topic and then build a producer to send message:
   ```2022-02-15 12:50:06,265Z [pulsar-io-4-2] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.244.2.27:44396] Failed to create topic persistent://public/default/test-22-partition-0, producerId=0
   java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout 
           at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
           at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
           at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
           at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
           at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
   ```
   we check the stack trace of create ledger and found that when creating ledger complete it can't get the thread `BookKeeperClientWorker ` to do callback's work
   
   ![image](https://user-images.githubusercontent.com/25685998/154185897-b6b5896f-cd0a-4b60-935d-823bcfef38a4.png)
   
   we dump tread stack, `BookKeeperClientWorker-OrderedExecutor-0-0` is always BLOCKED 
   ```
   "BookKeeperClientScheduler-OrderedScheduler-0-0" #26 prio=5 os_prio=0 cpu=61864.97ms elapsed=268089.10s tid=0x00007f88598ef800 nid=0x17b in Object.wait()  [0x00007f87fa5e1000]
      java.lang.Thread.State: WAITING (on object monitor)
           at java.lang.Object.wait(java.base@11.0.12/Native Method)
           - waiting on <no object reference available>
           at java.lang.Object.wait(java.base@11.0.12/Unknown Source)
           at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:275)
           - waiting to re-lock in wait() <0x00000000cb13f9c0> (a io.netty.channel.DefaultChannelPromise)
           at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
           at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.closeInternal(PerChannelBookieClient.java:1081)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1034)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1029)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.failTLS(PerChannelBookieClient.java:2555)
           - locked <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.access$2900(PerChannelBookieClient.java:154)
           at org.apache.bookkeeper.proto.PerChannelBookieClient$StartTLSCompletion.errorOut(PerChannelBookieClient.java:1961)
           at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.timeout(PerChannelBookieClient.java:1614)
           at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.maybeTimeout(PerChannelBookieClient.java:1606)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.lambda$static$3(PerChannelBookieClient.java:1011)
           at org.apache.bookkeeper.proto.PerChannelBookieClient$$Lambda$596/0x00000001006f8440.test(Unknown Source)
           at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap$Section.removeIf(ConcurrentOpenHashMap.java:411)
           at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap.removeIf(ConcurrentOpenHashMap.java:172)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.checkTimeoutOnPendingOperations(PerChannelBookieClient.java:1015)
           at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.checkTimeoutOnPendingOperations(DefaultPerChannelBookieClientPool.java:132)
           at org.apache.bookkeeper.proto.BookieClientImpl.monitorPendingOperations(BookieClientImpl.java:572)
           at org.apache.bookkeeper.proto.BookieClientImpl.lambda$new$0(BookieClientImpl.java:131)
           at org.apache.bookkeeper.proto.BookieClientImpl$$Lambda$163/0x000000010038e840.run(Unknown Source)
           at org.apache.bookkeeper.util.SafeRunnable$1.safeRun(SafeRunnable.java:43)
           at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
           at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:705)
           at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.12/Unknown Source)
           at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.12/Unknown Source)
           at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.12/Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
           at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           at java.lang.Thread.run(java.base@11.0.12/Unknown Source)
   
   "BookKeeperClientWorker-OrderedExecutor-0-0" #27 prio=5 os_prio=0 cpu=616668.69ms elapsed=268089.10s tid=0x00007f88598f3000 nid=0x17c waiting for monitor entry  [0x00007f87fa4e0000]
      java.lang.Thread.State: BLOCKED (on object monitor)
           at org.apache.bookkeeper.proto.PerChannelBookieClient.connectIfNeededAndDoOp(PerChannelBookieClient.java:631)
           - waiting to lock <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
           at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:121)
           at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:116)
           at org.apache.bookkeeper.proto.BookieClientImpl.addEntry(BookieClientImpl.java:329)
           at org.apache.bookkeeper.client.PendingAddOp.sendWriteRequest(PendingAddOp.java:152)
           at org.apache.bookkeeper.client.PendingAddOp.safeRun(PendingAddOp.java:278)
           at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
           at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
           at java.lang.Thread.run(java.base@11.0.12/Unknown Source)
   
   ```
   
   **To Reproduce**
   i don't known how to reproduce
   
   **Additional context**
   version:2.9.1
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1046160643


   @fu-turer thanks for opening this issue and providing the thread dumps. It looks like the blocked thread is in the bookkeeper client. Based on your thread stack, I'm assuming you're running with TLS enabled, is that correct? Do you see a log line with `TLS failure on` in it? It should end with a number, and that number is a bk error code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] github-actions[bot] commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1079562395


   The issue had no activity for 30 days, mark with Stale label.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] fu-turer commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
fu-turer commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1048576101


   @dlg99 i use TLSv1.2.
   
   @michaeljmarshall there are 3 bookies, 2 brokers in my test environment. before problem happen, 3 bookies are all down, after all bookies recover, one broker's thead in stuck, and the anther one is normal


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dlg99 commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1048514697


   @fu-turer is there anything in the logs related to not being able to connect to bookies because of e.g. a TLS issue? Or attempt to use TLS 1.3? see https://github.com/apache/bookkeeper/issues/2711
   
   @fistan684 are you rotating TLS certificates every few hours? any chance that certificates occasionally expire before new ones deployed on the bookie nodes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] fistan684 commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
fistan684 commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1045254332


   Since upgrading from 2.7.1 to 2.9.1 we also have been hitting issues where our producers will begin failing. Typically we'll have one broker that isn't responding to requests and restarting it will fix the issue. From looking at the broker logs we see a similar error to what is describe in this issue:
   
   `2022-02-18T17:35:06,772+0000 [pulsar-io-4-7] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.100.209.43:51632][persistent://prod/voltron-general/871_2b2d84be150dcf9c_MAID_DELETE_6333758_4bb66664126194f7-partition-0][voltron] Failed
    to create consumer: consumerId=0, Failed to load topic within timeout                                                                                                                                                                         
   java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout                                                                                            
   ...                                                                                                                                       
       at org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$1(FutureUtil.java:141) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1]                                                                                            
       at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final]                                                                                                                 
   ...                                                                                                                                                                                       
   Caused by: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout`
   
   This seems to be happening randomly every 4-7 hours since we upgraded. We typically write to a lot of topics in a given namespace. We'll try to capture thread state next time it happens.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] dlg99 commented on issue #14311: creat ledger timeout and thread BookKeeperClientWorker in stuck

Posted by GitBox <gi...@apache.org>.
dlg99 commented on issue #14311:
URL: https://github.com/apache/pulsar/issues/14311#issuecomment-1049083085


   > before problem happen, 3 bookies are all down
   
   this would explain `failTLS` in the stack :) Why are they all going down?
   will the thread unblock after some time (some timeouts can be in the range of 30sec or so by default)
   did the bookie IP address change after the bookie came back up? JVM caches dns lookups etc.
   
   bookie client (pulsar broker side) should log what it is doing, trying to connect/failing, certificate errors etc.
   I recommend checking the log (at least to make sure that bookie client logs there). With that please attach a full log from the client side and from the bookies and a full thread dump. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org