Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2019/10/14 19:18:05 UTC

[GitHub] [pulsar] addisonj opened a new issue #5383: deleteSubscription fails and leaves a dangling thread in jetty threadpool

URL: https://github.com/apache/pulsar/issues/5383
 
 
   **Describe the bug**
   For unknown reasons, a `deleteSubscription` call can fail and it leaves around a dangling thread in the the broker, specifically the Jetty server threadpool. If this happens too many times, the broker can start having issues as Jetty is starved of threads.
   
   **To Reproduce**
   Not sure exactly what triggers it, but the following usually gets me into this state.
   Steps to reproduce the behavior:
   1. Use `pulsar-flink` (https://github.com/streamnative/pulsar-flink) to create a subscription with the `earliest` start position
   2. Have your Flink app fail very quickly
   3. Using jstack, notice the dangling thread waiting for the subscription deletion to complete (see the sketch below)
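   
   For anyone who wants to reproduce this without a full Flink job, here is a minimal sketch using the plain Pulsar Java admin client instead of the pulsar-flink connector (the admin URL, topic, and subscription name are placeholders, and this is my guess at an equivalent code path, not the connector's exact behavior): it creates a subscription at `earliest` on a quiet topic and deletes it right away.
   ```
   import org.apache.pulsar.client.admin.PulsarAdmin;
   import org.apache.pulsar.client.api.MessageId;

   public class ReproSketch {
       public static void main(String[] args) throws Exception {
           try (PulsarAdmin admin = PulsarAdmin.builder()
                   .serviceHttpUrl("http://localhost:8080") // placeholder admin endpoint
                   .build()) {
               String topic = "persistent://public/default/repro-topic"; // placeholder topic
               String sub = "repro-sub";                                 // placeholder name

               // Create a subscription positioned at the earliest message, which is
               // roughly what the pulsar-flink connector does on startup
               admin.topics().createSubscription(topic, sub, MessageId.earliest);

               // Delete it before any messages flow -- this is the admin call whose
               // broker-side future can hang and pin a Jetty worker thread
               admin.topics().deleteSubscription(topic, sub);
           }
       }
   }
   ```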
   
   **Expected behavior**
   The subscription deletion should either succeed or result in a proper failure that doesn't leave a dangling thread.
   
   **Screenshots**
   In the logs, this is what I see when the call fails:
   ```
   00:51:09.507 [pulsar-web-33-3] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://code/gerrit-events/bridge-talent-management:ref-updated][flink-pulsar-16a30186-aa62-4c3b-b576-612da2bfbf51] Unsubscribing
   00:51:09.507 [pulsar-web-33-3] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://code/gerrit-events/bridge-talent-management:ref-updated][flink-pulsar-16a30186-aa62-4c3b-b576-612da2bfbf51] Successfully closed subscription [ManagedCursorImpl{ledger=code/gerrit-events/persistent/bridge-talent-management%3Aref-updated, name=flink-pulsar-16a30186-aa62-4c3b-b576-612da2bfbf51, ackPos=608:-1, readPos=608:-1}]
   00:51:09.508 [pulsar-web-33-3] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper - [code/gerrit-events/persistent/bridge-talent-management%3Aref-updated] Remove consumer=flink-pulsar-16a30186-aa62-4c3b-b576-612da2bfbf51
   00:51:09.510 [bookkeeper-ml-workers-OrderedExecutor-2-0] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught 
   java.lang.IllegalArgumentException: inconsistent range
           at java.util.concurrent.ConcurrentSkipListMap$SubMap.<init>(ConcurrentSkipListMap.java:2620) ~[?:1.8.0_212]
           at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2078) ~[?:1.8.0_212]
           at org.apache.bookkeeper.mledger.util.RangeCache.removeRange(RangeCache.java:132) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.mledger.impl.EntryCacheImpl.invalidateEntries(EntryCacheImpl.java:151) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$7.operationComplete(ManagedLedgerImpl.java:777) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$7.operationComplete(ManagedLedgerImpl.java:765) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$13(MetaStoreImplZookeeper.java:308) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-original-2.4.1-inst2.jar:2.4.1-inst2]
           at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
           at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
   ```
   
   This then results in the following thread sticking around waiting for a future to complete:
   ```
   "pulsar-web-33-11" #95 prio=5 os_prio=0 tid=0x00007ff06a5b0000 nid=0xe2 waiting on condition [0x00007ff0048b7000]
      java.lang.Thread.State: WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x000000008553c2b0> (a java.util.concurrent.CompletableFuture$Signaller)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
           at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
           at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
           at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
           at org.apache.pulsar.broker.admin.impl.PersistentTopicsBase.internalDeleteSubscription(PersistentTopicsBase.java:770)
           at org.apache.pulsar.broker.admin.v2.PersistentTopics.deleteSubscription(PersistentTopics.java:535)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
   
   ...
   ```
   
   
   **Desktop (please complete the following information):**
   Pulsar 2.4.1 on Linux
   
   **Additional context**
   It seems like there are two problems here:
   1. Why the `inconsistent range` `IllegalArgumentException` happens in the first place. I think what happens here is that `PositionImpl slowestConsumerPosition = cursors.getSlowestReaderPosition();` can return `-1` or maybe `0`, which is then out of range for the entry cache. I think this happens when a topic and its subscription never receive any messages, and it may only happen when using the "reader" API. This seems like a fairly minor problem and could probably be solved by just catching that exception properly (a sketch of how the exception arises follows this list).
   2. The real problem appears to be that we do an unbounded `get` on a future in the web API. It seems like every `future.get` should have a timeout to prevent threads from sticking around forever (see the timeout sketch below).
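   
   For the first problem, here is a standalone sketch (using `Long` keys purely for illustration; the real `RangeCache` is keyed by positions) of how `ConcurrentSkipListMap.subMap` produces exactly this exception when the range bounds end up inverted:
   ```
   import java.util.concurrent.ConcurrentSkipListMap;

   public class InconsistentRangeDemo {
       public static void main(String[] args) {
           ConcurrentSkipListMap<Long, String> cache = new ConcurrentSkipListMap<>();
           cache.put(5L, "cached entry");

           // subMap requires fromKey <= toKey; a "slowest reader position" that
           // resolves below the start of the cached range inverts the bounds and throws:
           //   java.lang.IllegalArgumentException: inconsistent range
           cache.subMap(10L, true, 2L, false);
       }
   }
   ```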
   
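   For the second problem, here is a sketch of what a bounded wait could look like (the 30-second timeout and the helper name are illustrative, not anything Pulsar currently does):
   ```
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;

   public class BoundedGetSketch {
       // Hypothetical wrapper around the future returned by the unsubscribe path
       static void awaitDeletion(CompletableFuture<Void> deleteFuture) {
           try {
               // Bound the wait so a lost completion cannot pin a Jetty worker forever
               deleteFuture.get(30, TimeUnit.SECONDS);
           } catch (TimeoutException e) {
               // Fail the HTTP request cleanly instead of leaving a dangling thread
               throw new RuntimeException("Subscription deletion timed out", e);
           } catch (InterruptedException e) {
               Thread.currentThread().interrupt();
               throw new RuntimeException("Interrupted while deleting subscription", e);
           } catch (ExecutionException e) {
               throw new RuntimeException("Subscription deletion failed", e);
           }
       }
   }
   ```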
