You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/05/13 06:33:42 UTC

[GitHub] [pulsar] devinbost opened a new issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

devinbost opened a new issue #6054:
URL: https://github.com/apache/pulsar/issues/6054

**Describe the bug**
Topics randomly freeze, causing catastrophic topic outages on a weekly (or more frequent) basis. This has been an issue as long as my team has used Pulsar, and it's been communicated to a number of folks on the Pulsar PMC committee.

(I thought an issue was already created for this bug, but I couldn't find it anywhere.)

**To Reproduce**
We have not figured out how to reproduce the issue. It's random (seems to be non-deterministic) and doesn't seem to have any clues in the broker logs.

**Expected behavior**
Topics should never just randomly stop working to where the only resolution is restarting the problem broker.

**Steps to Diagnose and Temporarily Resolve**
![image](https://user-images.githubusercontent.com/7418031/72367014-a3e81600-36b8-11ea-8e50-ea4ce7c9b329.png)
**Step 2**: Check the rate out on the topic. (click on the topic in the dashboard, or do a stats on the topic and look at the "msgRateOut")

If the rate out is 0 this is likely a frozen topic, but to verify do the following:

In the pulsar dashboard, click on the broker that topic is living on. If you see that there are multiple topic that have a rate out of 0, then proceed to the next step, if not it could potentially be another issue. Investigate further.
![image](https://user-images.githubusercontent.com/7418031/72367085-c843f280-36b8-11ea-8f99-d24ec1edc933.png)

![image](https://user-images.githubusercontent.com/7418031/72367102-d560e180-36b8-11ea-86e4-9f1078adb13b.png)

**Step 3**: Stop the broker on the server that the topic is living on. `pulsar-broker stop` .

**Step 4**: Wait for the backlog to be consumed and all the functions to be rescheduled. (typically wait for about 5-10 mins)

**Environment:**
```
Docker on bare metal running: `apachepulsar/pulsar-all:2.4.0`
on CentOS.
Brokers are the function workers.
```
This has been an issue with previous versions of Pulsar as well.

**Additional context**

Problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll back to 2.4.0 (which has the problem, but it's less frequent).
This is preventing the team from progressing in the use of Pulsar, and it's causing SLA problems with those who use our service.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839495126


   I traced the call chain for where ack's should be getting sent, and I filled in some additional steps:
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, which get picked up by `ServerCnx.handleSend(..)`, which goes to `Producer.publishMessage(..)`, then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, which polls `pendingAddEntries`, gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, which calls `MessagePublishContext.run()`, which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], which writes a `newSendReceiptCommand` to the channel.
   
   So, I added a lot of logging, and to my great surprise, I'm getting an NPE after adding these debug lines to `OpAddEntry.createOpAddEntry(..)`:
   
   ```
       private static OpAddEntry createOpAddEntry(ManagedLedgerImpl ml, ByteBuf data, AddEntryCallback callback, Object ctx) {
           log.debug("Running OpAddEntry.createOpAddEntry(..)");
           OpAddEntry op = RECYCLER.get();
           log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           op.ml = ml;
           op.ledger = null;
           op.data = data.retain();
           op.dataLength = data.readableBytes();
           op.callback = callback;
           op.ctx = ctx;
           op.addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
           op.closeWhenDone = false;
           op.entryId = -1;
           op.startTime = System.nanoTime();
           op.state = State.OPEN;
           ml.mbean.addAddEntrySample(op.dataLength);
           log.debug("2. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           return op;
       }
   ```
   
   Here's the stack trace:
   
   ```
      2021-05-12T06:14:19,569 [pulsar-io-28-32] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
       2021-05-12T06:14:19,569 [pulsar-io-28-32] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.20.20.160:37696] Got exception java.lang.NullPointerException
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.toString(OpAddEntry.java:354)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.createOpAddEntry(OpAddEntry.java:100)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.create(OpAddEntry.java:81)
       	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncAddEntry(ManagedLedgerImpl.java:689)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.asyncAddEntry(PersistentTopic.java:428)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.publishMessage(PersistentTopic.java:404)
       	at org.apache.pulsar.broker.service.Producer.publishMessageToTopic(Producer.java:224)
       	at org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:161)
       	at org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:1372)
       	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:207)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
       	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1376)
       	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1265)
       	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1302)
       	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
       	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
       	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
       	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
       	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
       	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       	at java.lang.Thread.run(Thread.java:748)
   ```
   
   The only place we could be throwing an NPE is here:
   `log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);`
   
   So, `op` isn't null until the null check passes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] hpvd commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

hpvd commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-741618223


   @kemburi which version are you using?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-797031711


   @devinbost Would be able to test with a 2.8.0-SNAPSHOT version? What might fix some lockup issues is #9787 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-817014223


   The negative permits value matches between the dispatcher and the broker Consumer for the corresponding topic:
   ![image](https://user-images.githubusercontent.com/7418031/114248073-0380b000-9954-11eb-84b2-4d0f49028cd0.png)
   ![image](https://user-images.githubusercontent.com/7418031/114248118-15625300-9954-11eb-84c0-8d39a56ebe21.png)
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839540327


   That `toString()` bug is addressed by this PR: https://github.com/apache/pulsar/pull/10548


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840891270


   @lhotari I've attached thread dumps from two of the brokers that were exhibiting symptoms. 
   
   [thread_dump_broker08.log](https://github.com/apache/pulsar/files/6475517/thread_dump_broker08.log)
   [thread_dump_broker10.log](https://github.com/apache/pulsar/files/6475518/thread_dump_broker10.log)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-844626492


   @rdhabalia This happened after I applied the default broker.conf settings (fixing the non-default configs for ManagedLedger and a few other things that we had) and restarted the brokers. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] rdhabalia closed issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

rdhabalia closed issue #6054:
URL: https://github.com/apache/pulsar/issues/6054


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840894974


   I noticed that in almost every case where the broker is getting a SEND command, it's producing a SEND_RECEIPT command:
   ```
   $ cat messages_broker10.log | grep 'Received cmd SEND' | wc -l
      20806
   $ cat messages_broker10.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      20806
   $ cat messages_broker8.log | grep 'Received cmd SEND' | wc -l
      44654
   $ cat messages_broker8.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      44651
   ```
   I assumed I'd see a gap between them, but unless something is creating some kind of logical deadlock that's causing it to stop emitting logs in between those commands, it seems like the problem is either before the SEND command or after the SEND_RECEIPT command. 
   
   With that said, I noticed that the `lastConfirmedEntry` isn't progressing on one of the threads after the freeze occurred:
   
   ```
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10636 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10672 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10696 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10747 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10775 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10783 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10815 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10840 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10886 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10915 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10916 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10917 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10918 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10919 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10920 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10944 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10945 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10948 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10952 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10954 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10957 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10959 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10963 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10974 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10977 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10982 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10984 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11029 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11036 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11039 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11051 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11053 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11065 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11072 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11077 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11086 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11179 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11190 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11199 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11239 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11252 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11258 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11264 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11358 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11359 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11385 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11437 -- last: 3215336:11436
   ```
   
   When I say "freeze," that may be misleading because it's not a complete lockup. We still get messages to produce, but it's so slow that it's almost equivalent to being frozen. As a comparison, here's one of the functions when it's healthy:
   ![image](https://user-images.githubusercontent.com/7418031/118199967-6365ed00-b411-11eb-871c-7b7f8f186404.png)
   
   Here's that same function when it's "frozen":
   ![image](https://user-images.githubusercontent.com/7418031/118200011-78428080-b411-11eb-80ee-c802471b3c4c.png)
   
   So, over 1 minute, it's processing 71 messages instead of 160,851 messages.
   
   I also noticed this warning in the logs after it had been stuck for a while, but I'm not sure if it's related:
   
   ```
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,634 [main-SendThread(server10.example.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd038f for sever server10.example.com/10.20.69.29:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:16 server10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 10302ms for session id 0x1028ed903dd038f
   May 12 21:51:16 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,634 [main-SendThread(server10.example.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever server10.example.com/10.20.69.29:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:16 server10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 12554ms for session id 0x1028ed903dd0390
   May 12 21:51:16 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 server10 cb3b837bde84: 2021-05-13T03:51:16,635 [NIOWorkerThread-72] WARN org.apache.zookeeper.server.NIOServerCnxn - Unable to read additional data from client sessionid 0x1028ed903dd038f, likely client has closed socket
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,637 [pulsar-io-28-20] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.40:44725
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,637 [pulsar-io-28-21] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.41:58126
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,639 [pulsar-io-28-19] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.41:45541
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:36462] Received cmd PING
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x1f7c0cfe, L:/10.20.69.29:6650 - R:/10.20.69.28:36462]] Replying back to ping message
   May 12 21:51:16 server10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:36462] Received cmd PONG
   May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,036 [pulsar-io-28-19] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xb055a726, L:/10.20.69.29:6651 - R:/10.20.69.28:36414]] Sending ping message
   May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-SendThread(server09.example.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x1028ed903dd0390 has expired
   May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-EventThread] ERROR org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client connection to the ZooKeeper server has expired!
   May 12 21:51:17 server10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-SendThread(server09.example.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever server09.example.com/10.20.69.28:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:17 server10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x1028ed903dd0390 has expired
   May 12 21:51:17 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1419) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 server10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-873367594


   I mapped out more of the ack flow, so I will add it to what I documented [here](https://github.com/apache/pulsar/issues/6054#issuecomment-839495126) and make it more readable. 
   
   The bracket notation below is intended to specify an instance of the class (to distinguish from a static method call.)
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, 
   which get picked up by `ServerCnx.handleSend(..)`, 
   which goes to `Producer.publishMessage(..)`, 
   then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, 
   which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, 
   which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, 
   which polls `pendingAddEntries`, 
   gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) 
   and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, 
   which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, 
   which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, 
   which calls `MessagePublishContext.run()`, 
   which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], 
   which writes a `newSendReceiptCommand` to the channel.
   
   From there, the client gets the `SEND_RECEIPT` command from `PulsarDecoder`, 
   which calls `[ClientCnx].handleSendReceipt(..)`, 
   which calls `[ProducerImpl].ackReceived(..)`, 
   which calls `releaseSemaphoreForSendOp(..)` to release the semaphore. 
   The semaphore doesn't block until `maxPendingMessages` is reached, which is 1000 by default. 
   
   
   After a rollover, `pendingAddEntries` is also polled. (Could this access race with the other path that's polling `pendingAddEntries`?) Here's how this happens: 
   
   Something triggers `[ManagedLedgerImpl].createComplete(..)`, 
   which calls `[ManagedLedgerImpl].updateLedgersListAfterRollover(MetaStoreCallback<Void>)`,
   which calls `[MetaStoreImpl].asyncUpdateLedgerIds(..)`,
   which calls `callback.operationComplete(..)` on the `MetaStoreCallback<Void>`,
   which calls `[ManagedLedgerImpl].updateLedgersIdsComplete(..)`,
   which polls for `pendingAddEntries`
   which calls `op.initiate()` on each `OpAddEntry` in `pendingAddEntries`,
   which calls `ledger.asyncAddEntry(..)`,
   which calls `[LedgerHandle].asyncAddEntry(..)`,
   which calls `[LedgerHandle].doAsyncAddEntry(op)`,
     which adds the `op` to `pendingAddOps` on the `LedgerHandle` instance,
   which calls `[PendingAddOp].cb.addCompleteWithLatency(..)`
   which calls the callback on `[PendingAddOp]`, which is `[OpAddEntry]`
   which calls `addCompleteWithLatency` on the `AddCallback` interface,
   which calls `[OpAddEntry].addComplete(..)`
   which, if it succeeded:
        changes its state and recycles it if completed or if there was a failure, it triggers `[OpAddEntry].safeRun()`
       which removes itself from `[ManagedLedgerImpl].pendingAddEntries`
       and triggers `cb.addComplete(..)` on `[PersistentTopic]`,
       which triggers `publishContext.completed(..)` on `[Producer.MessagePublishContext]`
       which triggers `producer.publishOperationCompleted()`
       which decrements `pendingPublishAcks`
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-658404472


   @sijie @codelipenghui We have confirmed that this is actually still an issue in Pulsar 2.5.2. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-832454134

It turns out that the client producer isn't getting ack's from the broker when the subscription has frozen. The ack's just mysteriously stop.

I walked through the code flow all the way from when the producer starts the process of sending a message to when the ack is sent back to the client.

Simplified, that flow looks like this:

Client builds `newSend` command and drops into executor -> `PulsarDecoder.handleSend(..)` -> `ServerCnx.handleSend(..)` -> `Producer.publishMessage(..)` -> `PersistentTopic.publishMessage(..)`, which writes to the ledger and triggers a callback on the `Producer.MessagePublishContext` instance, which triggers `MessagePublishContext.run()` -> `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)`, which writes a new `CommandSendReceipt` to the Netty channel.
Then, that gets picked up by `PulsarDecoder.handleSendReceipt(..)` -> `ClientCnx.handleSendReceipt(..)`, which writes the log message:

> Got receipt for producer . . .

But, that log line is never reached when the subscription has frozen.

So, the question is: Where in that flow did it stop?

My plan is to add a bunch of debug statements to each method in that flow in a custom build to try to pinpoint where the flow is stopping. It very much seems to be another concurrency issue taking place.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-875106224






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-800868920


   I created a test cluster (on fast hardware) specifically for reproducing this issue. In our very simple function flow, using simple Java functions without external dependencies, on Pulsar **2.6.3**, as soon as we started flowing data (around 4k msg/sec at 140KB/msg average), within seconds the bug appeared (as expected), blocking up the flow, and causing a backlog to accumulate.
   
   I looked up the broker the frozen topic was running on and got heap dumps and thread dumps of the broker and functions running on that broker. 
   There was nothing abnormal in the thread dumps that I could find. However, the topic stats and internal stats seem to have some clues. 
   
   I've attached the topic stats and internal topic stats of the topic upstream from the frozen topic, the frozen topic, and the topic immediately downstream from the frozen topic. 
   The flow looks like this:
   
   -> `first-function` -> `first-topic` -> `second-function` -> `second-topic` -> `third-function` -> `third-topic` -> fourth-function -> 
   
   The `second-topic` is the one that froze and started accumulating backlog. 
   
   The `second-topic` reports -45 available permits for its consumer, `third-function`. 
   All three topics report 0 pendingReadOps. 
   `first-topic` and `third-topic` have waitingReadOp = true, indicating the subscriptions are waiting for messages. 
   `second-topic` has waitingReadOp = false, indicating its subscription hasn't caught up or isn't waiting for messages. 
   `second-topic` reports waitingCursorsCount = 0, so it has no cursors waiting for messages. 
   
   `third-topic` has pendingAddEntriesCount = 81, indicating it's waiting for write requests to complete. 
   `first-topic` and `second-topic` have pendingAddEntriesCount = 0
   
   `third-topic` is in the state ClosingLedger. 
   `first-topic` and `second-topic` are in the state LedgerOpened
   
   `second-topic`'s cursor has markDeletePosition = 17525:0 and `readPosition` = 17594:9
        `third-topic`'s cursor has markDeletePosition = 17551:9 and `readPosition` = 17551:10
   
   So, the `third-topic`'s cursor's `readPosition` is adjacent to its markDeletePosition. 
   However, `second-topic`'s cursor's `readPosition` is farther ahead than `third-topic`'s `readPosition`. 
   
   Is that unusual for a downstream topic's cursor to have a `readPosition` farther ahead (larger number) than the `readPosition` of the topic immediately upstream from it when the downstream topic's only source of messages is that upstream topic and not more than a few hundred thousand messages have been sent through the pipe?
   
   Github won't let me attach .json files, so I had to make them txt files. In the attached zip, they have the .json extension for convenience when viewing.
   [first-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154489/first-topic-internal-stats.json.txt)
   [first-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154490/first-topic-stats.json.txt)
   [second-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154491/second-topic-internal-stats.json.txt)
   [second-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154492/second-topic-stats.json.txt)
   [third-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154493/third-topic-internal-stats.json.txt)
   [third-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154494/third-topic-stats.json.txt)
   
   [stats.zip](https://github.com/apache/pulsar/files/6154471/stats.zip)
   
   I've also attached the broker thread dump: [thread_dump_3-16.txt](https://github.com/apache/pulsar/files/6154506/thread_dump_3-16.txt)
   
   Regarding the heap dump, I can't attach that until I'm able to reproduce this bug with synthetic data, but in the meantime, if anyone wants me to look up specific things in the heap dump, I'll be happy to do that. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-1012085314


   > Which version of apache pulsar has this fix ?
   
   @skyrocknroll @marcioapm All required fixes are included in Apache Pulsar 2.7.4 and 2.8.2 versions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] sijie commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

sijie commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-658537755


   @devinbost did you happen to have a heap dump?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-800868920

I created a test cluster (on fast hardware) specifically for reproducing this issue. In our very simple function flow, using simple Java functions without external dependencies, on Pulsar **2.6.3**, as soon as we started flowing data (around 4k msg/sec at 140KB/msg average), within seconds the bug appeared (as expected), blocking up the flow, and causing a backlog to accumulate.

I looked up the broker the frozen topic was running on and got heap dumps and thread dumps of the broker and functions running on that broker.
There was nothing abnormal in the thread dumps that I could find. However, the topic stats and internal stats seem to have some clues.

I've attached the topic stats and internal topic stats of the topic upstream from the frozen topic, the frozen topic, and the topic immediately downstream from the frozen topic.
The flow looks like this:

-> `first-function` -> `first-topic` -> `second-function` -> `second-topic` -> `third-function` -> `third-topic` -> fourth-function ->

The `second-topic` is the one that froze and started accumulating backlog.

The `second-topic` reports -45 available permits for its consumer, `third-function`.
All three topics report 0 pendingReadOps.
`first-topic` and `third-topic` have waitingReadOp = true, indicating the subscriptions are waiting for messages.
`second-topic` has waitingReadOp = false, indicating its subscription hasn't caught up or isn't waiting for messages.
`second-topic` reports waitingCursorsCount = 0, so it has no cursors waiting for messages.

`third-topic` has pendingAddEntriesCount = 81, indicating it's waiting for write requests to complete.
`first-topic` and `second-topic` have pendingAddEntriesCount = 0

`third-topic` is in the state ClosingLedger.
`first-topic` and `second-topic` are in the state LedgerOpened

`second-topic`'s cursor has markDeletePosition = 17525:0 and readPosition = 17594:9
`third-topic`'s cursor has markDeletePosition = 17551:9 and readPosition = 17551:10

So, the `third-topic`'s cursor's readPosition is adjacent to its markDeletePosition.
However, `second-topic`'s cursor's readPosition is farther ahead than `third-topic`'s readPosition.

Is that unusual for a downstream topic's cursor to have a readPosition farther ahead (larger number) than the topic immediately upstream from it when the downstream topic's only source of messages is that upstream topic and not more than a few hundred thousand messages have been sent through the pipe?

It's silly... Github won't let me attach .json files, so I had to make them txt files. In the attached zip, they have the .json extension for convenience.
[first-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154489/first-topic-internal-stats.json.txt)
[first-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154490/first-topic-stats.json.txt)
[second-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154491/second-topic-internal-stats.json.txt)
[second-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154492/second-topic-stats.json.txt)
[third-topic-internal-stats.json.txt](https://github.com/apache/pulsar/files/6154493/third-topic-internal-stats.json.txt)
[third-topic-stats.json.txt](https://github.com/apache/pulsar/files/6154494/third-topic-stats.json.txt)

[stats.zip](https://github.com/apache/pulsar/files/6154471/stats.zip)

I've also attached the broker thread dump: [thread_dump_3-16.txt](https://github.com/apache/pulsar/files/6154506/thread_dump_3-16.txt)

Regarding the heap dump, I can't attach that until I'm able to reproduce this bug with synthetic data, but in the meantime, if anyone wants me to look up specific things in the heap dump, I'll be happy to do that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] hpvd commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

hpvd commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-765712086


   another chance to have this solved in upcoming v2.8 without having to find the dedicated reason, may result from all the work going on using spotbugs all through pulsar's code (including function) see https://github.com/apache/pulsar/issues?q=enable+spotbugs+
   Sometimes there is luck :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] sbourkeostk commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

sbourkeostk commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-803295399


   > After taking a closer look at #7266, it's not clear if it will resolve the issue because we're seeing this issue with functions running with exclusive subscriptions, not shared subscriptions.
   
   You sure? I didn't know it was possible to have have a function on an exclusive subscription.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost removed a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost removed a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-843692903


   To throw another curveball, I reproduced the freeze, and this time the backlog is _completely_ frozen. It's not moving at all. I took a set of heap dumps and thread dumps spaced out several seconds apart from each other, and I captured all processes on the broker involved. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839401880


   This issue needs to be reopened since that PR was only a partial fix and doesn't address all the failure scenarios. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839495126


   I traced the call chain for where ack's should be getting sent, and I filled in some additional steps:
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, which get picked up by `ServerCnx.handleSend(..)`, which goes to `Producer.publishMessage(..)`, then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, which polls `pendingAddEntries`, gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, which calls `MessagePublishContext.run()`, which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], which writes a `newSendReceiptCommand` to the channel.
   From there, the client gets the `SEND_RECEIPT` command from `PulsarDecoder`, which calls `[ClientCnx].handleSendReceipt(..)`, which calls `[ProducerImpl].ackReceived(..)`, which calls `releaseSemaphoreForSendOp(..)` to release the semaphore. 
   The semaphore doesn't lock until `maxPendingMessages` is reached, which is 1000 by default. 
   
   So, I added a lot of logging, and to my great surprise, I'm getting an NPE after adding these debug lines to `OpAddEntry.createOpAddEntry(..)`:
   
   ```
       private static OpAddEntry createOpAddEntry(ManagedLedgerImpl ml, ByteBuf data, AddEntryCallback callback, Object ctx) {
           log.debug("Running OpAddEntry.createOpAddEntry(..)");
           OpAddEntry op = RECYCLER.get();
           log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           op.ml = ml;
           op.ledger = null;
           op.data = data.retain();
           op.dataLength = data.readableBytes();
           op.callback = callback;
           op.ctx = ctx;
           op.addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
           op.closeWhenDone = false;
           op.entryId = -1;
           op.startTime = System.nanoTime();
           op.state = State.OPEN;
           ml.mbean.addAddEntrySample(op.dataLength);
           log.debug("2. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           return op;
       }
   ```
   
   Here's the stack trace:
   
   ```
      2021-05-12T06:14:19,569 [pulsar-io-28-32] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
       2021-05-12T06:14:19,569 [pulsar-io-28-32] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.20.20.160:37696] Got exception java.lang.NullPointerException
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.toString(OpAddEntry.java:354)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.createOpAddEntry(OpAddEntry.java:100)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.create(OpAddEntry.java:81)
       	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncAddEntry(ManagedLedgerImpl.java:689)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.asyncAddEntry(PersistentTopic.java:428)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.publishMessage(PersistentTopic.java:404)
       	at org.apache.pulsar.broker.service.Producer.publishMessageToTopic(Producer.java:224)
       	at org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:161)
       	at org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:1372)
       	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:207)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
       	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1376)
       	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1265)
       	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1302)
       	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
       	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
       	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
       	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
       	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
       	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       	at java.lang.Thread.run(Thread.java:748)
   ```
   
   The only place we could be throwing an NPE is here:
   `log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);`
   
   So, `op` isn't null until the null check passes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839495126


   I traced the call chain for where ack's should be getting sent, and I filled in some additional steps:
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, which get picked up by `ServerCnx.handleSend(..)`, which goes to `Producer.publishMessage(..)`, then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, which polls `pendingAddEntries`, gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, which calls `MessagePublishContext.run()`, which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], which writes a `newSendReceiptCommand` to the channel.
   From there, the client gets the `SEND_RECEIPT` command from `PulsarDecoder`, which calls `[ClientCnx].handleSendReceipt(..)`, which calls `[ProducerImpl].ackReceived(..)`, which calls `releaseSemaphoreForSendOp(..)` to release the semaphore. 
   
   So, I added a lot of logging, and to my great surprise, I'm getting an NPE after adding these debug lines to `OpAddEntry.createOpAddEntry(..)`:
   
   ```
       private static OpAddEntry createOpAddEntry(ManagedLedgerImpl ml, ByteBuf data, AddEntryCallback callback, Object ctx) {
           log.debug("Running OpAddEntry.createOpAddEntry(..)");
           OpAddEntry op = RECYCLER.get();
           log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           op.ml = ml;
           op.ledger = null;
           op.data = data.retain();
           op.dataLength = data.readableBytes();
           op.callback = callback;
           op.ctx = ctx;
           op.addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
           op.closeWhenDone = false;
           op.entryId = -1;
           op.startTime = System.nanoTime();
           op.state = State.OPEN;
           ml.mbean.addAddEntrySample(op.dataLength);
           log.debug("2. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           return op;
       }
   ```
   
   Here's the stack trace:
   
   ```
      2021-05-12T06:14:19,569 [pulsar-io-28-32] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
       2021-05-12T06:14:19,569 [pulsar-io-28-32] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.20.20.160:37696] Got exception java.lang.NullPointerException
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.toString(OpAddEntry.java:354)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.createOpAddEntry(OpAddEntry.java:100)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.create(OpAddEntry.java:81)
       	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncAddEntry(ManagedLedgerImpl.java:689)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.asyncAddEntry(PersistentTopic.java:428)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.publishMessage(PersistentTopic.java:404)
       	at org.apache.pulsar.broker.service.Producer.publishMessageToTopic(Producer.java:224)
       	at org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:161)
       	at org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:1372)
       	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:207)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
       	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1376)
       	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1265)
       	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1302)
       	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
       	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
       	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
       	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
       	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
       	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       	at java.lang.Thread.run(Thread.java:748)
   ```
   
   The only place we could be throwing an NPE is here:
   `log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);`
   
   So, `op` isn't null until the null check passes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-844625376


   To throw a curveball, I was able to reproduce this issue where the subscription is completely frozen while using the function from the code shared above. 
   ![image](https://user-images.githubusercontent.com/7418031/118907664-2c924a00-b8dd-11eb-9ac5-041b8f0856c2.png)
   Interestingly, the consumer has negative permits on all four partitions, but the function isn't doing anything. 
   So, that leads me to believe (again) that pendingAcks have maxed out the semaphore, causing the function to block while it waits for ack's from the broker. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] hpvd commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

hpvd commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-765702818


   Since it's hard to reproduce, does not happen often and is still there for many versions...
   just to give it a try: does this happen on different servers/hardware (not the same part involved in every appearance of the problem)? On other places especially freezing also occur from e.g. memory bit flips...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842595662


   > @lhotari I've been creating fresh builds on master (or branches of master with extra logging) every couple of days to test changes on this issue.
   > 
   > I was pulling the thread dumps out of my heap dumps using VisualVM. I can use jstack instead if that will give a better picture.
   
   ok, I see. That method is fine, I just thought that it's a thread dump of a process which has a debugger attached to it (that's not so good since debugging can change the execution). 
   It will be useful to have a 3 thread dumps, a few seconds apart. That will help determine if there are threads that are blocked.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-830477787


   Update on findings:
   
   We discovered that this issue is multi-faceted. 
   
   1. The first issue was that the subscription was getting stuck where there were permits > 0, pendingReadOps = 0, and the cursor's read-position was not advancing despite an existing backlog. That issue was mitigated by https://github.com/apache/pulsar/pull/9789 though the root cause of why the cursor wasn't advancing is still unknown. It's possible that some code branch that should be calling `PersistentDispatcherMultipleConsumers.readMoreEntries()` isn't doing so. We also discovered that `readMoreEntries()` wasn't synchronized, which could be causing concurrency issues from preventing the reads or causing permit issues. Any such concurrency issues should be resolved now by https://github.com/apache/pulsar/pull/10413 . After those changes, I wasn't able to reproduce the frozen subscription except when consuming from a partitioned topic using a Pulsar function with several parallel function instances. 
   
   2. The second issue is that when consuming from a partitioned, persistent topic using a Pulsar function (on a shared subscription) with parallelism > 1, messages were not being dispatched properly. There were two parts to that issue:
   
   2.1. The first part of the issue was preventing messages from dispatching to functions that have permits > 0. What was happening is that the dispatcher was failing to dispatch to consumers that were reading more than the other consumers (e.g. where one consumer had permits > 0 and the others had permits <= 0.) So, some consumers were just waiting around and not processing messages until the other consumers caught up. That issue was fixed by https://github.com/apache/pulsar/pull/10417
   
   2.2. The second part of the issue is that due to current message batching behavior, too many messages were being dispatched, which resulted in negative permits on the dispatcher. That issue will be fixed by #7266.
   
   3. After applying #10417, #10413, and #9789 in a custom build of Pulsar, we resolved the dispatching issues and discovered a new problem: When consuming from a partitioned, persistent topic using a Pulsar function (on a shared subscription) with parallelism > 1, we discovered that functions with permits <= 0 are sitting around doing nothing despite having received messages from the broker into the `incomingMessages` queue on `ConsumerBase`.  
   A thread-dump revealed that it appears the functions are stuck waiting for the release of a semaphore during `ProducerImpl.sendAsync()`:
   
   ```
   "myTenant/myNamespace/function-filter-0" prio=5 tid=32 WAITING
       at sun.misc.Unsafe.park(Native Method)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
          local variable: java.util.concurrent.locks.AbstractQueuedSynchronizer$Node#40
          local variable: java.util.concurrent.locks.AbstractQueuedSynchronizer$Node#41
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
          local variable: java.util.concurrent.Semaphore$FairSync#1
       at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
       at org.apache.pulsar.client.impl.ProducerImpl.canEnqueueRequest(ProducerImpl.java:748)
       at org.apache.pulsar.client.impl.ProducerImpl.sendAsync(ProducerImpl.java:391)
          local variable: org.apache.pulsar.client.impl.ProducerImpl#1
          local variable: org.apache.pulsar.client.impl.ProducerImpl$1#1
          local variable: io.netty.buffer.UnpooledHeapByteBuf#17
          local variable: org.apache.pulsar.common.api.proto.MessageMetadata#83
          local variable: org.apache.pulsar.client.impl.MessageImpl#82
       at org.apache.pulsar.client.impl.ProducerImpl.internalSendAsync(ProducerImpl.java:290)
          local variable: java.util.concurrent.CompletableFuture#33
       at org.apache.pulsar.client.impl.TypedMessageBuilderImpl.sendAsync(TypedMessageBuilderImpl.java:103)
       at org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtLeastOnceProcessor.sendOutputMessage(PulsarSink.java:274)
          local variable: org.apache.pulsar.functions.sink.PulsarSink$PulsarSinkAtLeastOnceProcessor#1
          local variable: org.apache.pulsar.functions.instance.SinkRecord#2
       at org.apache.pulsar.functions.sink.PulsarSink.write(PulsarSink.java:393)
       at org.apache.pulsar.functions.instance.JavaInstanceRunnable.sendOutputMessage(JavaInstanceRunnable.java:349)
       at org.apache.pulsar.functions.instance.JavaInstanceRunnable.lambda$processResult$0(JavaInstanceRunnable.java:331)
       at org.apache.pulsar.functions.instance.JavaInstanceRunnable$$Lambda$193.accept(<unknown string>)
       at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
          local variable: org.apache.pulsar.functions.instance.JavaExecutionResult#1
       at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
          local variable: org.apache.pulsar.functions.instance.JavaInstanceRunnable$$Lambda$193#1
          local variable: java.util.concurrent.CompletableFuture#35
       at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
       at org.apache.pulsar.functions.instance.JavaInstanceRunnable.processResult(JavaInstanceRunnable.java:322)
       at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:275)
          local variable: org.apache.pulsar.functions.instance.JavaInstanceRunnable#1
          local variable: java.util.concurrent.CompletableFuture#34
       at java.lang.Thread.run(Thread.java:748)
   ```
   After disabling batching on functions in a custom build, the issue disappeared. However, performance is severely degraded when batching is disabled, so it's not a viable workaround. 
   The current behavior expects the semaphore to block the thread since `blockIfQueueFullDisabled = true` by default in functions. So, that implies the producing queue is full. 
   
   So, the remaining mysteries are:
   Why is the producer-queue full? (Why is the function not getting ack from the broker?)
   Or, if the function is getting ack from the broker, why is the pulsar-client not batching and sending? 
   
   However, we need to confirm if the thread responsible for consuming is indeed blocked by producing. 
   (@codelipenghui or @jerrypeng , do either of you know? )
   
   Many thanks to @rdhabalia for helping to deep dive on this issue. 
   Also thanks to @lhotari for helping with the concurrency side of things. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] jerrypeng commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

jerrypeng commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-801529366


   > Do they batch by default?
   
   Yes, batching at the pulsar client level is enabled when a Pulsar function is writing messages to the "output" topic.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842595923


   > Are there JVM parameters that I should or shouldn't be including when starting the brokers?
   
   I guess the defaults are fine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-848968627


   Many thanks to @lhotari for figuring out how to reproduce this bug. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-873367594


   I mapped out more of the ack flow, so I will add it to what I documented [here](https://github.com/apache/pulsar/issues/6054#issuecomment-839495126) and make it more readable. 
   
   The bracket notation below is intended to specify an instance of the class (to distinguish from a static method call.)
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, 
   which get picked up by `ServerCnx.handleSend(..)`, 
   which goes to `Producer.publishMessage(..)`, 
   then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, 
   which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, 
   which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, 
   which polls `pendingAddEntries`, 
   gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) 
   and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, 
   which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, 
   which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, 
   which calls `MessagePublishContext.run()`, 
   which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], 
   which writes a `newSendReceiptCommand` to the channel.
   
   From there, the client gets the `SEND_RECEIPT` command from `PulsarDecoder`, 
   which calls `[ClientCnx].handleSendReceipt(..)`, 
   which calls `[ProducerImpl].ackReceived(..)`, 
   which calls `releaseSemaphoreForSendOp(..)` to release the semaphore. 
   The semaphore doesn't block until `maxPendingMessages` is reached, which is 1000 by default. 
   
   
   After a rollover, `pendingAddEntries` is also polled. (Could this access race with the other path that's polling `pendingAddEntries`?) Here's how this happens: 
   
   Something triggers `[ManagedLedgerImpl].createComplete(..)`, 
   which calls `[ManagedLedgerImpl].updateLedgersListAfterRollover(MetaStoreCallback<Void>)`,
   which calls `[MetaStoreImpl].asyncUpdateLedgerIds(..)`,
   which calls `callback.operationComplete(..)` on the `MetaStoreCallback<Void>`,
   which calls `[ManagedLedgerImpl].updateLedgersIdsComplete(..)`,
   which polls for `pendingAddEntries`
   which calls `op.initiate()` on each `OpAddEntry` in `pendingAddEntries`,
   which calls `ledger.asyncAddEntry(..)`,
   which calls `[LedgerHandle].asyncAddEntry(..)`,
   which calls `[LedgerHandle].doAsyncAddEntry(op)`,
     which adds the `op` to `pendingAddOps` on the `LedgerHandle` instance,
   which calls `[PendingAddOp].cb.addCompleteWithLatency(..)`
   which calls the callback on `[PendingAddOp]`, which is `[OpAddEntry]`
   which calls `addCompleteWithLatency` on the `AddCallback` interface,
   which calls `[OpAddEntry].addComplete(..)`
   which, if it succeeded:
        changes its state and recycles it if completed or if there was a failure, it triggers `[OpAddEntry].safeRun()`
       which removes itself from `[ManagedLedgerImpl].pendingAddEntries`
       and triggers `cb.addComplete(..)` on `[PersistentTopic]`,
       which triggers `publishContext.completed(..)` on `[Producer.MessagePublishContext]`
       which triggers `producer.publishOperationCompleted()`
       which decrements `pendingPublishAcks`
   
   
   It's still not clear where the ack's are getting lost, so there must be another part of the flow that I'm missing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-848968627


   Many thanks to @lhotari for figuring out how to reproduce this bug outside of my environment. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818463224


   The permits seem to jump to negative numbers in the broker debug logs with very little explanation.
   
   If I filter out ZK logs, after this permit-related line:
   
   > Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,442 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-1-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-1, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: 419), blocked = false
   
   I see some ManagedLedgerImpl logs, including this one:
   
   > Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,442 [bookkeeper-ml-workers-OrderedExecutor-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/topic-1] Resending 100 pending messages
   
   Then, there are more ManagedLedgerImpl and some OpReadEntry logs.
   Then, there are EntryCacheImpl and ManagedLedgerImpl logs.
   Then, I get this:
   
   > Apr 12 18:56:22 fab08 8f9131691776: 2021-04-13T00:56:22,301 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-1-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-1, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: -14), blocked = false
   
   The next time I get a `Distributing x messages to n consumers` involving that topic, it says:
   `Distributing 49 messages to 1 consumers`, which is immediately followed by:
   `No consumers found with available permits, storing 29 positions for later replay`
   
   So, it's like the permits are decrementing in some way that's not reflected in the logs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818486009






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] marcioapm commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

marcioapm commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-1011236519


   > This bug has been resolved in DataStax Luna Streaming 2.7.2_1.1.21
   
   Hi @devinbost is this is 2.9.1?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840894974


   I noticed that in almost every case where the broker is getting a SEND command, it's producing a SEND_RECEIPT command:
   ```
   $ cat messages_broker10.log | grep 'Received cmd SEND' | wc -l
      20806
   $ cat messages_broker10.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      20806
   $ cat messages_broker8.log | grep 'Received cmd SEND' | wc -l
      44654
   $ cat messages_broker8.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      44651
   ```
   I assumed I'd see a gap between them, but unless something is creating some kind of logical deadlock that's causing it to stop emitting logs in between those commands, it seems like the problem is either before the SEND command or after the SEND_RECEIPT command. 
   
   With that said, I noticed that the `lastConfirmedEntry` isn't progressing on one of the threads after the freeze occurred:
   
   ```
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10636 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10672 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10696 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10747 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10775 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10783 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10815 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10840 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10886 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10915 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10916 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10917 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10918 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10919 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10920 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10944 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10945 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10948 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10952 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10954 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10957 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10959 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10963 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10974 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10977 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10982 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:10984 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11029 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11036 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11039 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11051 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11053 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11065 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11072 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11077 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11086 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11179 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11190 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11199 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11239 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11252 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11258 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11264 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11358 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11359 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11385 -- last: 3215336:11436
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - IsValid position: 3215336:11437 -- last: 3215336:11436
   ```
   
   When I say "freeze," that may be misleading because it's not a complete lockup. We still get messages to produce, but it's so slow that it's almost equivalent to being frozen. As a comparison, here's one of the functions when it's healthy:
   ![image](https://user-images.githubusercontent.com/7418031/118199967-6365ed00-b411-11eb-871c-7b7f8f186404.png)
   
   Here's that same function when it's "frozen":
   ![image](https://user-images.githubusercontent.com/7418031/118200011-78428080-b411-11eb-80ee-c802471b3c4c.png)
   
   So, over 1 minute, it's processing 71 messages instead of 160,851 messages.
   
   I also noticed this warning in the logs after it had been stuck for a while, but I'm not sure if it's related:
   
   ```
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,634 [main-SendThread(fab10.umf.prod.ostk.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd038f for sever fab10.umf.prod.ostk.com/10.20.69.29:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:16 fab10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 10302ms for session id 0x1028ed903dd038f
   May 12 21:51:16 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,634 [main-SendThread(fab10.umf.prod.ostk.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever fab10.umf.prod.ostk.com/10.20.69.29:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:16 fab10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 12554ms for session id 0x1028ed903dd0390
   May 12 21:51:16 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:16 fab10 cb3b837bde84: 2021-05-13T03:51:16,635 [NIOWorkerThread-72] WARN org.apache.zookeeper.server.NIOServerCnxn - Unable to read additional data from client sessionid 0x1028ed903dd038f, likely client has closed socket
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,637 [pulsar-io-28-20] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.40:44725
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,637 [pulsar-io-28-21] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.41:58126
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,639 [pulsar-io-28-19] INFO org.apache.pulsar.broker.service.ServerCnx - Closed connection from /10.20.17.41:45541
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:36462] Received cmd PING
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x1f7c0cfe, L:/10.20.69.29:6650 - R:/10.20.69.28:36462]] Replying back to ping message
   May 12 21:51:16 fab10 3ed25e05b45a: 2021-05-13T03:51:16,780 [pulsar-io-28-15] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:36462] Received cmd PONG
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,036 [pulsar-io-28-19] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xb055a726, L:/10.20.69.29:6651 - R:/10.20.69.28:36414]] Sending ping message
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-SendThread(fab09.umf.prod.ostk.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x1028ed903dd0390 has expired
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-EventThread] ERROR org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client connection to the ZooKeeper server has expired!
   May 12 21:51:17 fab10 3ed25e05b45a: 2021-05-13T03:51:17,551 [main-SendThread(fab09.umf.prod.ostk.com:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x1028ed903dd0390 for sever fab09.umf.prod.ostk.com/10.20.69.28:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
   May 12 21:51:17 fab10 3ed25e05b45a: org.apache.zookeeper.ClientCnxn$SessionExpiredException: Unable to reconnect to ZooKeeper service, session 0x1028ed903dd0390 has expired
   May 12 21:51:17 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1419) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocket.readConnectResult(ClientCnxnSocket.java:154) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:86) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 12 21:51:17 fab10 3ed25e05b45a: #011at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) [org.apache.pulsar-pulsar-zookeeper-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818400418


   I extracted this permit information from one of the brokers after enabling debug mode:
   
   >     Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,441 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/10.20.69.29:55956] Received flow from consumer 0 permits: 500
   >     Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,441 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant/myNamespace/topic-2 / myTenant/myNamespace/function-2-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-2, name=myTenant/myNamespace/function-2}, consumerId=0, consumerName=c44ef, address=/10.20.69.29:55956}] Trigger new read after receiving flow control message with permits 919
   >     Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,442 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-2-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-2, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: 419), blocked = false
   >     Apr 12 18:56:22 fab08 8f9131691776: 2021-04-13T00:56:22,301 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-2-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-2, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: -14), blocked = false
   >     Apr 12 18:56:52 fab08 8f9131691776: 2021-04-13T00:56:52,502 [bookkeeper-ml-workers-OrderedExecutor-1-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://feeds/myTenant/topic-1-partition-1 / myTenant/myNamespace/function-1] No consumers found with available permits, storing 29 positions for later replay
   >     Apr 12 19:00:52 fab08 8f9131691776: 2021-04-13T01:00:52,488 [bookkeeper-ml-workers-OrderedExecutor-7-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant/myNamespace/topic-2 / myTenant/myNamespace/function-2] No consumers found with available permits, storing 67 positions for later replay


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] kemburi commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

kemburi commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-742459754


   Guys any one, or can any provide info on how to reproduce this issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-841851244

> I also noticed this warning in the logs after it had been stuck for a while, but I'm not sure if it's related:

@devinbost Very interesting that there are some Zookeeper related issues happening at the same time (`org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 10302ms for session id 0x1028ed903dd038f`) .

It would be very useful to get a thread dump when this is going on. My assumption is that this type of problems aren't really a Zookeeper server issue, but Zookeeper client thread getting blocked which causes timeouts. There's one sign of this type of problem in issue #10418 .

For creating the thread dumps, please use `jstack`/`jcmd`/`kill -3`. The earlier thread dumps looked like they were created with the debugger. It's better to reproduce the issue and create thread dumps without the debugger attached since the JVM can behave differently in debugging mode. How were the previous thread dumps created?

What version of Pulsar are you running? (can you reference a Git SHA for the version if it's based on master branch?)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-864600403


   What are your broker.conf settings? And, how many messages are you
   processing? We've noticed that some settings increase the frequency of
   this, but I'm still working on investigating the root cause.
   
   Something is causing the broker to not ack messages, but we're not sure
   about the exact cause yet.
   --
   Devin G. Bost
   
   On Sun, Jun 20, 2021, 4:24 AM Enrico Olivelli ***@***.***>
   wrote:
   
   > @JohnMops <https://github.com/JohnMops> did you update to Pulsar 2.8.0?
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/pulsar/issues/6054#issuecomment-864531877>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/ABYTBLYUKN5YVYDBOUZKSFTTTW6WPANCNFSM4KGXAXLQ>
   > .
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-827768089


   I was able to reproduce this issue on a cluster with a single broker and a partitioned topic (with 4 partitions) sending messages to a single filter function that was also running on that broker. I was also able to reproduce it using a non-partitioned topic after allowing backlog to accumulate (by stopping the function) and then resuming the function after several thousand messages accumulated on the backlog. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840535875


   > In the unhealthy case, we don't see that at all. I'm not even seeing activity on that thread. The only activity in the logs that I'm seeing on that broker is this:
   
   @devinbost interesting detail. would it be possible to share a thread dump of the broker in this state?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842626569


   I've added a lot of logging in this build: https://github.com/devinbost/pulsar/tree/debug_acks
   
   My log4j2.yaml file looks like this:
   
   ```
   Configuration:
     status: DEBUG
     monitorInterval: 30
     name: pulsar
     packages: io.prometheus.client.log4j2
   
     Properties:
       Property:
         - name: "pulsar.log.dir"
           value: "logs"
         - name: "pulsar.log.file"
           value: "pulsar.log"
         - name: "pulsar.log.appender"
           value: "RoutingAppender"
         - name: "pulsar.log.level"
           value: "trace"
         - name: "pulsar.routing.appender.default"
           value: "Console"
   
     # Example: logger-filter script
     Scripts:
       ScriptFile:
         name: filter.js
         language: JavaScript
         path: ./conf/log4j2-scripts/filter.js
         charset: UTF-8
   
     Appenders:
   
       # Console
       Console:
         name: Console
         target: SYSTEM_OUT
         PatternLayout:
           Pattern: "%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"
   
       # Rolling file appender configuration
       RollingFile:
         name: RollingFile
         fileName: "${sys:pulsar.log.dir}/${sys:pulsar.log.file}"
         filePattern: "${sys:pulsar.log.dir}/${sys:pulsar.log.file}-%d{MM-dd-yyyy}-%i.log.gz"
         immediateFlush: false
         PatternLayout:
           Pattern: "%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"
         Policies:
           TimeBasedTriggeringPolicy:
             interval: 1
             modulate: true
           SizeBasedTriggeringPolicy:
             size: 1 GB
         # Delete file older than 30days
         DefaultRolloverStrategy:
             Delete:
               basePath: ${sys:pulsar.log.dir}
               maxDepth: 2
               IfFileName:
                 glob: "*/${sys:pulsar.log.file}*log.gz"
               IfLastModified:
                 age: 30d
   
       Prometheus:
         name: Prometheus
   
       # Routing
       Routing:
         name: RoutingAppender
         Routes:
           pattern: "$${ctx:function}"
           Route:
             -
               Routing:
                 name: InstanceRoutingAppender
                 Routes:
                   pattern: "$${ctx:instance}"
                   Route:
                     -
                       RollingFile:
                         name: "Rolling-${ctx:function}"
                         fileName : "${sys:pulsar.log.dir}/functions/${ctx:function}/${ctx:functionname}-${ctx:instance}.log"
                         filePattern : "${sys:pulsar.log.dir}/functions/${sys:pulsar.log.file}-${ctx:instance}-%d{MM-dd-yyyy}-%i.log.gz"
                         PatternLayout:
                           Pattern: "%d{ABSOLUTE} %level{length=5} [%thread] [instance: %X{instance}] %logger{1} - %msg%n"
                         Policies:
                           TimeBasedTriggeringPolicy:
                             interval: 1
                             modulate: true
                           SizeBasedTriggeringPolicy:
                             size: "20MB"
                           # Trigger every day at midnight that also scan
                           # roll-over strategy that deletes older file
                           CronTriggeringPolicy:
                             schedule: "0 0 0 * * ?"
                         # Delete file older than 30days
                         DefaultRolloverStrategy:
                             Delete:
                               basePath: ${sys:pulsar.log.dir}
                               maxDepth: 2
                               IfFileName:
                                 glob: "*/${sys:pulsar.log.file}*log.gz"
                               IfLastModified:
                                 age: 30d
                     - ref: "${sys:pulsar.routing.appender.default}"
                       key: "${ctx:function}"
             - ref: "${sys:pulsar.routing.appender.default}"
               key: "${ctx:function}"
   
     Loggers:
   
       # Default root logger configuration
       Root:
         level: warn
         additivity: true
         AppenderRef:
           - ref: "${sys:pulsar.log.appender}"
             level: "trace"
           - ref: Prometheus
             level: warn
   
       Logger:
         - name: org.apache.pulsar.broker.service
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.pulsar.common
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.pulsar.client
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.bookkeeper.bookie.BookieShell
           level: warn
           additivity: false
           AppenderRef:
             - ref: Console
         - name: verbose
           level: warn
           additivity: false
           AppenderRef:
             - ref: Console
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-801609488


   After taking a closer look at #7266, it's not clear if it will resolve the issue because we're seeing this issue with functions running with exclusive subscriptions, not shared subscriptions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] kemburi edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

kemburi edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-741621140


   pulsar client version 2.6.1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818486009


   One possibility is that we're passing the wrong value (or maybe there's a race condition) when we get the number of messages when updating the permits. 
   We use the value [here](https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L517)
   which is set [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L152)
   from [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L125)
   which is obtained [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L88)
   
   that value, `msgMetadata.getNumMessagesInBatch()`, comes from here:
   ```
       public static MessageMetadata peekMessageMetadata(ByteBuf metadataAndPayload, String subscription,
               long consumerId) {
           try {
               // save the reader index and restore after parsing
               int readerIdx = metadataAndPayload.readerIndex();
               skipBrokerEntryMetadataIfExist(metadataAndPayload);
               MessageMetadata metadata = Commands.parseMessageMetadata(metadataAndPayload);
               metadataAndPayload.readerIndex(readerIdx);
   
               return metadata;
           } catch (Throwable t) {
               log.error("[{}] [{}] Failed to parse message metadata", subscription, consumerId, t);
               return null;
           }
       }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-834127635


   I noticed a pattern when the freeze occurs. After adding some additional debug lines, we get a call to `PersistentTopic.asyncAddEntry()`, and queue an `addEntry` request in `ManagedLedgerImpl`, and then we get 2,000 or more lines like this:
   
   ```
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,245 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] Ledger already created when timeout task is triggered
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,245 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] createComplete rc=-23 ledger=-1
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,252 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] Ledger already created when timeout task is triggered
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,252 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] createComplete rc=-23 ledger=-1
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,259 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] Ledger already created when timeout task is triggered
   May  6 22:18:02 fab08 a7d8f24158d5: 2021-05-07T04:18:02,259 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myFunction/persistent/filter-function] createComplete rc=-23 ledger=-1
   ```
   After that, we get several thousand lines with the ping/pong cycle like this with almost no other broker activity on the affected broker:
   
   ```
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,517 [pulsar-io-28-54] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.29:49766] Received cmd PING
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,517 [pulsar-io-28-54] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x5d25db1f, L:/10.20.69.37:6650 - R:/10.20.69.29:49766]] Replying back to ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,519 [pulsar-io-28-54] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x5d25db1f, L:/10.20.69.37:6650 - R:/10.20.69.29:49766]] Sending ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,519 [pulsar-io-28-54] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.29:49766] Received cmd PONG
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,760 [pulsar-io-28-55] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x9c75d310, L:/10.20.69.37:6650 - R:/10.20.69.37:51364]] Sending ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,762 [pulsar-io-28-55] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51364] Received cmd PONG
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,764 [pulsar-io-28-55] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51364] Received cmd PING
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,764 [pulsar-io-28-55] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x9c75d310, L:/10.20.69.37:6650 - R:/10.20.69.37:51364]] Replying back to ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,968 [pulsar-io-28-58] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:60676] Received cmd PING
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,968 [pulsar-io-28-58] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x4465b5a6, L:/10.20.69.37:6651 - R:/10.20.69.28:60676]] Replying back to ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,969 [pulsar-io-28-58] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x4465b5a6, L:/10.20.69.37:6651 - R:/10.20.69.28:60676]] Sending ping message
   May  6 22:19:20 fab08 a7d8f24158d5: 2021-05-07T04:19:20,969 [pulsar-io-28-58] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:60676] Received cmd PONG
   May  6 22:19:21 fab08 a7d8f24158d5: 2021-05-07T04:19:21,178 [pulsar-io-28-61] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x8ca7517a, L:/10.20.69.37:6650 - R:/10.20.69.37:51370]] Sending ping message
   May  6 22:19:21 fab08 a7d8f24158d5: 2021-05-07T04:19:21,178 [pulsar-io-28-61] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51370] Received cmd PONG
   May  6 22:19:21 fab08 a7d8f24158d5: 2021-05-07T04:19:21,181 [pulsar-io-28-61] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51370] Received cmd PING
   May  6 22:19:21 fab08 a7d8f24158d5: 2021-05-07T04:19:21,181 [pulsar-io-28-61] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x8ca7517a, L:/10.20.69.37:6650 - R:/10.20.69.37:51370]] Replying back to ping message
   May  6 22:19:22 fab08 a7d8f24158d5: 2021-05-07T04:19:22,910 [pulsar-io-28-63] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xfb5df845, L:/10.20.69.37:6650 - R:/10.20.69.37:51376]] Sending ping message
   May  6 22:19:22 fab08 a7d8f24158d5: 2021-05-07T04:19:22,911 [pulsar-io-28-63] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51376] Received cmd PONG
   May  6 22:19:22 fab08 a7d8f24158d5: 2021-05-07T04:19:22,913 [pulsar-io-28-63] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.37:51376] Received cmd PING
   May  6 22:19:22 fab08 a7d8f24158d5: 2021-05-07T04:19:22,913 [pulsar-io-28-63] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xfb5df845, L:/10.20.69.37:6650 - R:/10.20.69.37:51376]] Replying back to ping message
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,148 [pulsar-io-28-65] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:54664] Received cmd PING
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,148 [pulsar-io-28-65] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xf14980c1, L:/10.20.69.37:6650 - R:/10.20.69.28:54664]] Replying back to ping message
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,148 [pulsar-io-28-65] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0xf14980c1, L:/10.20.69.37:6650 - R:/10.20.69.28:54664]] Sending ping message
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,150 [pulsar-io-28-65] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.69.28:54664] Received cmd PONG
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,540 [pulsar-io-28-13] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x3e7436d4, L:/10.20.69.37:6651 - R:/10.20.17.40:51699]] Sending ping message
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,802 [pulsar-io-28-13] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.17.40:51699] Received cmd PING
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,802 [pulsar-io-28-13] DEBUG org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x3e7436d4, L:/10.20.69.37:6651 - R:/10.20.17.40:51699]] Replying back to ping message
   May  6 22:19:23 fab08 a7d8f24158d5: 2021-05-07T04:19:23,802 [pulsar-io-28-13] DEBUG org.apache.pulsar.common.protocol.PulsarDecoder - [/10.20.17.40:51699] Received cmd PONG
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] skyrocknroll commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

skyrocknroll commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-996944164


   @devinbost Which version of apache pulsar has this fix ? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-996804728


   This bug has been resolved in DataStax Luna Streaming 2.7.2_1.1.21


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842582829


   Are there JVM parameters that I should or shouldn't be including when starting the brokers?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818463224


   The permits seem to jump to negative numbers in the broker debug logs with very little explanation. (The logs below merge together the logs from all three brokers in this test cluster.)
   
   If I filter out ZK logs, after this permit-related line:
   
   > Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,442 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-1-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-1, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: 419), blocked = false
   
   I see some ManagedLedgerImpl logs, including this one:
   
   > Apr 12 18:55:52 fab08 8f9131691776: 2021-04-13T00:55:52,442 [bookkeeper-ml-workers-OrderedExecutor-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/topic-1] Resending 100 pending messages
   
   Then, there are more ManagedLedgerImpl and some OpReadEntry logs.
   Then, there are EntryCacheImpl and ManagedLedgerImpl logs.
   Then, I get this:
   
   > Apr 12 18:56:22 fab08 8f9131691776: 2021-04-13T00:56:22,301 [pulsar-io-25-64] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant/myNamespace/topic-1-PersistentSubscription{topic=persistent://myTenant/myNamespace/topic-1, name=myTenant/myNamespace/function-2}] Added more flow control message permits 500 (old was: -14), blocked = false
   
   The next time I get a `Distributing x messages to n consumers` involving that topic, it says:
   `Distributing 49 messages to 1 consumers`, which is immediately followed by:
   `No consumers found with available permits, storing 29 positions for later replay`
   
   So, it's like the permits are decrementing in some way that's not reflected in the logs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-843675423


   @lhotari I reproduced the issue with using this as the Pulsar Function:
   
   ```
   public class RandomFilterFunction implements Function<String, String> {
   
       private Random rand = new Random();
   
       @Override
       public String process(String input, Context context) {
   	    double passFraction = Double.parseDouble((String)context.getUserConfigValueOrDefault("pass-fraction", "0.5"));
   	    if (rand.nextDouble() < passFraction) {
               return input;
           } else {
               return null;
   	    }
       }
   }
   ```
   I chained one function (with parallelism of 4) to another function (with parallelism of 4), both using the same class. 
   The first function in the flow has a `pass-fraction` of "0.3", and the other one is default. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839507871


   @lhotari pointed out that there's a bug in the toString() method for `OpAddEntry` https://github.com/apache/pulsar/blob/72962742aecf953fffabfdcfacbdcbb70a1ebba2/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L345


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-658404866


   If it's a clue, it only happens on topics involving functions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-823815763


   I was able to demonstrate the issue with the new debug statements currently in a build I created from master (2.8.0-SNAPSHOT). If you study the logs below, you will see that more permits were dispatched than available, which blocks dispatching of new messages and results in repeated entries of this message:
   `Trigger new read after receiving flow control message with permits -11 after adding 0 permits`
   
   ```
   2021-04-21T05:40:47,933 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 500 message permits in broker.service.Consumer before updating dispatcher for consumer
   2021-04-21T05:40:47,933 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits 596 after adding 500 permits
   2021-04-21T05:40:47,933 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added more flow control message permits 500 (old was: 96), blocked = false
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(248 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(249 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(14 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(10 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(14 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(12 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(13 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(1 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(15 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,942 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(15 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:47,943 [bookkeeper-ml-workers-OrderedExecutor-0-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1] Added -(16 minus 0) permits to TOTAL_AVAILABLE_PERMITS_UPDATER in PersistentDispatcherMultipleConsumers
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.Consumer - [persistent://myTenant2/myNamespace/topic-1-partition-0-PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}] Added 0 blocked permits to broker.service.Consumer for consumer
   2021-04-21T05:40:48,028 [pulsar-io-28-47] DEBUG org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://myTenant2/myNamespace/topic-1-partition-0 / myTenant/myNamespace/function1-Consumer{subscription=PersistentSubscription{topic=persistent://myTenant2/myNamespace/topic-1-partition-0, name=myTenant/myNamespace/function1}, consumerId=0, consumerName=fa0ee, address=/10.20.69.28:55028}] Trigger new read after receiving flow control message with permits -11 after adding 0 permits
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-815382279


   > > After taking a closer look at #7266, it's not clear if it will resolve the issue because we're seeing this issue with functions running with exclusive subscriptions, not shared subscriptions.
   > 
   > You sure? I didn't know it was possible to have a function on an exclusive subscription.
   
   I double-checked, and you were right that we're using shared subscriptions. However, with the way we're using the functions, each function has its own subscription, so we aren't routing messages between multiple consumers of the same shared subscription. It looks like the behavior mentioned in #7266 applies primarily to cases involving multiple consumers on a single shared subscription. 
   
   Regardless, I noticed that we're not checking for success when sending permits from the consumer to the broker (https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L857), and every time we attempt to send permits from the consumer to the broker, we reset the consumer's permits (https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L1368). So, it's possible that if there's a communication problem between the consumer and the broker that their permit numbers could get out of sync. This PR should help with those out-of-sync issues involving permits by getting things back on track when they're out of sync: https://github.com/apache/pulsar/pull/9789
   Also, I created this PR to add logging when there's a communication issue when sending available permits to the broker: https://github.com/apache/pulsar/pull/10166
   
   However, it's still not clear to me how negative permits are getting calculated when only a single consumer exists for a shared subscription. The consumer either increases permits or resets them to 0. The consumer never decrements permits. This may explain why I was unable to see negative permits on the consumer in the function heap dumps. 
   I'm hoping that @rdhabalia will chime in to share some insight into this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840221892


   After adding a lot of debug log lines, this is what **healthy traffic** should look like for the ack flow (starting from when the producer triggers the SEND command to when the SEND_RECEIPT command is sent):
   
   ```
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - Calling ServerCnx.handleSend(..)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/10.20.69.37:50086] Received send message request. producer: 0:242435 example-cluster-235-0:242435 size: 37566, partition key is: null, ordering key is null
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - Called startSendOperation(..) in ServerCnx.handleSend(..) for producer Producer{topic=PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}, client=/10.20.69.37:50086, producerName=example-cluster-235-0, producerId=0} with messages 12
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - send.hasHighestSequenceId() (true) && send.getSequenceId() (242435) <= send.getHighestSequenceId() (242446)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.Producer - Calling Producer.publishMessageToTopic([headersAndPayload], 242435, 242446, 12, false
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - Calling PersistentTopic.publishMessage([headersAndPayload], publishContext) for publishContext: Producer.MessagePublishContext(producer=Producer{topic=PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}, client=/10.20.69.37:50086, producerName=example-cluster-235-0, producerId=0}, sequenceId=242435, ledgerId=-1, entryId=-1, rateIn=org.apache.pulsar.common.stats.Rate@1bcfab4e, msgSize=37566, batchSize=12, chunked=false, startTimeNs=15631623317576155, originalProducerName=null, originalSequenceId=-1, highestSequenceId=242446, originalHighestSequenceId=-1, recyclerHandle=io.netty.util.Recycler$DefaultHandle@5afb09fa)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - pendingWriteOps is: 1
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - isFenced is: false
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - PersistentTopic.publishMessage(..) status is NotDup. PublishContext is Producer.MessagePublishContext(producer=Producer{topic=PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}, client=/10.20.69.37:50086, producerName=example-cluster-235-0, producerId=0}, sequenceId=242435, ledgerId=-1, entryId=-1, rateIn=org.apache.pulsar.common.stats.Rate@1bcfab4e, msgSize=37566, batchSize=12, chunked=false, startTimeNs=15631623317576155, originalProducerName=null, originalSequenceId=-1, highestSequenceId=242446, originalHighestSequenceId=-1, recyclerHandle=io.netty.util.Recycler$DefaultHandle@5afb09fa)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - Calling PersistentTopic.asyncAddEntry(..) for PublishContext Producer.MessagePublishContext(producer=Producer{topic=PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}, client=/10.20.69.37:50086, producerName=example-cluster-235-0, producerId=0}, sequenceId=242435, ledgerId=-1, entryId=-1, rateIn=org.apache.pulsar.common.stats.Rate@1bcfab4e, msgSize=37566, batchSize=12, chunked=false, startTimeNs=15631623317576155, originalProducerName=null, originalSequenceId=-1, highestSequenceId=242446, originalHighestSequenceId=-1, recyclerHandle=io.netty.util.Recycler$DefaultHandle@5afb09fa)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] asyncAddEntry size=37566 state=LedgerOpened
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - In OpAddEntry.createOpAddEntry, OpAddEntry is myTenant/myNamespace/persistent/bot-filtered
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Created new OpAddEntry myTenant/myNamespace/persistent/bot-filtered
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,177 [pulsar-io-28-60] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - About to schedule internalAsyncAddEntry(..) on myTenant/myNamespace/persistent/bot-filtered in ManagedLedgerImpl.asyncAddEntry(..)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling ManagedLedgerImpl.internalAsyncAddEntry(..) for OpAddEntry myTenant/myNamespace/persistent/bot-filtered
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - pendingAddEntries is size 1
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - State is LedgerOpened
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling else in ManagedLedgerImpl.internalAsyncAddEntry(..) for state LedgerOpened
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Write into current ledger lh=3219013 entries=7
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,178 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Calling OpAddEntry.initiate() for OpAddEntry myTenant/myNamespace/persistent/bot-filtered
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,179 [BookKeeperClientWorker-OrderedExecutor-37-0] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - [myTenant/myNamespace/persistent/bot-filtered] [myTenant/myNamespace/persistent/bot-filtered] write-complete: ledger-id=3219013 entry-id=6 size=37566 rc=0
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Calling OpAddEntry.safeRun(). firstInQueue OpAddEntry is myTenant/myNamespace/persistent/bot-filtered
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Calling else where cb != null in OpAddEntry.safeRun()
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.pulsar.broker.service.persistent.PersistentTopic - Calling PersistentTopic.addComplete(..) with PublishContext Producer.MessagePublishContext(producer=Producer{topic=PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}, client=/10.20.69.37:50086, producerName=example-cluster-235-0, producerId=0}, sequenceId=242435, ledgerId=-1, entryId=-1, rateIn=org.apache.pulsar.common.stats.Rate@1bcfab4e, msgSize=37566, batchSize=12, chunked=false, startTimeNs=15631623317576155, originalProducerName=null, originalSequenceId=-1, highestSequenceId=242446, originalHighestSequenceId=-1, recyclerHandle=io.netty.util.Recycler$DefaultHandle@5afb09fa)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.pulsar.broker.service.Producer - Calling Producer.MessagePublishContext.completed([exception], 3219013, 6)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [bookkeeper-ml-workers-OrderedExecutor-5-0] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}] [example-cluster-235-0] [0] triggered send callback. cnx /10.20.69.37:50086, sequenceId 242435
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.Producer - Calling Producer.MessagePublishContext.run()
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.Producer - [PersistentTopic{topic=persistent://myTenant/myNamespace/bot-filtered}] [example-cluster-235-0] [0] Persisted message. cnx org.apache.pulsar.broker.service.ServerCnx@4f9ad6cb, sequenceId 242435
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.Producer - About to call producer.cnx.getCommandSender().sendSendReceiptResponse(..)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.PulsarCommandSenderImpl - Calling PulsarCommandSenderImpl.sendSendReceiptResponse(0,242435,242446,3219013,6)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - Calling ServerCnx.completedSendOperation(false,37566)
   May 12 13:54:31 fab08 822c6b930733: 2021-05-12T19:54:31,180 [pulsar-io-28-60] DEBUG org.apache.pulsar.broker.service.ServerCnx - pendingSendRequest 1 and resumeReadsThreshold 500
   ```
   
   In the unhealthy case, we don't see that at all. I'm not even seeing activity on that thread. The only activity in the logs that I'm seeing on that broker is this:
   ```
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,608 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,608 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,608 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,608 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,659 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,659 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,659 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,659 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,711 [bookkeeper-ml-scheduler-OrderedScheduler-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,711 [bookkeeper-ml-scheduler-OrderedScheduler-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,711 [bookkeeper-ml-scheduler-OrderedScheduler-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,711 [bookkeeper-ml-scheduler-OrderedScheduler-5-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,745 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,745 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,745 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,745 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,777 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,777 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,777 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,777 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,810 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,810 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,810 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,810 [bookkeeper-ml-scheduler-OrderedScheduler-1-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,847 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,847 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,847 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   May 12 14:20:51 fab08 822c6b930733: 2021-05-12T20:20:51,847 [bookkeeper-ml-scheduler-OrderedScheduler-7-0] DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   [followed by ping/ping activity]
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-827768089


   I was able to reproduce this issue on a cluster with a single broker and a partitioned topic (with 4 partitions) sending messages to a single filter function. I was also able to reproduce it using a non-partitioned topic after allowing backlog to accumulate (by stopping the function) and then resuming the function after several thousand messages accumulated on the backlog. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-817004087


   I've confirmed that permits are positive (and appear healthy) on ALL the functions' ConsumerImpl objects when the _brokers_ reported **negative** permits for the topics in this location:
   
   - `broker.service.Consumer`
   
   Interestingly the `totalAvailablePermits` on `broker.service.persistent.PersistentDispatcherMultipleConsumers` == 0 for each topic.
   
   I was a little surprised that `PersistentDispatcherMultipleConsumers` was used instead of `PersistentDispatcherSingleActiveConsumer` for the topics because each topic only had a single subscription with a single consumer, but heap dumps don't lie. 
   
   We also saw the negative permits on partitioned topics and non-partitioned topics. 
   
   We've also confirmed that after a very large amount of time, the messages will eventually flow through. It's like stop-and-go traffic. A few messages get through and then permits go back to negative, there's a pause, and then a few more messages get through, and the process repeats. So, messages that would typically take seconds to clear the pipe take hours to clear. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842618044


   Here are the broker.conf settings we're using:
   
   ```
   zookeeperServers=server08.example.com:2181,server09.example.com:2181,server10.example.com:2181
   configurationStoreServers=server08.example.com:2181,server09.example.com:2181,server10.example.com:2181
   brokerServicePort=6650
   webServicePort=8080
   brokerServicePortTls=6651
   webServicePortTls=8443
   bindAddress=0.0.0.0
   numIOThreads=
   numHttpServerThreads=
   isRunningStandalone=
   clusterName=pulsar-example-cluster
   failureDomainsEnabled=false
   zooKeeperSessionTimeoutMillis=30000
   brokerShutdownTimeoutMs=60000
   backlogQuotaCheckEnabled=true
   backlogQuotaCheckIntervalInSeconds=60
   backlogQuotaDefaultLimitGB=10
   backlogQuotaDefaultRetentionPolicy=producer_request_hold
   brokerDeleteInactiveTopicsEnabled=true
   brokerDeleteInactiveTopicsFrequencySeconds=60
   messageExpiryCheckIntervalInMinutes=5
   activeConsumerFailoverDelayTimeMillis=1000
   subscriptionExpirationTimeMinutes=0
   subscriptionRedeliveryTrackerEnabled=true
   subscriptionExpiryCheckIntervalInMinutes=5
   brokerDeduplicationEnabled=false
   brokerDeduplicationMaxNumberOfProducers=10000
   brokerDeduplicationEntriesInterval=1000
   brokerDeduplicationProducerInactivityTimeoutMinutes=360
   defaultNumberOfNamespaceBundles=64
   clientLibraryVersionCheckEnabled=false
   statusFilePath=
   preferLaterVersions=false
   maxUnackedMessagesPerConsumer=0
   maxUnackedMessagesPerSubscription=0
   maxUnackedMessagesPerBroker=0
   maxUnackedMessagesPerSubscriptionOnBrokerBlocked=0.16
   subscribeThrottlingRatePerConsumer=0
   subscribeRatePeriodPerConsumerInSecond=30
   dispatchThrottlingRatePerTopicInMsg=0
   dispatchThrottlingRatePerTopicInByte=0
   dispatchThrottlingRatePerSubscriptionInMsg=0
   dispatchThrottlingRatePerSubscribeInByte=0
   dispatchThrottlingOnNonBacklogConsumerEnabled=true
   maxConcurrentLookupRequest=50000
   maxConcurrentTopicLoadRequest=5000
   maxConcurrentNonPersistentMessagePerConnection=1000
   numWorkerThreadsForNonPersistentTopic=8
   enablePersistentTopics=true
   enableNonPersistentTopics=true
   enableRunBookieTogether=false
   enableRunBookieAutoRecoveryTogether=false
   maxProducersPerTopic=0
   maxConsumersPerTopic=0
   maxConsumersPerSubscription=0
   brokerServiceCompactionMonitorIntervalInSeconds=60
   proxyRoles=
   authenticateOriginalAuthData=false
   tlsEnabled=true
   tlsCertificateFilePath=/path/to/auth/broker.cert.example.pem
   tlsKeyFilePath=/path/to/auth/broker.key-pk8.example.pem
   tlsTrustCertsFilePath=/path/to/auth/ca.cert.example.pem
   tlsAllowInsecureConnection=false
   tlsProtocols=TLSv1.2,TLSv1.1,TLSv1
   tlsCiphers=ECDHE-ECDSA-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-ECDSA-CHACHA20-POLY1305,ECDHE-RSA-CHACHA20-POLY1305,ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256,ECDHE-ECDSA-AES256-SHA,ECDHE-RSA-AES256-SHA,ECDHE-ECDSA-AES128-SHA,ECDHE-RSA-AES128-SHA,ECDHE-PSK-CHACHA20-POLY1305,AES256-GCM-SHA384,AES128-GCM-SHA256,ECDHE-PSK-AES256-CBC-SHA,AES256-SHA,PSK-AES256-CBC-SHA,ECDHE-PSK-AES128-CBC-SHA,AES128-SHA,PSK-AES128-CBC-SHA
   tlsRequireTrustedClientCertOnConnect=false
   authenticationEnabled=true
   authenticationProviders=org.apache.pulsar.broker.authentication.AuthenticationProviderToken
   authorizationEnabled=true
   authorizationProvider=org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider
   authorizationAllowWildcardsMatching=false
   superUserRoles=examplesuperuser,etc
   brokerClientTlsEnabled=false
   brokerClientAuthenticationPlugin=org.apache.pulsar.client.impl.auth.AuthenticationToken
   brokerClientAuthenticationParameters=token:abc...defg
   brokerClientTrustCertsFilePath=/path/to/auth/ca.cert.example.pem
   athenzDomainNames=
   anonymousUserRole=anonymous
   tokenPublicKey=file:///path/to/auth/public.key
   bookkeeperClientAuthenticationPlugin=
   bookkeeperClientAuthenticationParametersName=
   bookkeeperClientAuthenticationParameters=
   bookkeeperClientTimeoutInSeconds=30
   bookkeeperClientSpeculativeReadTimeoutInMillis=0
   bookkeeperUseV2WireProtocol=true
   bookkeeperClientHealthCheckEnabled=true
   bookkeeperClientHealthCheckIntervalSeconds=60
   bookkeeperClientHealthCheckErrorThresholdPerInterval=5
   bookkeeperClientHealthCheckQuarantineTimeInSeconds=1800
   bookkeeperClientRackawarePolicyEnabled=true
   bookkeeperClientRegionawarePolicyEnabled=false
   bookkeeperClientReorderReadSequenceEnabled=false
   bookkeeperClientIsolationGroups=
   bookkeeperEnableStickyReads=true
   managedLedgerDefaultEnsembleSize=2
   managedLedgerDefaultWriteQuorum=2
   managedLedgerDefaultAckQuorum=2
   managedLedgerDigestType=CRC32C
   managedLedgerNumWorkerThreads=8
   managedLedgerNumSchedulerThreads=8
   managedLedgerCacheSizeMB=
   managedLedgerCacheEvictionWatermark=0.9
   managedLedgerDefaultMarkDeleteRateLimit=1.0
   managedLedgerMaxEntriesPerLedger=10
   managedLedgerMinLedgerRolloverTimeMinutes=0
   managedLedgerMaxLedgerRolloverTimeMinutes=240
   managedLedgerOffloadDeletionLagMs=14400000
   managedLedgerCursorMaxEntriesPerLedger=50000
   managedLedgerCursorRolloverTimeInSeconds=14400
   managedLedgerMaxUnackedRangesToPersist=10000
   managedLedgerMaxUnackedRangesToPersistInZooKeeper=1000
   autoSkipNonRecoverableData=false
   managedLedgerMetadataOperationsTimeoutSeconds=60
   managedLedgerReadEntryTimeoutSeconds=120
   managedLedgerAddEntryTimeoutSeconds=120
   loadBalancerEnabled=true
   loadBalancerReportUpdateThresholdPercentage=10
   loadBalancerReportUpdateMaxIntervalMinutes=15
   loadBalancerHostUsageCheckIntervalMinutes=1
   loadBalancerSheddingEnabled=true
   loadBalancerSheddingIntervalMinutes=1
   loadBalancerSheddingGracePeriodMinutes=30
   loadBalancerBrokerMaxTopics=50000
   loadBalancerBrokerOverloadedThresholdPercentage=85
   loadBalancerResourceQuotaUpdateIntervalMinutes=15
   loadBalancerAutoBundleSplitEnabled=true
   loadBalancerAutoUnloadSplitBundlesEnabled=true
   loadBalancerNamespaceBundleMaxTopics=1000
   loadBalancerNamespaceBundleMaxSessions=1000
   loadBalancerNamespaceBundleMaxMsgRate=30000
   loadBalancerNamespaceBundleMaxBandwidthMbytes=100
   loadBalancerNamespaceMaximumBundles=128
   loadBalancerOverrideBrokerNicSpeedGbps=
   loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl
   replicationMetricsEnabled=true
   replicationConnectionsPerBroker=16
   replicationProducerQueueSize=1000
   replicatorPrefix=pulsar.repl
   defaultRetentionTimeInMinutes=0
   defaultRetentionSizeInMB=0
   keepAliveIntervalSeconds=30
   bootstrapNamespaces=
   webSocketServiceEnabled=false
   webSocketNumIoThreads=8
   webSocketConnectionsPerBroker=8
   webSocketSessionIdleTimeoutMillis=300000
   exposeTopicLevelMetricsInPrometheus=true
   exposeConsumerLevelMetricsInPrometheus=false
   functionsWorkerEnabled=true
   exposePublisherStats=true
   statsUpdateFrequencyInSecs=60
   statsUpdateInitialDelayInSecs=60
   schemaRegistryStorageClassName=org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorageFactory
   isSchemaValidationEnforced=false
   offloadersDirectory=./offloaders
   globalZookeeperServers=
   replicationTlsEnabled=false
   brokerServicePurgeInactiveFrequencyInSeconds=60
   advertisedAddress=
   acknowledgmentAtBatchIndexLevelEnabled=true
   unblockStuckSubscriptionEnabled=true
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] rodrigoreis commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

rodrigoreis commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-765380753


   > Guys any one, or can anyone provide info on how to reproduce this issue?
   
   I'm spot this when I increase the clients number.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-658548718


   @sijie Unfortunately, I don't. What's the best way for us to get a heap dump when this happens again? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-817004087


   I've confirmed that permits are positive (and appear healthy) on ALL the functions' ConsumerImpl objects when the _brokers_ reported **negative** permits for the topics in this location:
   
   - `broker.service.Consumer`
   
   I was a little surprised that `PersistentDispatcherMultipleConsumers` was used instead of `PersistentDispatcherSingleActiveConsumer` for the topics because each topic only had a single subscription with a single consumer, but heap dumps don't lie. 
   
   We also saw the negative permits on partitioned topics and non-partitioned topics. 
   
   We've also confirmed that after a very large amount of time, the messages will eventually flow through. It's like stop-and-go traffic. A few messages get through and then permits go back to negative, there's a pause, and then a few more messages get through, and the process repeats. So, messages that would typically take seconds to clear the pipe take hours to clear. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-843580510


   Rajan pointed out that `updateLedgersIdsComplete` is called from a different thread, and it only happens when there's a ledger rollover, so I'm looking upstream from it to see if I can find where the slowdown is happening.
   Without the extra logs in `updateLedgersIdsComplete`, we still see a lot of this on the broker that isn't performing:
   
   ```
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] Ledger already created when timeout task is triggered
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [myTenant/myNamespace/persistent/bot-filtered] createComplete rc=-23 ledger=-1
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - checkAndCompleteLedgerOpTask(int rc, LedgerHandle lh, Object ctx) is true
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Calling checkAndCompleteLedgerOpTask(rc, lh, ctx)) on ManagedLedgerImpl.createComplete(..)
   ```
   
   So, `ManagedLedgerImpl` is still a likely suspect. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842605478


   I have a few settings in `pulsar_env.sh` that might be different:
   
   ```
   # Garbage collection options
   PULSAR_GC=" -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+AggressiveOpts -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB"
   
   # Extra options to be passed to the jvm
   PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} ${PULSAR_MEM} ${PULSAR_GC} -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024"
   PULSAR_EXTRA_OPTS="$PULSAR_EXTRA_OPTS -XX:+PreserveFramePointer"
   
   # Kerberos support for Kafka Connectors
   PULSAR_EXTRA_OPTS="${PULSAR_EXTRA_OPTS} -Djava.security.krb5.conf=/pulsar/conf/krb5-prod.conf -Dsun.security.krb5.debug=false"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari closed issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari closed issue #6054:
URL: https://github.com/apache/pulsar/issues/6054


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-818486009


   One possibility is that we're passing the wrong value (or maybe there's a race condition) when we get the number of messages when updating the permits. 
   We use the value https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java#L517
   which is set [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L152)
   from [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L125)
   which is obtained [here](https://github.com/apache/pulsar/blob/197bd93150a222a6ede72758eeb1d2b276dd728f/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/AbstractBaseDispatcher.java#L88).
   
   That value, `msgMetadata.getNumMessagesInBatch()`, comes from here:
   ```
       public static MessageMetadata peekMessageMetadata(ByteBuf metadataAndPayload, String subscription,
               long consumerId) {
           try {
               // save the reader index and restore after parsing
               int readerIdx = metadataAndPayload.readerIndex();
               skipBrokerEntryMetadataIfExist(metadataAndPayload);
               MessageMetadata metadata = Commands.parseMessageMetadata(metadataAndPayload);
               metadataAndPayload.readerIndex(readerIdx);
   
               return metadata;
           } catch (Throwable t) {
               log.error("[{}] [{}] Failed to parse message metadata", subscription, consumerId, t);
               return null;
           }
       }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-870963659


   Related to https://github.com/apache/pulsar/issues/10813 except that issue seems to not occur if batching is disabled. 
   
   As an update, I've reproduced this bug even after correcting the non-standard broker.conf settings that were mentioned earlier in this issue. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-841851244


   > I also noticed this warning in the logs after it had been stuck for a while, but I'm not sure if it's related:
   
   @devinbost  Very interesting that there are some Zookeeper related issues happening at the same time (`org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 10302ms for session id 0x1028ed903dd038f`) . 
   
   It would be very useful to get a thread dump when this is going on. My assumption is that this type of problems aren't really a Zookeeper server issue, but Zookeeper client thread getting blocked which causes timeouts. There's ony sign of this in issue #10418 . 
   
   For creating the thread dumps, please use `jstack`/`jcmd`/`kill -3`. The earlier thread dumps looked like they were created with the debugger. It's better to reproduce the issue and create thread dumps without the debugger attached since the JVM can behave differently in debugging mode. How were the previous thread dumps created?
   
   What version of Pulsar are you running? (can you reference a Git SHA for the version if it's based on master branch?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-833830025


   For some reason, this line isn't getting called:  https://github.com/devinbost/pulsar/blob/959a04fee8becd78ff655a646a9d0d3402f20c39/[…]t/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java
   `cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());`
   
   I added more debug lines, and the debug lines just before and just after that method are getting called. 
   ```
          @Override
           public void run() {
               log.debug("Calling ProducerImpl.WriteInEventLoopCallback.run(..) for OpSendMsg {}",
                       op.toString());
               if (log.isDebugEnabled()) {
                   log.debug("[{}] [{}] Sending message cnx {}, sequenceId {}", producer.topic, producer.producerName, cnx,
                           sequenceId);
               }
               try {
                   cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());
                   log.debug("Called ProducerImpl.WriteInEventLoopCallback.run(..) for OpSendMsg {}",
                           op.toString());
                   op.updateSentTimestamp();
               } finally {
                   log.debug("About to call recycle() in ProducerImpl.WriteInEventLoopCallback");
                   recycle();
               }
           }
   ```
   The reason I think there's a problem with `cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());` is because I just randomly stop receiving calls to `PulsarDecoder.channelRead(..)` for the case `SEND_RECEIPT` (https://github.com/apache/pulsar/blob/3bd443cc7056417a8321998b62224035253ace18/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/PulsarDecoder.java#L219), so the command isn't getting through the Netty channel. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-817004087


   I've confirmed that permits are positive (and appear healthy) on ALL the functions' ConsumerImpl objects when the _brokers_ reported **negative** permits for the topics in two locations:
   
   - `broker.service.persistent.PersistentDispatcherMultipleConsumers`
   - `broker.service.Consumer`
   
   I was a little surprised that `PersistentDispatcherMultipleConsumers` was used instead of `PersistentDispatcherSingleActiveConsumer` for the topics because each topic only had a single subscription with a single consumer, but heap dumps don't lie. 
   
   We also saw the negative permits on partitioned topics and non-partitioned topics. 
   
   We've also confirmed that after a very large amount of time, the messages will eventually flow through. It's like stop-and-go traffic. A few messages get through and then permits go back to negative, there's a pause, and then a few more messages get through, and the process repeats. So, messages that would typically take seconds to clear the pipe take hours to clear. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-797005237


   @sijie We noticed that the subscription after the frozen topic has a high number of negative permits.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] eolivelli commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

eolivelli commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-864531877


   @JohnMops did you update to Pulsar 2.8.0?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842626569


   I've added a lot of logging in this build: https://github.com/devinbost/pulsar/tree/debug_acks
   
   My log4j2.yaml file looks like this:
   
   ```
   Configuration:
     status: DEBUG
     monitorInterval: 30
     name: pulsar
     packages: io.prometheus.client.log4j2
   
     Properties:
       Property:
         - name: "pulsar.log.dir"
           value: "logs"
         - name: "pulsar.log.file"
           value: "pulsar.log"
         - name: "pulsar.log.appender"
           value: "RoutingAppender"
         - name: "pulsar.log.level"
           value: "trace"
         - name: "pulsar.routing.appender.default"
           value: "Console"
   
     # Example: logger-filter script
     Scripts:
       ScriptFile:
         name: filter.js
         language: JavaScript
         path: ./conf/log4j2-scripts/filter.js
         charset: UTF-8
   
     Appenders:
   
       # Console
       Console:
         name: Console
         target: SYSTEM_OUT
         PatternLayout:
           Pattern: "%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"
   
       # Rolling file appender configuration
       RollingFile:
         name: RollingFile
         fileName: "${sys:pulsar.log.dir}/${sys:pulsar.log.file}"
         filePattern: "${sys:pulsar.log.dir}/${sys:pulsar.log.file}-%d{MM-dd-yyyy}-%i.log.gz"
         immediateFlush: false
         PatternLayout:
           Pattern: "%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"
         Policies:
           TimeBasedTriggeringPolicy:
             interval: 1
             modulate: true
           SizeBasedTriggeringPolicy:
             size: 1 GB
         # Delete file older than 30days
         DefaultRolloverStrategy:
             Delete:
               basePath: ${sys:pulsar.log.dir}
               maxDepth: 2
               IfFileName:
                 glob: "*/${sys:pulsar.log.file}*log.gz"
               IfLastModified:
                 age: 30d
   
       Prometheus:
         name: Prometheus
   
       # Routing
       Routing:
         name: RoutingAppender
         Routes:
           pattern: "$${ctx:function}"
           Route:
             -
               Routing:
                 name: InstanceRoutingAppender
                 Routes:
                   pattern: "$${ctx:instance}"
                   Route:
                     -
                       RollingFile:
                         name: "Rolling-${ctx:function}"
                         fileName : "${sys:pulsar.log.dir}/functions/${ctx:function}/${ctx:functionname}-${ctx:instance}.log"
                         filePattern : "${sys:pulsar.log.dir}/functions/${sys:pulsar.log.file}-${ctx:instance}-%d{MM-dd-yyyy}-%i.log.gz"
                         PatternLayout:
                           Pattern: "%d{ABSOLUTE} %level{length=5} [%thread] [instance: %X{instance}] %logger{1} - %msg%n"
                         Policies:
                           TimeBasedTriggeringPolicy:
                             interval: 1
                             modulate: true
                           SizeBasedTriggeringPolicy:
                             size: "20MB"
                           # Trigger every day at midnight that also scan
                           # roll-over strategy that deletes older file
                           CronTriggeringPolicy:
                             schedule: "0 0 0 * * ?"
                         # Delete file older than 30days
                         DefaultRolloverStrategy:
                             Delete:
                               basePath: ${sys:pulsar.log.dir}
                               maxDepth: 2
                               IfFileName:
                                 glob: "*/${sys:pulsar.log.file}*log.gz"
                               IfLastModified:
                                 age: 30d
                     - ref: "${sys:pulsar.routing.appender.default}"
                       key: "${ctx:function}"
             - ref: "${sys:pulsar.routing.appender.default}"
               key: "${ctx:function}"
   
     Loggers:
   
       # Default root logger configuration
       Root:
         level: warn
         additivity: true
         AppenderRef:
           - ref: "${sys:pulsar.log.appender}"
             level: "trace"
           - ref: Prometheus
             level: warn
   
       Logger:
         - name: org.apache.pulsar.broker.service
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.pulsar.common
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.pulsar.client
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.bookkeeper.mledger.impl.OpAddEntry
           level: trace
           ref: "${sys:pulsar.log.appender}"
         - name: org.apache.bookkeeper.bookie.BookieShell
           level: warn
           additivity: false
           AppenderRef:
             - ref: Console
         - name: verbose
           level: warn
           additivity: false
           AppenderRef:
             - ref: Console
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-843541155


   I dropped some timestamps into `ManagedLedgerImpl.updateLedgersIdsComplete(..)`, and I'm not seeing anything glaring yet. Adding logs like this:
   
   ```
       public synchronized void updateLedgersIdsComplete(Stat stat) {
           STATE_UPDATER.set(this, State.LedgerOpened);
           lastLedgerCreatedTimestamp = clock.millis();
   
           if (log.isDebugEnabled()) {
               log.debug("[{}] Resending {} pending messages", name, pendingAddEntries.size());
           }
   
           // Avoid use same OpAddEntry between different ledger handle
           long priorNano = System.nanoTime();
           int pendingSize = pendingAddEntries.size();
           OpAddEntry existsOp;
           do {
               log.debug("updateLedgersIdsComplete 1. Nanoseconds is: {}", System.nanoTime() - priorNano);
               priorNano = System.nanoTime();
               existsOp = pendingAddEntries.poll();
               //log.debug("In ManagedLedgerImpl.updateLedgersIdsComplete(..), we're processing OpAddEntry {}",
               //        existsOp != null ? existsOp.toString() : null);
               if (existsOp != null) {
                   // If op is used by another ledger handle, we need to close it and create a new one
                   if (existsOp.ledger != null) {
                       log.debug("updateLedgersIdsComplete 2. Nanoseconds is: {}", System.nanoTime() - priorNano);
                       priorNano = System.nanoTime();
                       existsOp.close();
                       log.debug("updateLedgersIdsComplete 3. Nanoseconds is: {}", System.nanoTime() - priorNano);
                       priorNano = System.nanoTime();
                       existsOp = OpAddEntry.create(existsOp.ml, existsOp.data, existsOp.getNumberOfMessages(), existsOp.callback, existsOp.ctx);
                       log.debug("updateLedgersIdsComplete 4. Nanoseconds is: {}", System.nanoTime() - priorNano);
                       priorNano = System.nanoTime();
                   }
                   existsOp.setLedger(currentLedger);
                   log.debug("updateLedgersIdsComplete 5. Nanoseconds is: {}", System.nanoTime() - priorNano);
                   priorNano = System.nanoTime();
                   if (beforeAddEntry(existsOp)) {
                       log.debug("updateLedgersIdsComplete 6. Nanoseconds is: {}", System.nanoTime() - priorNano);
                       priorNano = System.nanoTime();
                       pendingAddEntries.add(existsOp);
                       log.debug("updateLedgersIdsComplete 7. Nanoseconds is: {}", System.nanoTime() - priorNano);
                       priorNano = System.nanoTime();
                   }
               }
           } while (existsOp != null && --pendingSize > 0);
   
           // Process all the pending addEntry requests
           for (OpAddEntry op : pendingAddEntries) {
               //log.debug("In ManagedLedgerImpl.updateLedgersIdsComplete(..), processing list of pendingAddEntries, starting"
               //                + " with OpAddEntry {}",
               //        op != null ? op.toString() : null);
               log.debug("updateLedgersIdsComplete 8. Nanoseconds is: {}", System.nanoTime() - priorNano);
               priorNano = System.nanoTime();
               ++currentLedgerEntries;
               currentLedgerSize += op.data.readableBytes();
   
               if (log.isDebugEnabled()) {
                   log.debug("[{}] Sending {}", name, op);
               }
   
               if (currentLedgerIsFull()) {
                   log.debug("updateLedgersIdsComplete 9. Nanoseconds is: {}", System.nanoTime() - priorNano);
                   priorNano = System.nanoTime();
                   STATE_UPDATER.set(this, State.ClosingLedger);
                   op.setCloseWhenDone(true);
                   log.debug("updateLedgersIdsComplete 10. Nanoseconds is: {}", System.nanoTime() - priorNano);
                   priorNano = System.nanoTime();
                   op.initiate();
                   log.debug("updateLedgersIdsComplete 11. Nanoseconds is: {}", System.nanoTime() - priorNano);
                   if (log.isDebugEnabled()) {
                       log.debug("[{}] Stop writing into ledger {} queue={}", name, currentLedger.getId(),
                               pendingAddEntries.size());
                   }
                   break;
               } else {
                   op.initiate();
                   log.debug("updateLedgersIdsComplete 12. Nanoseconds is: {}", System.nanoTime() - priorNano);
               }
           }
       }
   ```
   gave this output (plus many thousand more similar lines):
   
   ```
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 246
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 35
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 335
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 31
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 64
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 1. Nanoseconds is: 24
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 292
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 34
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 401
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 31
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 82
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 1. Nanoseconds is: 23
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 234
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 101
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 447
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 31
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 60
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 1. Nanoseconds is: 24
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 176
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 123
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 362
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 31
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 32
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 88
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 1. Nanoseconds is: 25
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 236
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 123
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 353
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 32
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 58
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 1. Nanoseconds is: 25
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 2. Nanoseconds is: 233
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 3. Nanoseconds is: 36
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 4. Nanoseconds is: 739
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 5. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 6. Nanoseconds is: 29
   DEBUG org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - updateLedgersIdsComplete 7. Nanoseconds is: 90
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] kemburi edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

kemburi edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-742459754


   Guys any one, or can anyone provide info on how to reproduce this issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-800868920

-> `first-function` -> `first-topic` -> `second-function` -> `second-topic` -> `third-function` -> `third-topic` -> fourth-function ->

The `second-topic` is the one that froze and started accumulating backlog.

`third-topic` has pendingAddEntriesCount = 81, indicating it's waiting for write requests to complete.
`first-topic` and `second-topic` have pendingAddEntriesCount = 0

`third-topic` is in the state ClosingLedger.
`first-topic` and `second-topic` are in the state LedgerOpened

`second-topic`'s cursor has markDeletePosition = 17525:0 and `readPosition` = 17594:9
`third-topic`'s cursor has markDeletePosition = 17551:9 and `readPosition` = 17551:10

So, the `third-topic`'s cursor's `readPosition` is adjacent to its markDeletePosition.
However, `second-topic`'s cursor's `readPosition` is farther ahead than `third-topic`'s `readPosition`.

Is that unusual for a downstream topic's cursor to have a `readPosition` farther ahead (larger number) than the `readPosition` of the topic immediately upstream from it when the downstream topic's only source of messages is that upstream topic and not more than a few hundred thousand messages have been sent through the pipe?

[stats.zip](https://github.com/apache/pulsar/files/6154471/stats.zip)

I've also attached the broker thread dump: [thread_dump_3-16.txt](https://github.com/apache/pulsar/files/6154506/thread_dump_3-16.txt)

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-801518480


   **I found a huge clue on this issue** (in Pulsar 2.6.3).
   Several hours after we collected those stats, I got a heap dump on the function that was consuming from the dead topic, and its permits were actually positive. 
   (In the heap dump of `third-function`, in `ConsumerImpl`, the function's `availablePermits` were positive.)
   ![image](https://user-images.githubusercontent.com/7418031/111550705-44feb080-8744-11eb-97f5-f8dd0ecfc722.png)
   
   This morning, I checked the stats on the `third-topic` again, and the permits had gone back to positive, and its backlogs were all gone. 
   
   I found in the Pulsar source code that the consumer is supposed to periodically report to the broker what its available permits are (https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerImpl.java#L112).  It seems like the function must have eventually reported its permits to the broker, which brought the permits back into the positive, so the broker resumed pushing messages to the consumer. It's possible that it locked up again after accumulating more negative permits and then cleared after a while and repeated this process until eventually all the messages cleared the backlog. 
   
   To confirm if the issue is just the broker getting out of sync with the consumer's availablePermits, I need to reproduce the issue again and get a function heapdump when the topic permits are negative to verify if the function permits are positive at that point or if they're also negative. If they're also negative, then there's something happening in the consumer that's somehow making them positive after a while.
   
   It looks like this PR may fix some of the permit issues: 
   https://github.com/apache/pulsar/pull/7266
   However, we're not doing any explicit batching with the Pulsar functions. Do they batch by default? Maybe @jerrypeng or someone can provide some guidance here. 
   If there isn't batching taking place, then that PR won't completely solve this issue because these findings would suggest that there's another reason (other than incorrect batch reporting of permits, as mentioned in that PR) that the permits are getting out of sync between the broker and function. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] kemburi commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

kemburi commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-741621140


   pulsar client version 2.6.2


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-875106224

After spending many hours digging through the ack paths and not finding any issues, I took another look at the client (`ProducerImpl`), and I noticed something interesting in the heap dump.
`ProducerImpl` is in a `Connecting` state, so it's waiting to get the connection.
However, `[ProducerImpl].connectedSince` says it's been connected for about 3 days...
If it's in a `Connecting` state, that implies it was disconnected at some point.

If there's a connectivity issue, that would explain why this bug has been so hard to reproduce and why it can't be reproduced locally. It could be that some network hardware is doing something weird with the connection, and the client doesn't handle it correctly and gets stuck in a `Connecting` state and doesn't try re-establishing the connection. So, it's just waiting forever for the connection to establish.
When in the `Connecting` state, in `ProducerImpl.sendAsync(..)` , `ProducerImpl.isValidProducerState(..)` returns true even though we haven't completed the establishment of the connection. When that method returns true, it allows the producer to enqueue messages.
Sure enough, `[ProducerImpl].connectionHandler.state.pendingMessages` contains all (exactly) 1000 of the OpSendMsg that are blocking the semaphore. (The semaphore blocks once 1000 messages accumulate that haven't been ack'd.)
It also has exactly 1000 null value entries for `pendingCallbacks`, but I'm not sure if that means anything.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842711900


   @lhotari @rdhabalia I reproduced this issue with batching DISABLED.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-827768089


   I was able to reproduce this issue on a cluster with a single broker and a partitioned topic (with 4 partitions) sending messages to a single filter function that was also running on that broker. I was also able to reproduce it using a non-partitioned topic after allowing backlog to accumulate (by stopping the function) and then resuming the function after several thousand messages accumulated on the backlog. 
   Traffic must be sufficiently high velocity to reproduce the issue. 
   We bulk-loaded data from NiFi to reproduce. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-850420985


   I have reported a separate issue about a direct memory leak, it's #10738 . It includes a full repro case with a helm deployment that can be used to reproduce the issue in minikube (requires sufficient RAM) (or any k8s environment).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-842582061


   @lhotari I've been creating fresh builds on master (or branches of master with extra logging) every couple of days to test changes on this issue.
   
   I was pulling the thread dumps out of my heap dumps using VisualVM. I can use jstack instead if that will give a better picture. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839495126


   I traced the call chain for where ack's should be getting sent, and I filled in some additional steps:
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, which get picked up by `ServerCnx.handleSend(..)`, which goes to `Producer.publishMessage(..)`, then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, which polls `pendingAddEntries`, gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, which calls `MessagePublishContext.run()`, which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], which writes a `newSendReceiptCommand` to the channel.
   
   So, I added a lot of logging, and to my great surprise, I'm getting an NPE after adding these debug lines to `OpAddEntry.createOpAddEntry(..)`:
   
   ```
       private static OpAddEntry createOpAddEntry(ManagedLedgerImpl ml, ByteBuf data, AddEntryCallback callback, Object ctx) {
           log.debug("Running OpAddEntry.createOpAddEntry(..)");
           OpAddEntry op = RECYCLER.get();
           log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           op.ml = ml;
           op.ledger = null;
           op.data = data.retain();
           op.dataLength = data.readableBytes();
           op.callback = callback;
           op.ctx = ctx;
           op.addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
           op.closeWhenDone = false;
           op.entryId = -1;
           op.startTime = System.nanoTime();
           op.state = State.OPEN;
           ml.mbean.addAddEntrySample(op.dataLength);
           log.debug("2. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           return op;
       }
   ```
   
   Here's the stack trace:
   
   ```
      2021-05-12T06:14:19,569 [pulsar-io-28-32] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
       2021-05-12T06:14:19,569 [pulsar-io-28-32] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.20.20.160:37696] Got exception java.lang.NullPointerException
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.toString(OpAddEntry.java:354)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.createOpAddEntry(OpAddEntry.java:100)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.create(OpAddEntry.java:81)
       	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncAddEntry(ManagedLedgerImpl.java:689)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.asyncAddEntry(PersistentTopic.java:428)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.publishMessage(PersistentTopic.java:404)
       	at org.apache.pulsar.broker.service.Producer.publishMessageToTopic(Producer.java:224)
       	at org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:161)
       	at org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:1372)
       	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:207)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
       	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1376)
       	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1265)
       	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1302)
       	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
       	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
       	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
       	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
       	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
       	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       	at java.lang.Thread.run(Thread.java:748)
   ```
   
   The only place we could be throwing an NPE is here:
   `log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);`
   
   So, `op` isn't null until the null check passes?
   
   Seems like this issue: https://github.com/apache/pulsar/issues/10433
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-840227041


   In the broker logs, for every (or almost every) case that we're seeing the broker receive the SEND command, it's creating a SEND_RECEIPT:
   
   ```
   $ cat messages_broker10.log | grep 'Received cmd SEND' | wc -l
      20806
   $ cat messages_broker10.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      20806
   
   $ cat messages_broker8.log | grep 'Received cmd SEND' | wc -l
      44654
   $ cat messages_broker8.log | grep 'PulsarCommandSenderImpl.sendSendReceiptResponse' | wc -l
      44651
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost removed a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost removed a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-833830025


   For some reason, this line isn't getting called:  https://github.com/devinbost/pulsar/blob/959a04fee8becd78ff655a646a9d0d3402f20c39/[…]t/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java
   `cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());`
   
   I added more debug lines, and the debug lines just before and just after that method are getting called. 
   ```
          @Override
           public void run() {
               log.debug("Calling ProducerImpl.WriteInEventLoopCallback.run(..) for OpSendMsg {}",
                       op.toString());
               if (log.isDebugEnabled()) {
                   log.debug("[{}] [{}] Sending message cnx {}, sequenceId {}", producer.topic, producer.producerName, cnx,
                           sequenceId);
               }
               try {
                   cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());
                   log.debug("Called ProducerImpl.WriteInEventLoopCallback.run(..) for OpSendMsg {}",
                           op.toString());
                   op.updateSentTimestamp();
               } finally {
                   log.debug("About to call recycle() in ProducerImpl.WriteInEventLoopCallback");
                   recycle();
               }
           }
   ```
   The reason I think there's a problem with `cnx.ctx().writeAndFlush(cmd, cnx.ctx().voidPromise());` is because I just randomly stop receiving calls to `PulsarDecoder.channelRead(..)` for the case `SEND_RECEIPT` (https://github.com/apache/pulsar/blob/3bd443cc7056417a8321998b62224035253ace18/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/PulsarDecoder.java#L219), so the command isn't getting through the Netty channel. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost edited a comment on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost edited a comment on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-839495126


   I traced the call chain for where ack's should be getting sent, and I filled in some additional steps:
   
   When `ProducerImpl` sends the messages, it builds `newSend` command instances, which get picked up by `ServerCnx.handleSend(..)`, which goes to `Producer.publishMessage(..)`, then `PersistentTopic.publishMessage(..)`
   which calls `asyncAddEntry(headersAndPayload, publishContext)`, which calls `[PersistentTopic].ledger.asyncAddEntry(headersAndPayload, (int) publishContext.getNumberOfMessages(), this, publishContext)`
   which creates the `OpAddEntry` and calls `internalAsyncAddEntry(addOperation)` on a different thread, which adds `OpAddEntry` to `[ManagedLedgerImpl].pendingAddEntries`
   
   From somewhere (it's not clear to me exactly where yet) we call `OpAddEntry.safeRun()`, which polls `pendingAddEntries`, gets the callback on `OpAddEntry` (which is the `PersistentTopic` instance) and calls `[PersistentTopic].addComplete(lastEntry, data.asReadOnly(), ctx)`, which calls `publishContext.completed(..)` on `Producer.MessagePublishContext`, which calls `Producer.ServerCnx.execute(this)` on the `MessagePublishContext`, which calls `MessagePublishContext.run()`, which triggers `Producer.ServerCnx.getCommandSender().sendSendReceiptResponse(..)` [SEND_RECEIPT], which writes a `newSendReceiptCommand` to the channel.
   From there, the client gets the `SEND_RECEIPT` command from `PulsarDecoder`, which calls `[ClientCnx].handleSendReceipt(..)`, which calls `[ProducerImpl].ackReceived(..)`, which calls `releaseSemaphoreForSendOp(..)` to release the semaphore. 
   The semaphore doesn't block until `maxPendingMessages` is reached, which is 1000 by default. 
   
   So, I added a lot of logging, and to my great surprise, I'm getting an NPE after adding these debug lines to `OpAddEntry.createOpAddEntry(..)`:
   
   ```
       private static OpAddEntry createOpAddEntry(ManagedLedgerImpl ml, ByteBuf data, AddEntryCallback callback, Object ctx) {
           log.debug("Running OpAddEntry.createOpAddEntry(..)");
           OpAddEntry op = RECYCLER.get();
           log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           op.ml = ml;
           op.ledger = null;
           op.data = data.retain();
           op.dataLength = data.readableBytes();
           op.callback = callback;
           op.ctx = ctx;
           op.addOpCount = ManagedLedgerImpl.ADD_OP_COUNT_UPDATER.incrementAndGet(ml);
           op.closeWhenDone = false;
           op.entryId = -1;
           op.startTime = System.nanoTime();
           op.state = State.OPEN;
           ml.mbean.addAddEntrySample(op.dataLength);
           log.debug("2. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);
           return op;
       }
   ```
   
   Here's the stack trace:
   
   ```
      2021-05-12T06:14:19,569 [pulsar-io-28-32] DEBUG org.apache.bookkeeper.mledger.impl.OpAddEntry - Running OpAddEntry.createOpAddEntry(..)
       2021-05-12T06:14:19,569 [pulsar-io-28-32] WARN  org.apache.pulsar.broker.service.ServerCnx - [/10.20.20.160:37696] Got exception java.lang.NullPointerException
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.toString(OpAddEntry.java:354)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.createOpAddEntry(OpAddEntry.java:100)
       	at org.apache.bookkeeper.mledger.impl.OpAddEntry.create(OpAddEntry.java:81)
       	at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncAddEntry(ManagedLedgerImpl.java:689)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.asyncAddEntry(PersistentTopic.java:428)
       	at org.apache.pulsar.broker.service.persistent.PersistentTopic.publishMessage(PersistentTopic.java:404)
       	at org.apache.pulsar.broker.service.Producer.publishMessageToTopic(Producer.java:224)
       	at org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:161)
       	at org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:1372)
       	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:207)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.flow.FlowControlHandler.dequeue(FlowControlHandler.java:200)
       	at io.netty.handler.flow.FlowControlHandler.channelRead(FlowControlHandler.java:162)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
       	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:311)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1376)
       	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1265)
       	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1302)
       	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
       	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
       	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
       	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
       	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
       	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
       	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
       	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
       	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
       	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
       	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
       	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
       	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
       	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
       	at java.lang.Thread.run(Thread.java:748)
   ```
   
   The only place we could be throwing an NPE is here:
   `log.debug("1. In OpAddEntry.createOpAddEntry, OpAddEntry is {}", op != null ? op.toString() : null);`
   
   So, `op` isn't null until the null check passes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] lhotari commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

lhotari commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-841851244


   > I also noticed this warning in the logs after it had been stuck for a while, but I'm not sure if it's related:
   
   @devinbost  Very interesting that there are some Zookeeper related issues happening at the same time (`org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 10302ms for session id 0x1028ed903dd038f`) . 
   
   It would be very useful to get a thread dump when this is going on. My assumption is that this type of problems aren't really a Zookeeper server issue, but Zookeeper client thread getting blocked which causes timeouts. There's ony sign of this in issue #10418 . 
   
   For creating the thread dumps, please use `jstack`/`jcmd`/`kill -3`. The earlier thread dumps looked like they were created with the debugger. It's better to reproduce the issue and create thread dumps without the debugger attached since the JVM can behave differently in debugging mode. How were the previous thread dumps created?
   
   What version of Pulsar are you running? (can you reference a Git SHA for the version if it's based on master branch?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] JohnMops commented on issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

JohnMops commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-864527745


   Any news regarding this one? 
   We are seeing a topic stops every 9 seconds and needs to be reloaded.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] rodrigoreis commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

rodrigoreis commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-765380117


   I'm facing this problem with 2.7.0


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [pulsar] devinbost commented on issue #6054: Catastrophic frequent random topic freezes, especially on high-traffic topics.

Posted by GitBox <gi...@apache.org>.

devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-800868920

I created a test cluster (on fast hardware) specifically for reproducing this issue. In our very simple function flow, using simple Java functions without external dependencies, on Pulsar **2.6.3**, as soon as we started flowing data (around 4k msg/sec at 140KB/msg average), within seconds the bug appeared, blocking up the flow, and causing a backlog to accumulate.