Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2020/10/10 00:05:29 UTC

[GitHub] [pulsar] lhotari opened a new issue #8229: Broker enters an infinite loop in ManagedLedgerImpl.asyncReadEntries and consumes 100%+ CPU

lhotari opened a new issue #8229:
URL: https://github.com/apache/pulsar/issues/8229


   Pulsar Broker 2.6.1 enters an infinite loop between these stack frames:
   ```
   org.apache.bookkeeper.mledger.impl.OpReadEntry.checkReadCompletion
   org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.internalReadFromLedger
   org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl.asyncReadEntries
   org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$checkReadCompletion$1
   ```
   
   **To Reproduce**
   There are currently no known steps to reproduce the issue in isolation. It happens in a load test on a cluster of 3 brokers, 3 bookies, and 3 ZooKeeper nodes running on OpenShift/Kubernetes. The load test covers 7500 Pulsar topics that are read with short-lived readers; no subscriptions are used. The topics are created in a namespace with a retention policy of 10080 minutes, and each topic contains up to 5 messages.
   
   **Additional context**
   Pulsar Slack thread with lots of details at https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1602159831227900
   
   **Possible reason**
   
   These are the first few lines of the [OpReadEntry.checkReadCompletion](https://github.com/apache/pulsar/blob/8b7b60e6507ae4a961e66f10e920fc1e3808df41/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java#L133) method:
   ```java
       void checkReadCompletion() {
           if (entries.size() < count && cursor.hasMoreEntries()) {
               // We still have more entries to read from the next ledger, schedule a new async operation
               if (nextReadPosition.getLedgerId() != readPosition.getLedgerId()) {
                   cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
               }
   ```
   
   The usage of `cursor.hasMoreEntries()` seems invalid here.
   
   In the table of data extracted from the heap dump with Eclipse MAT OQL, I can see that `OpReadEntry.readPosition.ledgerId != OpReadEntry.cursor.readPosition.ledgerId` and `OpReadEntry.nextReadPosition.ledgerId != OpReadEntry.cursor.ledger.lastConfirmedEntry.ledgerId` (the cursor's write position).
   
   ![image](https://user-images.githubusercontent.com/66864/95639961-1f89be80-0aa3-11eb-8f90-184bce1d44e3.png)
   
   Therefore, when `cursor.hasMoreEntries()` is called, it should take `OpReadEntry.readPosition.getLedgerId()` into account. Currently the ledgerId is ignored, and when the cursor really does have more entries but in a different ledger, this leads to an infinite loop.
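   To illustrate the suspicion, here is a minimal sketch (invented names, not the actual Pulsar classes): a check that compares a stale cursor position instead of the operation's own `readPosition` keeps answering "more entries" even after the operation has already read past the last confirmed entry.

   ```java
   // Sketch only: invented names, not the actual Pulsar code. Shows how a
   // completion check that consults the wrong read position can loop forever.
   public class CompletionCheckSketch {

       // true while (readLedger, readEntry) is at or before (lcLedger, lcEntry)
       static boolean hasMore(long readLedger, long readEntry,
                              long lcLedger, long lcEntry) {
           if (readLedger != lcLedger) {
               return readLedger < lcLedger;
           }
           return readEntry <= lcEntry;
       }

       public static void main(String[] args) {
           long lcLedger = 10, lcEntry = 4;  // stands in for lastConfirmedEntry
           // The cursor's read position lags behind in an older ledger...
           System.out.println(hasMore(8, 0, lcLedger, lcEntry));   // true
           // ...while the operation's own read position is already past the end.
           System.out.println(hasMore(11, 0, lcLedger, lcEntry));  // false
       }
   }
   ```

   If the completion check asks the first question when it should be asking the second, it will always conclude that another read is needed.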
   
   The code in `checkReadCompletion` has another problem. The call to [`cursor.ledger.startReadOperationOnLedger`](https://github.com/apache/pulsar/blob/e2fbf31ff34f2aa94276ab6103c73cec23849c1b/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1844) might call `opReadEntry.readEntriesFailed`.
   ```java
   PositionImpl startReadOperationOnLedger(PositionImpl position, OpReadEntry opReadEntry) {
           Long ledgerId = ledgers.ceilingKey(position.getLedgerId());
           if (null == ledgerId) {
               opReadEntry.readEntriesFailed(new ManagedLedgerException.NoMoreEntriesToReadException("The ceilingKey(K key) method is used to return the " +
                       "least key greater than or equal to the given key, or null if there is no such key"), null);
           }
   ```
   The problem here is that there is currently no way to signal this failure back to the `checkReadCompletion` method, so execution continues and calls `startReadOperationOnLedger` again later in the same `checkReadCompletion` invocation.
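   A simplified sketch of the signalling gap (hypothetical shape, not the actual `ManagedLedgerImpl` API): because the failure is reported only through the callback and the method gives the caller nothing to test, the caller cannot break out of the retry path. Returning a success flag is one conceivable way to close the gap.

   ```java
   import java.util.concurrent.atomic.AtomicBoolean;

   // Hypothetical sketch, not the actual Pulsar API: the real method reports
   // failure only via opReadEntry.readEntriesFailed(...), so the caller
   // cannot tell that the operation has already failed.
   public class SignalSketch {

       // Returning a success flag lets the caller stop scheduling further reads.
       static boolean startReadOperation(Long ceilingLedgerId, Runnable readEntriesFailed) {
           if (ceilingLedgerId == null) {
               readEntriesFailed.run();  // stands in for opReadEntry.readEntriesFailed(...)
               return false;             // signal the caller to stop
           }
           return true;                  // caller may continue the read
       }

       public static void main(String[] args) {
           AtomicBoolean failed = new AtomicBoolean(false);
           boolean ok = startReadOperation(null, () -> failed.set(true));
           System.out.println(ok);           // false: caller knows not to retry
           System.out.println(failed.get()); // true: failure callback still fired
       }
   }
   ```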
   
   **Possibly related issue**
   #8078 has a similar stack trace in a thread dump screenshot, along with the same symptom of high CPU consumption.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] jiazhai closed issue #8229: Broker enters an infinite loop in ManagedLedgerImpl.asyncReadEntries and consumes 100%+ CPU

jiazhai closed issue #8229:
URL: https://github.com/apache/pulsar/issues/8229


   





[GitHub] [pulsar] hangc0276 commented on issue #8229: Broker enters an infinite loop in ManagedLedgerImpl.asyncReadEntries and consumes 100%+ CPU

hangc0276 commented on issue #8229:
URL: https://github.com/apache/pulsar/issues/8229#issuecomment-706816478


   I will try to locate this issue.





[GitHub] [pulsar] lhotari commented on issue #8229: Broker enters an infinite loop in ManagedLedgerImpl.asyncReadEntries and consumes 100%+ CPU

lhotari commented on issue #8229:
URL: https://github.com/apache/pulsar/issues/8229#issuecomment-709984504


   I've been trying to understand and analyze the code further. It's no longer clear that the problem is in the `cursor.hasMoreEntries()` call.
   
   I added some more columns to the table that I extracted from one of the heap dumps: https://gist.github.com/lhotari/c34db2664b15fc32c9f0e7ae8b37dbda
   
   The goal is that, given the state of the variables in the table, I can manually read the code and debug it as I would have done if I had had the chance to attach a debugger while the broker was in the problematic state.
   
   For example, I added `cursor.ledger.currentLedger.lastAddConfirmed`, which is relevant state for understanding how the execution keeps looping. It is read on this line: https://github.com/apache/pulsar/blob/e2fbf31ff34f2aa94276ab6103c73cec23849c1b/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1566. All rows in the extracted table have -1 as `cursor.ledger.currentLedger.lastAddConfirmed`; therefore, the execution will always take this branch: https://github.com/apache/pulsar/blob/e2fbf31ff34f2aa94276ab6103c73cec23849c1b/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1569
   Within that branch, `opReadEntry.checkReadCompletion()` gets called again and the infinite loop continues.
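   The cycle I believe I am seeing can be sketched like this (invented names; a loop cap is added only so the sketch terminates, whereas the real broker has no such cap and spins at 100%+ CPU):

   ```java
   // Hypothetical sketch of the suspected cycle: with the current ledger's
   // lastAddConfirmed == -1 there is nothing to hand back, so the read path
   // immediately re-runs the completion check, which still believes more
   // entries exist and schedules another read, and so on.
   public class LoopSketch {
       static int iterations = 0;
       static final int CAP = 1000; // not present in the broker; added so the sketch ends

       static void checkReadCompletion(long lastAddConfirmed) {
           boolean believesMoreEntries = true; // the flawed "more entries" answer
           if (believesMoreEntries && iterations < CAP) {
               iterations++;
               internalReadFromLedger(lastAddConfirmed);
           }
       }

       static void internalReadFromLedger(long lastAddConfirmed) {
           if (lastAddConfirmed == -1) {
               // no confirmed entries in the current ledger: loop straight back
               checkReadCompletion(lastAddConfirmed);
           }
       }

       public static void main(String[] args) {
           checkReadCompletion(-1L);
           System.out.println(iterations); // reaches the cap; the cycle never breaks on its own
       }
   }
   ```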
   

