Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/09/07 10:02:14 UTC

[GitHub] [pulsar] zymap opened a new issue, #17516: [Bug] Consumer trying to read a non-exist ledger

zymap opened a new issue, #17516:
URL: https://github.com/apache/pulsar/issues/17516

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Version
   
   2.8.1
   
   ### Minimal reproduce step
   
   Not reproduced yet.
   
   ### What did you expect to see?
   
   Offloaded data should not be lost.
   
   ### What did you see instead?
   
   **The error we saw in the broker was caused by a consumer trying to read a ledger that does not exist.**
   
   ```
   2022-08-16T09:22:04.791520867Z 09:22:04.789 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader - Failed readOffloaded:
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791545871Z java.lang.NullPointerException: null
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791548867Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.DataBlockUtils.lambda$static$0(DataBlockUtils.java:73) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791551588Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.open(BlobStoreBackedReadHandleImpl.java:237) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791554123Z        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreManagedLedgerOffloader.lambda$readOffloaded$3(BlobStoreManagedLedgerOffloader.java:506) ~[?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791556370Z        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
   [pod/production-pulsar-broker-0/production-pulsar-broker] 2022-08-16T09:22:04.791565338Z        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) [com.google.guava-guava-30.1-jre.jar:?]
   ```
   
   ## Details

   We saw that some ledgers were marked as offloaded but were not present in S3.

   For example, we looked at the logs for ledger 1025044. They show that the offload failed, and that the offloader cleaned up the previous offload information and then tried to offload the ledger again. We did not see any log entry for a successful offload.

   **The weird thing is that the logs show the ledger failed to offload, but the metadata shows it was offloaded successfully.**
   
   ![Screen Shot 2022-09-07 at 17 56 04](https://user-images.githubusercontent.com/24502569/188850625-db7a4707-f826-4376-8a1e-50a31c65d84b.png)
   
   In the Pulsar offloader implementation, the offload process has three steps.

   First, prepare the metadata. The offloader generates a UUID for the ledger being offloaded and uses it to derive the file names used in cloud storage. It then persists the offload context (including the UUID and offloader information such as the offloader name, bucket, and so on) into the meta store (ZooKeeper).

   Second, offload the ledger. The offloader uploads the ledger to cloud storage. The index file name is `UUID-ledger-<ledgerId>-index`, and the data file name is `UUID-ledger<ledgerId>-data`.

   Third, mark the offload complete. Once the ledger has been offloaded successfully, the offloader sets `complete` in `LedgerInfo.OffloadContext` to `true` and persists it into the meta store. When that write succeeds, it updates the in-memory ledgers map to the latest state. From then on, any consumer that wants to read the ledger reads it from tiered storage.

   If any step fails, the offloader cleans up the offloaded files and runs the process again.
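
   To make the three steps concrete, here is a minimal in-memory sketch. It is not Pulsar's code: `OffloadStepsSketch`, `OffloadContext`, `failUpload`, and the key helpers are hypothetical stand-ins, the maps replace ZooKeeper and the bucket, and the object names only loosely follow the naming described above.

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   import java.util.UUID;

   // Hypothetical model of the offload steps; none of these types are Pulsar classes.
   class OffloadStepsSketch {
       // Stand-in for the offload context persisted in the meta store.
       static final class OffloadContext {
           final UUID uuid;
           boolean complete; // false until step 3 succeeds
           OffloadContext(UUID uuid) { this.uuid = uuid; }
       }

       final Map<Long, OffloadContext> metaStore = new HashMap<>(); // stand-in for ZooKeeper
       final Set<String> bucket = new HashSet<>();                  // stand-in for cloud storage
       boolean failUpload;                                          // hook to simulate a failed upload

       // Illustrative object names derived from the UUID and ledger id.
       static String indexKey(UUID uuid, long ledgerId) {
           return uuid + "-ledger-" + ledgerId + "-index";
       }

       static String dataKey(UUID uuid, long ledgerId) {
           return uuid + "-ledger-" + ledgerId + "-data";
       }

       // Runs one offload attempt; returns true if all three steps succeeded.
       boolean offload(long ledgerId) {
           // Step 1: generate a UUID and persist the offload context (complete = false).
           UUID uuid = UUID.randomUUID();
           OffloadContext ctx = new OffloadContext(uuid);
           metaStore.put(ledgerId, ctx);

           String data = dataKey(uuid, ledgerId);
           String index = indexKey(uuid, ledgerId);
           try {
               // Step 2: upload the data and index objects.
               bucket.add(data);
               if (failUpload) {
                   throw new IllegalStateException("simulated upload failure");
               }
               bucket.add(index);

               // Step 3: mark the offload complete in the meta store.
               ctx.complete = true;
               return true;
           } catch (RuntimeException e) {
               // Any failure: delete what was uploaded so the caller can retry.
               bucket.remove(data);
               bucket.remove(index);
               return false;
           }
       }

       // A consumer reads from tiered storage only once complete == true.
       boolean readsFromTieredStorage(long ledgerId) {
           OffloadContext ctx = metaStore.get(ledgerId);
           return ctx != null && ctx.complete;
       }
   }
   ```

   In this model a failed attempt leaves the bucket empty and `complete` false, so a retry starts from a clean state. The problem described in this issue arises only when the caller's view of step 3 disagrees with what the meta store actually stored.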
   
   ---
   
   From the logs, we found that ZooKeeper had been unhealthy since 8:00. We then found that when the offload of this ledger reached the third step, it failed with an exception:

   `Caused by: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /managed-ledgers/navision/product/persistent/ordered.product_category.navision_model.events-partition-0`

   After that, the Pulsar offloader could not complete its requests to ZooKeeper. As a result, the offload process was retried again and again, which is why the logs show the ledger being offloaded repeatedly without ever succeeding.

   There is one case in which the client throws `ConnectionLoss` even though the request succeeded: the request reached the server and was applied, but the client connection was closed for some other reason before the response arrived.

   This would explain why the metadata shows offload `complete` as true while the files cannot be found in S3.
   The metadata update succeeded but an error was returned to the offloader, so the offloader cleaned up the offloaded files for the ledger. Consumers then cannot read the messages in that ledger.
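
   The suspected sequence can be reproduced in a small simulation. This is only a sketch under assumptions: `AmbiguousWriteSketch`, `FlakyMetaStore`, `ConnectionLostException`, and `dropNextResponse` are hypothetical names, not Pulsar or ZooKeeper APIs, and the fake store simply applies the write and then throws to imitate a response lost after the server accepted the request.

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   // Hypothetical simulation of the suspected race; not Pulsar or ZooKeeper code.
   class AmbiguousWriteSketch {
       // Models a ConnectionLoss-style error: the caller cannot tell whether the write was applied.
       static class ConnectionLostException extends RuntimeException {
           ConnectionLostException(String message) { super(message); }
       }

       // Fake meta store that can apply a write and then lose the response.
       static class FlakyMetaStore {
           final Map<Long, Boolean> complete = new HashMap<>();
           boolean dropNextResponse;

           void markComplete(long ledgerId) {
               complete.put(ledgerId, true); // "server side": the write is applied
               if (dropNextResponse) {
                   dropNextResponse = false;
                   throw new ConnectionLostException("response lost after the write was applied");
               }
           }
       }

       final FlakyMetaStore metaStore = new FlakyMetaStore();
       final Set<String> bucket = new HashSet<>(); // stand-in for S3

       // Returns true if the offloader believes the offload succeeded.
       boolean offload(long ledgerId) {
           String key = "ledger-" + ledgerId;
           bucket.add(key); // step 2: upload
           try {
               metaStore.markComplete(ledgerId); // step 3: persist complete = true
               return true;
           } catch (ConnectionLostException e) {
               // The offloader treats the error as a failed write and cleans up.
               bucket.remove(key);
               return false;
           }
       }

       // True when the metadata says "offloaded" but the object is gone.
       boolean isDangling(long ledgerId) {
           return metaStore.complete.getOrDefault(ledgerId, false)
                   && !bucket.contains("ledger-" + ledgerId);
       }

       public static void main(String[] args) {
           AmbiguousWriteSketch sketch = new AmbiguousWriteSketch();
           sketch.metaStore.dropNextResponse = true;
           boolean ok = sketch.offload(1025044L);
           System.out.println("offloader saw success: " + ok);
           System.out.println("metadata dangling: " + sketch.isDangling(1025044L));
       }
   }
   ```

   In this model the offloader reports failure and deletes the object, yet the store still records `complete` as true, which matches the state we observed: metadata pointing at tiered storage with nothing in S3 behind it.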
   
   
   
   
   
   
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [pulsar] ethqunzhong commented on issue #17516: [Bug] Consumer trying to read a non-exist ledger

Posted by GitBox <gi...@apache.org>.
ethqunzhong commented on issue #17516:
URL: https://github.com/apache/pulsar/issues/17516#issuecomment-1247952799

   It seems this PR has fixed it.
   https://github.com/apache/pulsar/pull/17056




[GitHub] [pulsar] github-actions[bot] commented on issue #17516: [Bug] Consumer trying to read a non-exist ledger

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #17516:
URL: https://github.com/apache/pulsar/issues/17516#issuecomment-1279873560

   The issue had no activity for 30 days, mark with Stale label.

