You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/05/20 23:11:43 UTC

[GitHub] [bookkeeper] devinbost opened a new issue #2716: High volume of PendingAddOp write failures due to timeouts

devinbost opened a new issue #2716:
URL: https://github.com/apache/bookkeeper/issues/2716


   I noticed that we were getting a large volume of these messages:
   
   `DEBUG org.apache.bookkeeper.client.LedgerHandle - pending add not completed: PendingAddOp(lid:3510117, eid:27, completed:false)`
   It seemed like they were getting caused by timeouts, so I checked the log for how many timeouts we were seeing:
   ```
   $ cat /var/log/messages | grep 'Bookie operation timeout' | wc -l
   504184
   ```
   That's over half a million timeouts over 2 hours on a test cluster, and I don't think I even sent that many messages to Pulsar that would have resulted in these `PendingAddOp`s.
   
   I wanted to see the full history of actions on an entryID to get an understanding of the behavior, so I did a grep for it in the log.
   Here's what appeared:
   
   ```
   DEBUG org.apache.bookkeeper.client.PendingAddOp - Unsetting success for ledger: 3510077 entry: 12098 bookie index: 1
   DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.29/10.20.69.29:3181 for ledger 3510077, entry 12098
   DEBUG org.apache.bookkeeper.client.PendingAddOp - Write did not succeed: 3510077, 12098. But we have already fixed it.
   DEBUG org.apache.bookkeeper.proto.PerChannelBookieClient - Could not write Add request to bookie 10.20.69.37/10.20.69.37:3181 for ledger 3510077, entry 12098
   WARN  org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (3510077, 12098): Bookie operation timeout
   . . .
   [ Repeats a ton of times ]
   ```
   
   It looks like we're returning a success callback [in LedgerHandle](https://github.com/apache/bookkeeper/blob/87579b0a9f18833ee41fcae37582bb68606d68e7[…]er/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java)
   but it seems like the future is failing, and then bookkeeper seems to be repeatedly trying to resolve the issue and timing out without reporting the problem to the client. 
   
   This is related to https://github.com/apache/pulsar/issues/6054
   (In that case, it appears that these PendingAddOp timeouts are causing the broker to not ack the messages.)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org