Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/02/27 09:33:02 UTC

[GitHub] [bookkeeper] Vanlightly commented on issue #2612: Wrong ReadLastAddConfirmed logic that can lead to data loss in client applications

Vanlightly commented on issue #2612:
URL: https://github.com/apache/bookkeeper/issues/2612#issuecomment-787043790

@ivankelly and I have discussed this general scenario recently, where a bookie is wiped clean and brought back with the same identity. This introduces arbitrary failures (bookies giving wrong answers), which the current logic is not designed to handle. The current logic assumes any explicit positive or negative response to be trustworthy and acts accordingly.

If this scenario of a wiped clean bookie can occur, then we need to update the logic to either handle arbitrary failures (tricky to do well) or prevent arbitrary failures from happening. The approach that Ivan has been working on is the latter and is related to making running without the WAL more resilient.

The idea is that you can configure bookies to run a pre-boot check where they identify all open ledgers and mark them as being in a limbo state, because there is a risk that the bookie could have received writes that it has now lost. The bookie performs a recovery and close on the ledger, then clears the limbo status. While a ledger is in this limbo status, reads of any type are treated differently, as follows:

- When the requested entry exists, the bookie responds with a positive read response as normal.
- When the requested entry is not found, the bookie responds with an "unknown" response (EUNKNOWN).
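
The two rules above can be sketched as follows. This is only an illustration of the read path for a ledger in limbo, not the actual bookie code: `LimboReadSketch`, `LedgerState`, `ReadResponse` and the `ENOENTRY` code name are made up for the example, and only EUNKNOWN comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class LimboReadSketch {
    // Illustrative response codes; only EUNKNOWN is named in this comment.
    enum ReadCode { OK, ENOENTRY, EUNKNOWN }

    static final class ReadResponse {
        final ReadCode code;
        final byte[] payload; // null unless code == OK

        ReadResponse(ReadCode code, byte[] payload) {
            this.code = code;
            this.payload = payload;
        }
    }

    // Hypothetical per-ledger state held by a bookie.
    static final class LedgerState {
        final Map<Long, byte[]> entries = new HashMap<>();
        boolean inLimbo; // set by the pre-boot check, cleared after recovery
    }

    static ReadResponse read(LedgerState ledger, long entryId) {
        byte[] data = ledger.entries.get(entryId);
        if (data != null) {
            // The entry exists: answer positively, limbo or not.
            return new ReadResponse(ReadCode.OK, data);
        }
        // The entry is missing. In limbo the bookie cannot tell "never written"
        // apart from "written and then lost", so it refuses to give a negative.
        ReadCode code = ledger.inLimbo ? ReadCode.EUNKNOWN : ReadCode.ENOENTRY;
        return new ReadResponse(code, null);
    }
}
```

Once the limbo flag is cleared, the same missing entry yields an ordinary explicit negative again.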

The result is that arbitrary failures are avoided. When a client receives an EUNKNOWN, it does not treat it as a valid negative response, so it cannot cause this data loss scenario. Limbo is a temporary status that is cleared as soon as the bookie is able to complete ledger recovery. This work is largely complete already.
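
To illustrate the client side, here is a rough sketch of how a recovery read might tally responses for a single entry. It is a simplification under stated assumptions, not the real client logic: the class, the enums and the `decide` method are hypothetical, and the threshold used (an entry is ruled out only once `writeQuorum - ackQuorum + 1` bookies give an explicit negative) is my shorthand for "no ack quorum could hold the entry".

```java
import java.util.List;

public class RecoveryReadSketch {
    // Illustrative response codes; only EUNKNOWN is named in this comment.
    enum ReadCode { OK, ENOENTRY, EUNKNOWN }

    enum Decision { ENTRY_EXISTS, ENTRY_ABSENT, UNDECIDED }

    // Decide the fate of one entry from the responses received so far.
    static Decision decide(List<ReadCode> responses, int writeQuorum, int ackQuorum) {
        int explicitNegatives = 0;
        for (ReadCode code : responses) {
            if (code == ReadCode.OK) {
                // Any positive response proves the entry exists.
                return Decision.ENTRY_EXISTS;
            }
            if (code == ReadCode.ENOENTRY) {
                explicitNegatives++;
            }
            // EUNKNOWN is deliberately ignored: it is neither proof of
            // existence nor a trustworthy negative.
        }
        // Assumed rule: enough explicit negatives that no ack quorum
        // could have stored the entry.
        int negativesNeeded = writeQuorum - ackQuorum + 1;
        return explicitNegatives >= negativesNeeded
                ? Decision.ENTRY_ABSENT
                : Decision.UNDECIDED;
    }
}
```

With a write quorum of 3 and an ack quorum of 2, one genuine negative plus one EUNKNOWN from a wiped bookie leaves the entry undecided, whereas a wiped bookie answering with a plain negative would have tipped the decision to "absent".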

I think it's worth considering the general approach to arbitrary failures: whether the protocol should be designed to work correctly in the face of them, or whether we put in safeguards to prevent arbitrary failures from occurring. Avoidance seems to be the best route. This is mostly Ivan's work but I fully agree with his conclusions.
