You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/02/20 13:24:24 UTC

[GitHub] [bookkeeper] Vanlightly opened a new issue #2614: Existing fencing not enough to prevent data loss

Vanlightly opened a new issue #2614:
URL: https://github.com/apache/bookkeeper/issues/2614

I have implemented the BookKeeper replication protocol in TLA+. You can find the specification and a readme here: https://github.com/Vanlightly/bookkeeper-tlaplus

The TLA+ model checker found a data loss scenario related to how fencing is used to protect split-brain and data loss.

**Protocol Defect**

The BookKeeper protocol defect found with this spec is that fencing of LAC reads is not enough to prevent committing an add. The below example involves the loss of a fencing request. However, the same is achievable if one fencing LAC req is delivered after a Recovery Read Req, for example due to some network delay and a connection failure between the requests (I have not verified if that is possible though).

Paste into https://sequencediagram.org/ or use link https://sequencediagram.org/index.html#initialData=A4QwTgLglgxloDsIAIDuBGAUKSt4iTQCZtxo5EUAjLHc-QqkuvS5KgZlNwoJQC8A1pgwBaUQA8aALmQBTAAwiiogHxDZAW3CDkIAM7IoCZGDkwA9gDc5YAJ7K1M5ADM5CGHMw01qIrIAbEBhkAF5kUSwxVSZZRW8VVQw4hTQwKAgId0cY-1d3TwTfPKCQ8MjHSU5ZNw8vBAss5GtbNHQAGg18uuR9AFcYT31DCxN0ZAaAEzlkYFbg3QBHPoswPs0AHQRSo0NIrYAxArlJ2QBvGnamAF925ABRJHS5fVkAHjeaZGlVZDPb9hEb6-M6KAGcYF-a6qVQ5ZxmECTeRKPxOPIIpHxVExDiyDHI7zoYqyAByFgAygMABaPCD2bwcYnIMmUmA0p4ObFdGABCz6GYBE4Ac3mhjkmmAEE5RJxslQ6Sa8U4vnQKTSGSyCBEMuSehgggaqEFkxFyK2yAgFmQMBAAUFYEwDSaLTAbU6glkAEF9YbjSKkfKMjNFLt5C43DBoDYAnZkLz9CgDEYUKgk3S+h4QFkkVRY5o5BBEVmQEA

```
participant w1
participant w2
participant b1
participant b2
participant b3
participant zk
w1--xb1: e0
w2->zk: mark as in recovery
w2->b1: fence
b1->w2: lac = -1
w1->b2: e0
b2->w1: e0 written
w2->b2: fence
b2->w2: lac = -1
w2--xb3: fence
note over w1,zk: fence success on 1 node per ack quorum\nlac is -1\nFenced: {b1,b2}, Entries: <<b1 :> {}, b2 :> {e0}, b3 :> {}>>
w2->b1: read e0
w2->b2: read e0
w2->b3: read e0
b1->w2: NoSuchEntry
b3->w2: NoSuchEntry
w2->zk: close ledger as empty
w1->b3: write e0
b3->w1: e0 written
w1->w1: acknowledge e0\n to caller
note over w1,zk: Acknowledged write e0 is effectively lost as it was truncated by metadata
```

Data loss!

The issue is that it is possible to reach AckQuorum even after enough bookies are fenced.

It derives from the combination of a couple of things (using Qw 3, Qa 2 as an example):

1. Ledger recovery (by w2) can begin when an entry is already at Qa-1 and the original writer (w1) is still alive. In this case, w1 only needs one more positive response from a bookie for the entry to be committed.

2. Recovery consists of 3 phases:
1. Fencing (LAC reads)
2. Recovery read and write back
3. Ledger close

Phase 2 begins once one bookie in every ack quorum is fenced. This allows for Phase 2 and 3 to execute when a single bookie has not responded to the LAC read. If say, one LAC read request was lost, this leaves a single bookie unfenced which means that the ensemble as a whole still permits a single write to get through. This single write is enough for w1 to reach Qa after Phase 3 has completed (aka the ledger being closed).

3. Recovery reads succeed as long as (Qw-Qa)+1 bookies do not explicitly respond with a NoSuchEntry/Ledger code. Because it is possible for a single bookie to not be fenced during Phase 2, recovery reads are not monotonic, in the sense that once an answer is received, later information may invalidate that answer. Basically a NoSuchEntry response might later turn into a successful response (if w2 were to ask again) because w1 was able to write to an unfenced bookie.

**The fix**

The fix is to make recovery reads monotonic (in the sense that once a read is negative, it is guaranteed to remain negative). We achieve this by adding fencing to recovery reads also. Once a bookie has responded with a NoSuchEntry code to a recovery read, it will not accept a write for that entry because it is fenced.

For that reason the TLA+ specification includes an extra constant, RecoveryReadsFence, in order to test that. The invariant violation no longer occurs when set to TRUE.

I will be submitting a PR shortly which includes the code change and a unit test.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] Vanlightly commented on issue #2614: Existing fencing not enough to prevent data loss

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on issue #2614:
URL: https://github.com/apache/bookkeeper/issues/2614#issuecomment-797384941


   This has been fixed by #2616


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] Vanlightly closed issue #2614: Existing fencing not enough to prevent data loss

Posted by GitBox <gi...@apache.org>.

Vanlightly closed issue #2614:
URL: https://github.com/apache/bookkeeper/issues/2614


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org