
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/05/05 14:25:20 UTC

[GitHub] [bookkeeper] Vanlightly opened a new pull request #2706: BP-44: Running without journal proposal

Vanlightly opened a new pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706


   Includes the BP-44 design proposal markdown document.
   
   Master Issue: #2705


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630162948



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.

Review comment:
       For an acknowledged entry to be lost, every single bookie that received the entry (the write quorum in the normal case, the ack quorum in some edge cases) would need to be terminated simultaneously, before it could flush to disk. This gives us the same level of safety as Apache Kafka, which relies on the page cache. A DC-wide power loss would be an example.
   
   Basically it would require a correlated failure which is uncommon but does happen. The chance of an uncorrelated failure (two random servers dying at the same time for example) leading to data loss is extremely low.
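
The quorum arithmetic quoted in the proposal can be made concrete with a small sketch. The Python below is illustrative only: `Resp`, `quorum_coverage`, `ensemble_coverage`, and `recovery_read_outcome` are hypothetical names, not BookKeeper APIs, and the classification rules are taken from the "Recovery read" bullets of the proposal.

```python
from enum import Enum


class Resp(Enum):
    OK = "ok"                        # positive
    NO_SUCH_ENTRY = "no_such_entry"  # explicit negative
    UNKNOWN = "unknown"              # any other non-success response


def quorum_coverage(write_quorum, ack_quorum):
    # QC = (WQ - AQ) + 1, cohort is the writeset of one entry
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble, ack_quorum):
    # EC = (E - AQ) + 1, cohort is the ensemble of the current fragment
    return (ensemble - ack_quorum) + 1


def recovery_read_outcome(responses, write_quorum, ack_quorum):
    """Classify a recovery read from the responses received so far."""
    qc = quorum_coverage(write_quorum, ack_quorum)
    positives = responses.count(Resp.OK)
    negatives = responses.count(Resp.NO_SUCH_ENTRY)
    unknowns = responses.count(Resp.UNKNOWN)
    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    # Cannot make progress: QC unknowns, or every bookie has answered
    # without reaching either threshold above.
    if unknowns >= qc or len(responses) == write_quorum:
        return "unknown"
    return "pending"
```

With WQ=3 and AQ=2, QC is 2, so two `NoSuchEntry` responses are enough to mark an entry unrecoverable, which is the situation in Scenario 2. If one of those two responses is an unknown instead, the read cannot be declared negative.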





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678087429



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot

Review comment:
       The bookie fences the ledger locally. Only non-closed ledgers can ever be fenced. The writer may already have chosen another bookie, leaving this bookie outside of the current ensemble, but this does not pose any consistency issues. The only consistency threat is the bookie saying it doesn't have an entry that it once did.
   
   Data loss is losing acknowledged entries. Entries are acknowledged as normal; we simply don't write them to the journal. This means data may only exist in the write cache upon acknowledgement.
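
A toy model may help show why acknowledging from the write cache matters. The sketch below assumes a bookie running with `journalWriteData` set to `false`; `ToyBookie` and all of its methods are hypothetical and do not correspond to BookKeeper classes. It only models two facts from the discussion: unflushed entries and fence statuses vanish on a crash, and local fencing of non-closed ledgers after an unclean restart stops a stale writer.

```python
class ToyBookie:
    """Toy model only (hypothetical, not BookKeeper code)."""

    def __init__(self):
        self.flushed = {}              # (ledger_id, entry_id) -> payload, survives a crash
        self.write_cache = {}          # acknowledged but unflushed, lost on a crash
        self.fenced = set()            # fence statuses that reached disk
        self.fenced_unflushed = set()  # fence statuses still only in memory

    def add_entry(self, ledger_id, entry_id, payload):
        if ledger_id in self.fenced or ledger_id in self.fenced_unflushed:
            return "fenced"
        # No journal write: the entry is acknowledged from the write cache.
        self.write_cache[(ledger_id, entry_id)] = payload
        return "ok"

    def fence(self, ledger_id):
        self.fenced_unflushed.add(ledger_id)

    def flush(self):
        self.flushed.update(self.write_cache)
        self.write_cache.clear()
        self.fenced |= self.fenced_unflushed
        self.fenced_unflushed.clear()

    def crash_and_restart(self, open_ledgers=()):
        # Everything that was not flushed is gone.
        self.write_cache.clear()
        self.fenced_unflushed.clear()
        # Protection step: after an unclean shutdown, fence the
        # non-closed ledgers locally before serving requests.
        self.fenced.update(open_ledgers)
```

Calling `crash_and_restart()` with no arguments reproduces the problem in Scenario 1 (a previously fenced ledger accepts writes again), while passing the non-closed ledgers keeps them fenced.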





[GitHub] [bookkeeper] eolivelli commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627106046



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).

Review comment:
       good
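
The limbo rule in the quoted bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions: `read_response`, `store`, and `limbo_ledgers` are hypothetical names, and only the `EUNKNOWN` code and the `NoSuchEntry` response come from the proposal.

```python
def read_response(ledger_id, entry_id, store, limbo_ledgers):
    """Answer a read; never give an explicit negative for a limbo ledger."""
    if (ledger_id, entry_id) in store:
        return "OK"
    if ledger_id in limbo_ledgers:
        # The entry may have been lost in the crash, so the bookie
        # cannot truthfully claim that it never existed.
        return "EUNKNOWN"
    return "NoSuchEntry"
```

A limbo ledger still serves the entries it has, so positive reads are unaffected; only the explicit negative is replaced by an unknown.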

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following two equivalent statements; only the cohorts differ:
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message arrives at B1 & B2 and is acknowledged by both. The message to B3 is lost.
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so it continues with the read/write phase of recovery, finds that E1 is the last entry of the ledger, and commits the endpoint of the ledger in ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal: the bookie crashes and loses the most recent entries and fence statuses that had not yet been written and synced to disk.
+- The bookie is restarted with one or more disks empty, through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the BookKeeper replication protocol, but their purpose is often unclear.
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error, which currently requires human intervention.
+
+### Proposed Changes
+
+The proposed changes involve:
+- Adding a new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repairing data loss
+- Automatically fixing cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is a check for an unclean shutdown (that is, a crash or abrupt termination of the bookie). When an unclean shutdown is detected, further measures are taken to prevent data inconsistency.
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected, the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; every explicit negative is converted to an unknown (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Clearing limbo: Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. From ZK, fetch the metadata for all ledgers in the cluster that have this bookie as a member of their ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.

Review comment:
       good
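
The part of the boot-up sequence quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: `BootSketch`, `LedgerState` and the method names are hypothetical, the index is modelled as an in-memory map, and only the steps visible in the quoted hunk are covered (phase one, and the first step of phase two).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BootSketch {
    private BootSketch() {}

    /** Per-ledger flags that a bookie index might track (hypothetical model). */
    static final class LedgerState {
        final boolean closedInMetadata; // taken from the ledger metadata fetched from ZK
        boolean fenced;
        boolean inLimbo;

        LedgerState(boolean closedInMetadata) {
            this.closedInMetadata = closedInMetadata;
        }
    }

    /**
     * Phase one: if cookie validation failed or an unclean shutdown was detected,
     * mark every non-closed ledger as fenced and in-limbo. Otherwise do nothing.
     */
    static void phaseOne(Map<Long, LedgerState> index, boolean cookieValid, boolean uncleanShutdown) {
        if (cookieValid && !uncleanShutdown) {
            return; // no possible data loss detected
        }
        for (LedgerState state : index.values()) {
            if (!state.closedInMetadata) {
                state.fenced = true;
                state.inLimbo = true;
            }
        }
        // Updating the cookie after a cookie failure is omitted from this sketch.
    }

    /** Phase two, first step only: the in-limbo ledgers that must be opened and recovered. */
    static List<Long> ledgersToRecover(Map<Long, LedgerState> index) {
        List<Long> ledgerIds = new ArrayList<>();
        for (Map.Entry<Long, LedgerState> entry : index.entrySet()) {
            if (entry.getValue().inLimbo) {
                ledgerIds.add(entry.getKey());
            }
        }
        return ledgerIds;
    }
}
```

Closed ledgers are left untouched in phase one because, per the quoted text, only non-closed ledgers are marked fenced and in-limbo.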





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r638543790



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### Proposed Changes
+
+The proposed changes involve:
+- Adding a new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repairing data loss
+- Automatically fixing cookies
+

Review comment:
       As far as I know from my research into BookKeeper design decisions, Entry Log Per Ledger (ELPL) can create problems with high volumes of active ledgers, so it is not a universal solution. Perhaps @ivankelly can comment on that, as he knows more than I do on that subject.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630158077



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following two equivalent statements; only the cohorts differ:
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.

Review comment:
       The writeset of an entry is the set of bookies that should host that entry, whose cardinality is the write quorum. A fragment is commonly referred to as an ensemble, though that word gets a little overloaded. Concretely a fragment is one kv entry in the ledger ensemble metadata, where the key is the first entry id of the fragment, and the value is the ensemble of bookies responsible for that range. The range is bounded by the key (first entry) and either the next fragment (exclusive) or the end of the ledger (inclusive). That range is contiguous.
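
As a rough illustration of those definitions, the sketch below models the ledger's ensemble metadata as a sorted map from first entry id to ensemble, looks up the fragment that covers an entry, derives the entry's writeset, and computes the QC and EC thresholds. All names (`FragmentSketch` and its methods) are hypothetical. The round-robin striping used to pick the writeset is an assumption of this sketch and is not stated in the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FragmentSketch {
    private FragmentSketch() {}

    /**
     * Returns the ensemble of the fragment that contains entryId: the map entry
     * with the greatest key (first entry id of a fragment) that is <= entryId.
     */
    static List<String> ensembleFor(TreeMap<Long, List<String>> fragments, long entryId) {
        Map.Entry<Long, List<String>> fragment = fragments.floorEntry(entryId);
        if (fragment == null) {
            throw new IllegalArgumentException("no fragment covers entry " + entryId);
        }
        return fragment.getValue();
    }

    /**
     * Writeset of an entry: writeQuorum bookies from the ensemble.
     * Assumption: entries are striped round-robin, starting at entryId mod ensemble size.
     */
    static List<String> writeSet(List<String> ensemble, int writeQuorum, long entryId) {
        List<String> bookies = new ArrayList<>(writeQuorum);
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add(ensemble.get((int) ((entryId + i) % ensemble.size())));
        }
        return bookies;
    }

    /** QC = (WQ - AQ) + 1, with the writeset of one entry as the cohort. */
    static int quorumCoverage(int writeQuorum, int ackQuorum) {
        return (writeQuorum - ackQuorum) + 1;
    }

    /** EC = (E - AQ) + 1, with the ensemble of the current fragment as the cohort. */
    static int ensembleCoverage(int ensembleSize, int ackQuorum) {
        return (ensembleSize - ackQuorum) + 1;
    }
}
```

For the e3:w3:a2 ledger used in the scenarios, both thresholds come out at 2, which is why two successful fencing responses are enough to reach EC in Scenario 1.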





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628057521



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### Proposed Changes
+
+The proposed changes involve:
+- Adding a new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repairing data loss
+- Automatically fixing cookies
+

Review comment:
       Another topic is that of plugging gaps in a ledger that has experienced entry loss. If an entry is lost, readers are blocked at that point. On our radar is a command to plug holes in the ledger with some kind of no-op entry that the client skips. This would allow continued availability of the ledger even though it has experienced data loss. Such a mechanism is not included in this BP, though.





[GitHub] [bookkeeper] eolivelli commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628189710



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       @dlg99 I see your point. It makes sense.
   
   +1 for adding a configuration option, disabled by default.
   
   But can't we keep the cookie part out of this BP? This BP is about running without the journal, not about running on ephemeral disks.
   
   For the cookie part we do not need a BP; you can simply send a PR. I imagine it will be a simple patch.





[GitHub] [bookkeeper] fpj commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

fpj commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630323674



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.

Review comment:
       It sounds right to me that it provides Kafka-like default semantics. For me, being fast while guaranteeing durability is an advantage of BK. Weakening durability is not desirable from my perspective.





[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

ivankelly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627221618



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand-in-hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:

Review comment:
       There are two kinds of I/O we need to concern ourselves with here.
   * The check for whether the bookie has the entries it should have. This check runs against the index, so should not impact other traffic.
   * The copying of missing entries. In the common case, the number of entries to be copied should be a rounding error in terms of I/O. The only case where it would be significant is if the disks have been wiped and the bookie is trying to reconstruct the full contents. In this case, I agree it may make sense to make the bookie read-only. 
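
The two steps can be sketched as follows. This is a toy model under stated assumptions, not the bookie's actual code: the index is modeled as a set of entry ids, the local store and each peer as a dict from entry id to bytes, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_missing_entries(index: Set[int], first_id: int, last_id: int) -> List[int]:
    """Check the index only: which entry ids in [first_id, last_id] are absent?"""
    return [i for i in range(first_id, last_id + 1) if i not in index]


def copy_missing_entries(
    index: Set[int],
    store: Dict[int, bytes],
    peers: Iterable[Dict[int, bytes]],
    missing: Iterable[int],
) -> int:
    """Copy each missing entry from the first peer that has it; return the count copied."""
    peer_list = list(peers)
    copied = 0
    for entry_id in missing:
        for peer in peer_list:
            if entry_id in peer:
                store[entry_id] = peer[entry_id]
                index.add(entry_id)
                copied += 1
                break
    return copied
```

The first function touches only the index, which is why it should not compete with regular traffic. The cost of the second scales with the number of missing entries, so it only becomes significant in the wiped-disk case.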





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628004798



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
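
As an illustration of the quorum definitions in the quoted section, the following minimal Python sketch computes QC and EC for the e3:w3:a2 ledger used in Scenario 1. The function names are hypothetical and are not BookKeeper APIs; only the two formulas come from the proposal text.

```python
def quorum_coverage(write_quorum: int, ack_quorum: int) -> int:
    """QC = (WQ - AQ) + 1; the cohort is the writeset of a single entry."""
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble_size: int, ack_quorum: int) -> int:
    """EC = (E - AQ) + 1; the cohort is the ensemble of the current fragment."""
    return (ensemble_size - ack_quorum) + 1


def leaves_no_ack_quorum(cohort_size: int, ack_quorum: int, satisfied: int) -> bool:
    """True when the bookies that have NOT satisfied the property
    are too few to form an ack quorum on their own."""
    return cohort_size - satisfied < ack_quorum


# Ledger configuration from Scenario 1: e3:w3:a2.
e, wq, aq = 3, 3, 2
qc = quorum_coverage(wq, aq)
ec = ensemble_coverage(e, aq)
print(f"QC={qc}, EC={ec}")
```

With this configuration both thresholds are 2, which is why two fencing acknowledgements (B1 and B2) are enough in Scenario 1 and two NoSuchEntry responses are enough to close the ledger in Scenario 2.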

Review comment:
       Agreed. Automatic cookie rewriting should be something that can be enabled or disabled through configuration. If the admin has it disabled and cookie validation fails then, as @ivankelly suggested, it would be good if the admin could run a command to resolve the cookie mismatch and execute the data repair.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627173665



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):

Review comment:
       Yes, you are right. For regular reads there is only positive or negative, where negative means that all read responses are in and none of them is positive. We don't need to differentiate between explicit negative and unknown for these reads, as they are not used for consistency decisions.
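
A minimal Python sketch of that rule for regular reads, assuming responses are represented as plain strings and that a `pending` state is reported while responses are still outstanding (both are assumptions for illustration, not BookKeeper client code):

```python
def regular_read_outcome(responses, write_quorum):
    """Classify a regular (non-recovery) read.

    responses: response codes received so far, e.g. "OK", "NoSuchEntry",
    or any other error string. Explicit negatives and unknown errors are
    deliberately not distinguished here.
    """
    if any(r == "OK" for r in responses):
        return "positive"      # one positive response is enough
    if len(responses) == write_quorum:
        return "negative"      # all responses are in, none positive
    return "pending"           # hypothetical state: still waiting


print(regular_read_outcome(["NoSuchEntry", "OK"], 3))
print(regular_read_outcome(["NoSuchEntry", "timeout", "NoSuchEntry"], 3))
```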





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627799356



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
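
The recovery read rules from the quoted "A note on Quorums" section, and the effect of replying with an unknown code instead of an explicit negative, can be sketched in Python as follows. This is an illustrative sketch that follows the thresholds as written in the proposal text; the names and the `pending` state are assumptions and do not correspond to BookKeeper client code.

```python
from enum import Enum


class Resp(Enum):
    OK = "ok"
    NEGATIVE = "negative"   # explicit negative: NoSuchEntry / NoSuchLedger
    UNKNOWN = "unknown"     # any other non-success response


def recovery_read_outcome(responses, write_quorum, ack_quorum):
    """Classify a recovery read from the responses received so far."""
    qc = (write_quorum - ack_quorum) + 1
    positives = sum(r is Resp.OK for r in responses)
    negatives = sum(r is Resp.NEGATIVE for r in responses)
    unknowns = sum(r is Resp.UNKNOWN for r in responses)

    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    if unknowns >= qc or len(responses) == write_quorum:
        return "unknown"        # cannot make progress
    return "pending"            # hypothetical: keep waiting for responses


# Scenario 2 (w3:a2): B1 lost E0 and answers NoSuchEntry, as does B2.
print(recovery_read_outcome([Resp.NEGATIVE, Resp.NEGATIVE], 3, 2))

# Same scenario if B1 replies with an unknown code instead.
print(recovery_read_outcome([Resp.UNKNOWN, Resp.NEGATIVE], 3, 2))
```

In the second call the QC negative threshold is no longer reached after two responses, so the recovery client cannot close the ledger as empty on the strength of B1's answer.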

Review comment:
       Did I understand correctly that LAC behavior does not change, except that readLAC can return an entryId that hasn't been fsynced to disk yet?
   
   What happens in the case of a coordinated restart of ES bookies (or the whole DC, fwiw)?





[GitHub] [bookkeeper] merlimat commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

merlimat commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630340440



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
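
The limbo rule in the quoted protection mechanism can be sketched on the bookie side as below. This is a minimal Python illustration, assuming a hypothetical `bookie_read_response` helper and string response codes; only the `EUNKNOWN` code name and the conversion rule come from the proposal text.

```python
OK = "OK"
NO_SUCH_ENTRY = "NoSuchEntry"
EUNKNOWN = "EUNKNOWN"   # new code named in the proposal


def bookie_read_response(entry_found: bool, ledger_in_limbo: bool) -> str:
    """Response a bookie would give to a read request.

    A limbo ledger still serves the entries it has, but an explicit
    negative is converted to an unknown, because the entry may have
    been lost in the unclean shutdown.
    """
    if entry_found:
        return OK
    return EUNKNOWN if ledger_in_limbo else NO_SUCH_ENTRY


print(bookie_read_response(entry_found=False, ledger_in_limbo=True))
print(bookie_read_response(entry_found=False, ledger_in_limbo=False))
```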

Review comment:
       Durable & fast writes are one of the biggest advantages of BK. 
   
   Having said that, the impact of a small amount of data loss varies a lot, depending on the use case and the nature of the data. In some cases, it would make sense to have the option for a less durable mode if that means a reduction in hardware cost. 
   
   Another consideration is the usefulness of the journal when using locally attached disks in cloud VMs. Since the volume is going to be lost when the VM fails, the journal and the fsync on it will not give the same guarantees as on a bare metal deployment, where the data is safe unless there's a mechanical failure on the disk. 
   
   





[GitHub] [bookkeeper] jvrao commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

jvrao commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r637594482



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 

Review comment:
       With the current mechanism, an entry is flushed to disk before the bookie responds. So even if it loses unflushed operations, it should not lose either E1 or the fencing request.
   Also, at least at Salesforce, we won't let the bookie come up if all disks are cleared.
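
For anyone following the thread, here is a minimal sketch of Scenario 1 as the proposal describes it. It is a toy in-memory model, not BookKeeper code: the `Bookie` class, `crash_losing_state` and `run_scenario_1` are hypothetical names, and the crash is assumed to wipe both the fence status and the stored entries on B2, which is exactly the assumption the comment above questions for the journaled case.

```python
class Bookie:
    """Toy bookie: tracks fenced ledgers and stored entries in memory only."""

    def __init__(self, name):
        self.name = name
        self.fenced = set()    # ledger ids fenced on this bookie
        self.entries = set()   # (ledger_id, entry_id) pairs stored

    def fence(self, ledger_id):
        self.fenced.add(ledger_id)
        return True

    def add(self, ledger_id, entry_id):
        if ledger_id in self.fenced:
            return False       # a fenced ledger rejects adds
        self.entries.add((ledger_id, entry_id))
        return True

    def crash_losing_state(self):
        # Assumption: models undetected loss of fence status and entries.
        self.fenced.clear()
        self.entries.clear()


def run_scenario_1():
    ensemble_size, ack_quorum = 3, 2           # e3:w3:a2
    ec = (ensemble_size - ack_quorum) + 1      # Ensemble Coverage from the proposal
    b1, b2, b3 = Bookie("B1"), Bookie("B2"), Bookie("B3")

    for b in (b1, b2, b3):
        b.add("L1", 1)                         # C1 writes E1 to all bookies

    # C2 fences L1; the message to B3 is lost, so only B1 and B2 ack.
    fence_acks = sum(b.fence("L1") for b in (b1, b2))
    recovery_proceeds = fence_acks >= ec

    b2.crash_losing_state()                    # B2 forgets that L1 is fenced

    # C1 wakes up and writes E2; B1 rejects, B2 and B3 accept.
    add_acks = sum(b.add("L1", 2) for b in (b1, b2, b3))
    return recovery_proceeds, add_acks >= ack_quorum


print(run_scenario_1())  # (True, True): ledger closed at E1, yet E2 is acked
```

If B2 keeps its fence status across the restart, the second value becomes `False`, which is the behaviour the current journaled path is expected to give.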

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot

Review comment:
       Do you consider the loss of unflushed data to be data loss? Without the journal, do we ack without flushing to disk?
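
To illustrate why the answer matters, here is a minimal sketch of the recovery-read rules quoted in the hunk above (AQ positives make an entry recoverable, QC explicit negatives make it unrecoverable). The function name, the `Resp` enum, the `"pending"` state and the order in which the thresholds are checked are assumptions made for illustration, not the BookKeeper client implementation.

```python
from enum import Enum


class Resp(Enum):
    OK = "ok"
    NO_SUCH_ENTRY = "no_such_entry"   # explicit negative
    UNKNOWN = "unknown"               # any other non-success response


def recovery_read_outcome(responses, write_quorum, ack_quorum):
    """Classify a recovery read from the responses received so far."""
    qc = (write_quorum - ack_quorum) + 1          # Quorum Coverage
    positives = sum(r is Resp.OK for r in responses)
    negatives = sum(r is Resp.NO_SUCH_ENTRY for r in responses)
    unknowns = sum(r is Resp.UNKNOWN for r in responses)

    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    if unknowns >= qc or len(responses) == write_quorum:
        return "unknown"                          # cannot make progress
    return "pending"                              # keep waiting for responses


# Scenario 2: E0 was acked by B1 and B3, then B1 lost it in a crash.
# B1 and B2 both answer NoSuchEntry, which reaches QC = 2 for w3:a2.
print(recovery_read_outcome(
    [Resp.NO_SUCH_ENTRY, Resp.NO_SUCH_ENTRY], write_quorum=3, ack_quorum=2))
# unrecoverable
```

If an ack is sent before the entry is durable, a single crash can turn a bookie that held the entry into one that reports an explicit negative, and the second threshold is then reached for an entry that was already acknowledged.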

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:

Review comment:
       > This check runs against the index,
   
   The index file on disk might also not have been flushed, right? Unless you are talking about a comparison with the ZK metadata.
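
As a reference point for this discussion, the limbo rule from the "Protection Mechanism" section quoted above (explicit negatives are converted to unknowns for limbo ledgers) could be sketched roughly as follows. The function name and the plain sets standing in for the bookie's storage and limbo state are hypothetical. `EUNKNOWN` is the new code named in the proposal.

```python
OK = "OK"
NO_SUCH_ENTRY = "NoSuchEntry"
EUNKNOWN = "EUNKNOWN"   # new response code named in the proposal


def read_response(stored_entries, limbo_ledgers, ledger_id, entry_id):
    """Return the response code a bookie would send for a read.

    stored_entries: set of (ledger_id, entry_id) pairs the bookie can find
    limbo_ledgers:  set of ledger ids currently in limbo on this bookie
    """
    if (ledger_id, entry_id) in stored_entries:
        return OK                    # limbo ledgers can still serve reads
    if ledger_id in limbo_ledgers:
        return EUNKNOWN              # never an explicit negative while in limbo
    return NO_SUCH_ENTRY


# After an unclean shutdown, L1 is in limbo and E0 cannot be found locally.
print(read_response(set(), {"L1"}, "L1", 0))   # EUNKNOWN
print(read_response(set(), set(), "L1", 0))    # NoSuchEntry
```

Under this rule a missing or stale index entry for a limbo ledger produces an unknown rather than a negative, so the question of whether the index itself was flushed affects availability more than safety, assuming the ledger was correctly placed in limbo in the first place.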

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)

Review comment:
       Adding the ensemble (E) to the list would make it complete.
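
For reference, a small sketch of the two coverage formulas from the list above, with E as an explicit input, plus a brute-force check of the property they are meant to guarantee (at least one bookie from every possible ack quorum within the cohort). The function names are illustrative only.

```python
from itertools import combinations


def quorum_coverage(write_quorum, ack_quorum):
    """QC = (WQ - AQ) + 1, cohort is the writeset of one entry."""
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble_size, ack_quorum):
    """EC = (E - AQ) + 1, cohort is the ensemble of the current fragment."""
    return (ensemble_size - ack_quorum) + 1


def covers_every_ack_quorum(cohort_size, ack_quorum, satisfied):
    """True if ANY `satisfied` bookies intersect EVERY possible ack quorum."""
    cohort = range(cohort_size)
    return all(
        set(group) & set(aq)
        for group in combinations(cohort, satisfied)
        for aq in combinations(cohort, ack_quorum)
    )


# e3:w3:a2 as used in the scenarios
print(quorum_coverage(3, 2), ensemble_coverage(3, 2))   # 2 2
print(covers_every_ack_quorum(3, 2, 2))                 # True
print(covers_every_ack_quorum(3, 2, 1))                 # False

# e5:a2: four fencing acks are needed, three leave an unfenced ack quorum
print(ensemble_coverage(5, 2))                          # 4
print(covers_every_ack_quorum(5, 2, 3))                 # False
```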

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.

Review comment:
       Depending on how exactly you detect an unclean shutdown, the check may not tell you whether data loss actually happened. Fencing all ledgers may be a big hammer.

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. L1 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirms. W1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. W1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. W2 closes the ledger as empty. Losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger

Review comment:
       Fencing requests go to the journal, right? So they should have been persisted. What am I missing here?

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. L1 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirms. W1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. W1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)

Review comment:
       You mean the data that was flushed to disk is lost here, right? So real data loss.
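
   To make the quoted Scenario 2 concrete, here is a minimal sketch of the recovery-read thresholds as the proposal text above states them. The function name, the response labels and the "pending" state are made up for illustration, and WQ=3, AQ=2 is an assumption (the scenario does not state its quorums); this is not the BookKeeper client code.

```python
from collections import Counter


def recovery_read_outcome(responses, write_quorum, ack_quorum):
    """Classify a recovery read using the thresholds quoted in the proposal.

    responses holds one label per writeset bookie that has answered so far:
    "ok", "negative" (NoSuchEntry/NoSuchLedger) or "unknown".
    """
    qc = (write_quorum - ack_quorum) + 1  # Quorum Coverage
    counts = Counter(responses)
    if counts["ok"] >= ack_quorum:
        return "recoverable"
    if counts["negative"] >= qc:
        return "unrecoverable"
    if counts["unknown"] >= qc or len(responses) == write_quorum:
        return "unknown"
    return "pending"  # hypothetical state: still waiting for more responses


# Steps 8-9: B1 (which lost E0) and B2 both answer NoSuchEntry.
outcome = recovery_read_outcome(["negative", "negative"], write_quorum=3, ack_quorum=2)

# If B1 answered with an unknown code instead of an explicit negative,
# the QC negative threshold would not be reached by these two responses.
hedged = recovery_read_outcome(["unknown", "negative"], write_quorum=3, ack_quorum=2)
```

   Under these assumptions the first call reaches the QC negative threshold and the ledger would be closed without E0, while the second call does not reach any threshold from the same two bookies.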

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. L1 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirms. W1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. W1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. W2 closes the ledger as empty. Losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       Did we consider ELPL to get rid of the journal? With ELPL, each ledger effectively has its own consistency. With the addition of ledger-level durability flags and the flush-on-close option that @dlg99 mentioned, we can achieve all the current durability with ledger-level granularity and completely get rid of the journal.

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.

Review comment:
       Public clouds charge by IOPS. This is another argument we could make: across Journal+Index+EntryLog, we have great scope to reduce the storage and IOPS requirements.

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. L1 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirms. W1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. W1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. W2 closes the ledger as empty. Losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       > Yes, in case of mismatch, the cookie gets rewritten after all ledgers have been fenced and non-closed 
   
   How can you fence a non-closed ledger? The writer might have changed the ensemble and picked some other bookie to continue, and may even come back to this bookie after another ensemble change. I may have to read further down your proposal to see exactly how you detect and correct/fix this.
   
   > running bookie with the data on ephemeral storage
   
   Imagine a situation where the writer is writing to ephemeral storage with Qa=2, Qw=3, En=3 (B0, B1, B2). If a bookie (B0) goes down and comes back up having lost its data (ephemeral), and that bookie fences the ledger, can we make the writer continue to write to the ledger, given that it could continue to receive 2 acks from the other two bookies (B1, B2)?
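
   For reference, a small sketch of the arithmetic behind this question, using the QC and EC formulas from the proposal. The helper names are hypothetical, and the sketch only counts acks; it does not model how the client reacts to a fenced response from B0.

```python
def quorum_coverage(write_quorum, ack_quorum):
    # QC = (WQ - AQ) + 1, as defined in the proposal
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble_size, ack_quorum):
    # EC = (E - AQ) + 1, as defined in the proposal
    return (ensemble_size - ack_quorum) + 1


def add_is_acked(positive_acks, ack_quorum):
    # An add completes once AQ success responses have arrived
    return positive_acks >= ack_quorum


# The configuration from the question: En=3, Qw=3, Qa=2
E, WQ, AQ = 3, 3, 2
qc = quorum_coverage(WQ, AQ)
ec = ensemble_coverage(E, AQ)

# B0 does not ack the add; B1 and B2 do.
acked_without_b0 = add_is_acked(2, AQ)
```

   With these numbers QC and EC are both 2, and two acks from B1 and B2 are enough to meet the ack quorum on their own, which is why the behaviour of the writer when B0 is fenced matters.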
   

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       > If it were a DC power outage
   
   There are many k8s scenarios where an orderly shutdown may not happen. I guess we can improve the ops process to make it happen, but that would take some effort.
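
For reference, the recovery-read thresholds in the quoted proposal text (AQ positives, QC explicit negatives, otherwise unknown) can be sketched as follows. This is a simplified illustration under the assumption that responses are simply counted; the enum, the function name, and the string results are hypothetical and not the BookKeeper client API.

```python
from enum import Enum
from typing import List


class Response(Enum):
    OK = "ok"                        # positive
    NO_SUCH_ENTRY = "no_such_entry"  # explicit negative
    UNKNOWN = "unknown"              # any other non-success response


def classify_recovery_read(responses: List[Response], write_quorum: int, ack_quorum: int) -> str:
    """Classify a recovery read from the responses received so far."""
    qc = (write_quorum - ack_quorum) + 1
    positives = responses.count(Response.OK)
    negatives = responses.count(Response.NO_SUCH_ENTRY)
    unknowns = responses.count(Response.UNKNOWN)
    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    if unknowns >= qc:
        return "unknown"
    if len(responses) == write_quorum:
        # All responses received, but not enough for a positive or a negative.
        return "unknown"
    return "pending"


# Scenario 2 with w3:a2: B1 (after losing E0) and B2 both answer NoSuchEntry.
print(classify_recovery_read([Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY], 3, 2))
```

In Scenario 2 the two NoSuchEntry responses reach QC, so the entry is classified as unrecoverable even though it had been acknowledged; this is the outcome that an unknown response from B1 would avoid.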

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.

Review comment:
       With ELPL, I think we can have ledger level durability and still avoid journals.
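
As an illustration of the fencing and limbo rules in the quoted Protection Mechanism section, the following Python sketch models a single ledger on a single bookie. It assumes a plain in-memory dictionary of entries; the class, the function names, and the `FENCED` result string are hypothetical, and only `EUNKNOWN` and `NoSuchEntry` are names taken from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

OK = "OK"
NO_SUCH_ENTRY = "NoSuchEntry"
EUNKNOWN = "EUNKNOWN"
FENCED = "FENCED"  # hypothetical name for rejecting an add on a fenced ledger


@dataclass
class LedgerOnBookie:
    entries: Dict[int, bytes] = field(default_factory=dict)
    fenced: bool = False
    limbo: bool = False


def mark_after_unclean_shutdown(ledger: LedgerOnBookie) -> None:
    """Fence the ledger and place it in limbo, as done on boot after possible data loss."""
    ledger.fenced = True
    ledger.limbo = True


def read_entry(ledger: LedgerOnBookie, entry_id: int) -> Tuple[str, Optional[bytes]]:
    """Serve a read; a limbo ledger never returns an explicit negative."""
    if entry_id in ledger.entries:
        return OK, ledger.entries[entry_id]
    if ledger.limbo:
        return EUNKNOWN, None
    return NO_SUCH_ENTRY, None


def add_entry(ledger: LedgerOnBookie, entry_id: int, payload: bytes) -> str:
    """Accept an add only if the ledger is not fenced on this bookie."""
    if ledger.fenced:
        return FENCED
    ledger.entries[entry_id] = payload
    return OK


# A bookie that lost entry 0 in a crash and then booted with the protection mechanism.
ledger = LedgerOnBookie()
mark_after_unclean_shutdown(ledger)
print(read_entry(ledger, 0))
print(add_entry(ledger, 1, b"late write"))
```

The read of the lost entry yields an unknown rather than an explicit negative (Scenario 2), and the late add is rejected because the ledger is fenced (Scenario 1).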





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626974741



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.
+     2. Check that all entries assigned to this bookie exist in the index.
+     3. For any entries that are missing, copy from another bookie.
+     4. Clear limbo status if set
+
+When booting a bookie with empty disks, only phase one needs to be complete before the bookie makes itself available for client requests. 
+
+In phase one, if the cookie check fails, we mark all non-closed ledgers as “fenced”. This prevents any future writes to these ledgers on this bookie. This solves the problem of an empty bookie disk allowing writes to closed ledgers (Scenario 1).
+
+Given that the algorithm solves both the issues that cookies are designed to solve, we can now allow the bookie to update its cookie without operator intervention. 
+
+### Formal Verification of Proposed Changes
+
+The use of the limbo status and fencing of all ledgers on boot-up when detecting an unclean shutdown has been modelled in TLA+. It does not model the whole boot-up sequence but a simplified version with only fencing and limbo status. 
+
+The specification models the lifetime of a single ledger and includes a single bookie crashing, losing all data. The specification allows the testing of:
+
+- enabling/disabling the fencing
+- enabling/disabling the limbo status.
+
+When running without limbo status, the model checker finds the counterexample of scenario 2. When running without fencing of all ledgers, the model checker finds the counterexample of scenario 1. When running with both enabled, the model checker finds no invariant violation.
+
+The specification can be found here: https://github.com/Vanlightly/bookkeeper-tlaplus
+
+### Public Interfaces
+
+- New server config `journalWriteData`
+- Return codes. Addition of a new return code: `EUNKNOWN` which is returned when a read hits an in-limbo ledger and that ledger does not contain the requested entry id.
+- Bookie ledger metadata format (LedgerData). Addition of the limbo status.
+
+### Compatibility, Deprecation, and Migration Plan
+

Review comment:
       What happens if a bookie runs with `journalWriteData` set to true, then `journalWriteData` is set to false, the bookie reboots, and ledgers are in the limbo state? What is the order of recovery then? Do we do the "limbo processing" and, if so, does it happen before or after recovery from the journal?
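
As background for this question, the two boot phases described in the quoted "Full Boot-Up Sequence" section can be sketched roughly as below. This is only an in-memory model: the `Ledger` class and the `recover` and `fetch_from_peer` callbacks are hypothetical stand-ins, and the sketch does not model journal replay at all, so it does not settle the ordering asked about here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Ledger:
    ledger_id: int
    closed: bool
    expected_entries: Set[int] = field(default_factory=set)  # entries assigned to this bookie
    local_entries: Dict[int, bytes] = field(default_factory=dict)
    fenced: bool = False
    limbo: bool = False


def phase_one(ledgers: List[Ledger], cookie_ok: bool, clean_shutdown: bool) -> None:
    """If data loss is possible, fence every non-closed ledger and put it in limbo."""
    if cookie_ok and clean_shutdown:
        return
    for ledger in ledgers:
        if not ledger.closed:
            ledger.fenced = True
            ledger.limbo = True


def phase_two(
    ledgers: List[Ledger],
    recover: Callable[[Ledger], None],
    fetch_from_peer: Callable[[int, int], bytes],
) -> None:
    """Recover in-limbo ledgers, copy missing entries from peers, then clear limbo."""
    for ledger in ledgers:
        if ledger.limbo:
            recover(ledger)
        missing = ledger.expected_entries - set(ledger.local_entries)
        for entry_id in sorted(missing):
            ledger.local_entries[entry_id] = fetch_from_peer(ledger.ledger_id, entry_id)
        ledger.limbo = False


def close_ledger(ledger: Ledger) -> None:
    # Stand-in for open-and-recover: here it only marks the ledger closed.
    ledger.closed = True


ledgers = [Ledger(1, closed=False, expected_entries={0, 1}, local_entries={0: b"e0"})]
phase_one(ledgers, cookie_ok=True, clean_shutdown=False)
phase_two(ledgers, close_ledger, lambda ledger_id, entry_id: b"copied")
print(ledgers[0])
```

Per the quoted text, only phase one has to complete before the bookie serves client requests, which is why the sketch keeps the two phases as separate functions.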





[GitHub] [bookkeeper] eolivelli commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627237461



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change the ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the BookKeeper replication protocol, but their purpose is often unclear.
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie from starting with undetected data loss.
+
+This proposal improves the cookie mechanism by automating the resolution of cookie validation errors, which currently requires human intervention.

Review comment:
       Still I cannot understand why this is related to this work.
   The cookie is a static file (if we do not consider changing hostname or storage expansion...) and it cannot be corrupted by an unclean shutdown
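
For reference, the boot-time cookie check described in the quoted "A note on cookies" section can be sketched as follows. This is a minimal illustration under stated assumptions, not BookKeeper's implementation: `validate_cookies`, `CookieValidationError`, and the string cookies are hypothetical names, and treating a mismatch between copies as a failure is an assumption that the quoted text does not spell out.

```python
from typing import Dict, Optional


class CookieValidationError(Exception):
    """Hypothetical error raised when the boot must be refused."""


def validate_cookies(zk_cookie: Optional[str],
                     disk_cookies: Dict[str, Optional[str]]) -> str:
    """Return "first-boot" or "ok", or raise CookieValidationError.

    zk_cookie is the copy stored in ZK; disk_cookies maps each configured
    disk to the copy found on it (None when the disk holds no cookie).
    """
    copies = [zk_cookie, *disk_cookies.values()]
    if all(copy is None for copy in copies):
        # No cookie anywhere: the bookie has never booted before.
        return "first-boot"

    missing = [disk for disk, copy in disk_cookies.items() if copy is None]
    if zk_cookie is None or missing:
        # A missing cookie implies the rest of that disk's data is missing too.
        raise CookieValidationError(
            f"cookie missing (ZK present: {zk_cookie is not None}, "
            f"disks without cookie: {missing})")

    if any(copy != zk_cookie for copy in disk_cookies.values()):
        # Assumption: copies that disagree are also treated as a failure.
        raise CookieValidationError("cookie mismatch between ZK and disks")
    return "ok"
```

An unclean shutdown leaves every cookie in place, so a check of this kind passes even when recently written entries were lost. This is why the proposal adds a separate unclean-shutdown check and keeps cookie validation for the missing or empty disk case.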




[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

ivankelly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r638579649



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repairing data loss
+- Automatically fixing cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (that is, a crash or abrupt termination of the bookie). When an unclean shutdown is detected, further measures are taken to prevent data inconsistency.
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected, the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch from ZK the metadata of all ledgers in the cluster whose ensemble includes this bookie.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:

Review comment:
       With DbLedgerStorage, the index is flushed directly after update, which happens in a batch as the entrylog is flushed.
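
To make the limbo rule from the quoted "Protection Mechanism" section concrete, here is a small sketch of how converting explicit negatives to unknowns changes the outcome of Scenario 2. It is an illustration only: `Code`, `bookie_read_response`, and `recovery_read_outcome` are hypothetical names. Only the response categories, the `EUNKNOWN` code, and the AQ/QC thresholds come from the proposal text.

```python
from enum import Enum
from typing import List


class Code(Enum):
    OK = "OK"
    NO_SUCH_ENTRY = "NoSuchEntry"
    NO_SUCH_LEDGER = "NoSuchLedger"
    EUNKNOWN = "EUNKNOWN"


EXPLICIT_NEGATIVES = {Code.NO_SUCH_ENTRY, Code.NO_SUCH_LEDGER}


def bookie_read_response(local_result: Code, ledger_in_limbo: bool) -> Code:
    """A limbo ledger never answers with an explicit negative."""
    if ledger_in_limbo and local_result in EXPLICIT_NEGATIVES:
        return Code.EUNKNOWN
    return local_result


def recovery_read_outcome(responses: List[Code],
                          write_quorum: int, ack_quorum: int) -> str:
    """Classify a recovery read from the responses received so far."""
    qc = (write_quorum - ack_quorum) + 1
    positives = sum(1 for r in responses if r is Code.OK)
    negatives = sum(1 for r in responses if r in EXPLICIT_NEGATIVES)
    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    return "unknown"  # neither threshold met


# Scenario 2 as written: B1 (lost E0) and B2 both answer NoSuchEntry.
without_limbo = [bookie_read_response(Code.NO_SUCH_ENTRY, False),
                 Code.NO_SUCH_ENTRY]
# With limbo on B1 after its unclean shutdown; B3 still holds E0.
with_limbo = [bookie_read_response(Code.NO_SUCH_ENTRY, True),
              Code.NO_SUCH_ENTRY, Code.OK]

print(recovery_read_outcome(without_limbo, 3, 2))  # unrecoverable
print(recovery_read_outcome(with_limbo, 3, 2))     # unknown
```

With limbo in place, the recovery client can no longer reach the QC negative threshold through B1, so it cannot close the ledger as empty. It stays unable to make progress until B1 has been repaired.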




[GitHub] [bookkeeper] Vanlightly commented on pull request #2706: BP-46: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-957355091


   I have updated the BP number to 46 as I used 44 also for the USE metrics BP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678092409



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger

Review comment:
       We can make it so that fencing requests continue to go to the journal to avoid this problem. Losing a fencing status would still be possible via the cookie mechanism changes, though. If a disk was permanently lost and the proposed automatic cookie mechanism was used, the bookie would self-repair but lose the fenced status of any repaired ledgers.
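
The effect of a lost fenced status (Scenario 1 of the proposal) can be replayed with a toy model. This is a sketch under stated assumptions: the `Bookie` class and `write_acked_after_close` are hypothetical, and the `fence_on_boot` flag models the proposal's "Fencing" protection step (fence every known ledger after an unclean shutdown), not the journal-based alternative mentioned in the comment above.

```python
ACK_QUORUM = 2  # the e3:w3:a2 ledger from Scenario 1


class Bookie:
    """Toy bookie that keeps fenced ledgers and entries in memory only."""

    def __init__(self, name):
        self.name = name
        self.fenced = set()
        self.entries = set()

    def fence(self, ledger_id):
        self.fenced.add(ledger_id)

    def add_entry(self, ledger_id, entry_id):
        # A fenced ledger rejects normal adds.
        if ledger_id in self.fenced:
            return False
        self.entries.add((ledger_id, entry_id))
        return True

    def crash_and_restart(self, ledgers_from_metadata, fence_on_boot):
        # Unflushed state is lost: entries and fenced statuses alike.
        self.fenced.clear()
        self.entries.clear()
        if fence_on_boot:
            # Protection step: fence every ledger known from metadata.
            self.fenced.update(ledgers_from_metadata)


def write_acked_after_close(fence_on_boot):
    b1, b2, b3 = Bookie("B1"), Bookie("B2"), Bookie("B3")
    for bookie in (b1, b2, b3):
        bookie.add_entry("L1", 1)          # C1 writes E1 everywhere
    b1.fence("L1")
    b2.fence("L1")                         # the fence message to B3 is lost
    b2.crash_and_restart({"L1"}, fence_on_boot)
    acks = sum(b.add_entry("L1", 2) for b in (b1, b2, b3))  # C1 writes E2
    return acks >= ACK_QUORUM


print(write_acked_after_close(fence_on_boot=False))  # True
print(write_acked_after_close(fence_on_boot=True))   # False
```

Without the protection step, B2 and B3 form an ack quorum for E2 even though the ledger has already been closed. With it, only B3 accepts the add, so E2 is never acknowledged.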




[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678080442



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1 and E is the ensemble size
+- All bookies (ALL)

Review comment:
       I will rename ALL to Ensemble as that is what I meant by that.
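
As a worked example of the two coverage thresholds in the quoted list, the sketch below computes QC and EC and then checks by brute force that a set of responders touches every possible ack quorum of a cohort. The function names are hypothetical; only the two formulas come from the proposal.

```python
from itertools import combinations


def quorum_coverage(write_quorum, ack_quorum):
    """QC = (WQ - AQ) + 1, counted over the writeset of one entry."""
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble_size, ack_quorum):
    """EC = (E - AQ) + 1, counted over the ensemble of the fragment."""
    return (ensemble_size - ack_quorum) + 1


def touches_every_ack_quorum(cohort, responders, ack_quorum):
    """True if no ack quorum of the cohort is made up only of non-responders."""
    responders = set(responders)
    return all(responders & set(quorum)
               for quorum in combinations(cohort, ack_quorum))


cohort = ["B1", "B2", "B3"]             # e3:w3:a2, as in Scenario 1
print(quorum_coverage(3, 2))            # 2
print(ensemble_coverage(3, 2))          # 2
print(touches_every_ack_quorum(cohort, ["B1", "B2"], 2))  # True
print(touches_every_ack_quorum(cohort, ["B1"], 2))        # False
```

The last line shows why a single response is not enough with AQ = 2: B2 and B3 could still form an ack quorum on their own.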




[GitHub] [bookkeeper] eolivelli commented on pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-880789730


   @Vanlightly what's the status of this great work ?
   
   It would be great to move forward with this discussion and also see a preview of the implementation


[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628020851



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       If it is a controlled restart, then all bookies should be able to flush all their data to disk. So everything would be fine. 
   
   There is a risk that the graceful shutdown ends up not so graceful though, such as k8s not waiting long enough before killing the pods. Any early termination should be detected as an unclean shutdown.
   
   If it were a DC power outage then we lose unflushed data across the cluster.
   
   Either way, we have elevated risk of data loss. Any given ledger can recover from lossy writes and fencing ops without data loss only if the number of bookies with overlapping unclean shutdowns and recoveries is < Ack Quorum. Once we reach AQ, there could exist one or more entries that only reached AQ and were hosted on the affected bookies and lost, therefore being unrecoverable. We just need an entry to remain intact on a single bookie for it to be recoverable.
   
   It would make the use of AZs even more important, along with good automation that does not kill bookies if they take a long time to shut down.
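
As a rough illustration of the arithmetic in this comment, here is a minimal sketch. It assumes an entry is acknowledged on at least AQ bookies and that an "affected" bookie may have lost it. The class and method names (`LossTolerance`, `guaranteedRecoverable`, `entryRecoverable`) are hypothetical and are not part of BookKeeper.

```java
import java.util.Set;

final class LossTolerance {
    /**
     * Worst-case check: true only while fewer than AQ bookies are affected,
     * so every acknowledged entry still has at least one intact copy.
     */
    static boolean guaranteedRecoverable(int ackQuorum, int affectedBookies) {
        return affectedBookies < ackQuorum;
    }

    /** Per-entry check: recoverable if any bookie that acked it is unaffected. */
    static boolean entryRecoverable(Set<String> ackedBy, Set<String> affected) {
        return ackedBy.stream().anyMatch(b -> !affected.contains(b));
    }

    public static void main(String[] args) {
        int aq = 2; // ack quorum of an e3:w3:a2 ledger
        System.out.println(guaranteedRecoverable(aq, 1)); // true: one bookie affected
        System.out.println(guaranteedRecoverable(aq, 2)); // false: AQ bookies affected
        // Entry acked only by B2 and B3, both affected: possibly lost.
        System.out.println(entryRecoverable(Set.of("B2", "B3"), Set.of("B2", "B3"))); // false
        // Entry acked by B1 and B3, B1 unaffected: still recoverable.
        System.out.println(entryRecoverable(Set.of("B1", "B3"), Set.of("B2", "B3"))); // true
    }
}
```

The worst-case check is deliberately pessimistic: once AQ bookies are affected, a particular entry may still survive on an unaffected bookie, which is what the per-entry check captures.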





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630023281



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       @dlg99 These additions are very interesting and this BP lays the foundation for such additional features. I recommend that we leave this BP as is and then once our implementation is merged we consider these potential additions as a next step along those lines.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627200781



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected, further measures are taken to prevent data inconsistency.
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch from ZK the metadata of all ledgers in the cluster that have this bookie in their ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.
+     2. Check that all entries assigned to this bookie exist in the index.
+     3. For any entries that are missing, copy from another bookie.
+     4. Clear limbo status if set
+
+When booting a bookie with empty disks, only phase one needs to be complete before the bookie makes itself available for client requests. 
+
+In phase one, if the cookie check fails, we mark all non-closed ledgers as “fenced”. This prevents any future writes to these ledgers on this bookie. This solves the problem of an empty bookie disk allowing writes to closed ledgers (Scenario 1).
+
+Given that the algorithm solves both the issues that cookies are designed to solve, we can now allow the bookie to update its cookie without operator intervention. 
+
+### Formal Verification of Proposed Changes
+
+The use of the limbo status and fencing of all ledgers on boot-up when detecting an unclean shutdown has been modelled in TLA+. It does not model the whole boot-up sequence but a simplified version with only fencing and limbo status. 
+
+The specification models the lifetime of a single ledger and includes a single bookie crashing, losing all data. The specification allows the testing of:
+
+- enabling/disabling the fencing
+- enabling/disabling the limbo status.
+
+When running without limbo status, the model checker finds the counterexample of scenario 2. When running without fencing of all ledgers, the model checker finds the counterexample of scenario 1. When running with both enabled, the model checker finds no invariant violation.
+
+The specification can be found here: https://github.com/Vanlightly/bookkeeper-tlaplus
+
+### Public Interfaces
+
+- New server config `journalWriteData`
+- Return codes. Addition of a new return code: `EUNKNOWN` which is returned when a read hits an in-limbo ledger and that ledger does not contain the requested entry id.
+- Bookie ledger metadata format (LedgerData). Addition of the limbo status.
+
+### Compatibility, Deprecation, and Migration Plan
+

Review comment:
       Phase one is executed at the time of the cookie validation, which is pre-boot. Phase two, currently named DataIntegrityService, is inserted into the boot sequence immediately after the AutoRecoveryService. This means that phase two, if run, runs after the journal has been replayed.
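
To make the ordering of the two phases concrete, here is a minimal in-memory sketch of the boot-up sequence described in the BP. It is a simplified model under stated assumptions, not the actual DataIntegrityService: all names (`BootSketch`, `LedgerState`, `phaseOne`, `phaseTwo`, `read`) are hypothetical, ledger recovery is left as a comment, and the entries available on peers are passed in as a plain map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class BootSketch {
    enum ReadCode { OK, NO_SUCH_ENTRY, EUNKNOWN }

    /** Local state one bookie keeps for a ledger (hypothetical model). */
    static final class LedgerState {
        final boolean closedInMetadata;   // taken from the ledger metadata in ZK
        final Set<Long> expectedEntries;  // entries assigned to this bookie
        final Set<Long> storedEntries = new HashSet<>();
        boolean fenced;
        boolean limbo;

        LedgerState(boolean closedInMetadata, Set<Long> expectedEntries) {
            this.closedInMetadata = closedInMetadata;
            this.expectedEntries = expectedEntries;
        }
    }

    final Map<Long, LedgerState> ledgers = new HashMap<>();

    /** Phase one: runs with cookie validation, before serving client requests. */
    void phaseOne(boolean cookieValid, boolean cleanShutdown) {
        if (cookieValid && cleanShutdown) {
            return; // no sign of data loss, nothing to protect against
        }
        for (LedgerState l : ledgers.values()) {
            if (!l.closedInMetadata) {
                l.fenced = true; // blocks late writes (scenario 1)
                l.limbo = true;  // blocks explicit negatives (scenario 2)
            }
        }
    }

    /** Phase two: copy missing entries from peers, then clear limbo. */
    int phaseTwo(Map<Long, Set<Long>> peerEntries) {
        int copied = 0;
        for (Map.Entry<Long, LedgerState> e : ledgers.entrySet()) {
            LedgerState l = e.getValue();
            // A real implementation would open and recover an in-limbo ledger here.
            Set<Long> onPeers = peerEntries.getOrDefault(e.getKey(), Set.of());
            for (long id : l.expectedEntries) {
                if (!l.storedEntries.contains(id) && onPeers.contains(id)) {
                    l.storedEntries.add(id);
                    copied++;
                }
            }
            if (l.storedEntries.containsAll(l.expectedEntries)) {
                l.limbo = false; // fully repaired
            }
        }
        return copied;
    }

    /** Reads never return an explicit negative while the ledger is in limbo. */
    ReadCode read(long ledgerId, long entryId) {
        LedgerState l = ledgers.get(ledgerId);
        if (l != null && l.storedEntries.contains(entryId)) {
            return ReadCode.OK;
        }
        if (l != null && l.limbo) {
            return ReadCode.EUNKNOWN;
        }
        return ReadCode.NO_SUCH_ENTRY;
    }
}
```

In this model a bookie that lost entry 1 of an open ledger answers a read for it with `EUNKNOWN` after phase one, and with `OK` once phase two has copied the entry back from a peer. The fenced flag is never cleared.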





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678079744



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost.
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so it continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger and committing the endpoint of the ledger in ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the BookKeeper replication protocol, but their purpose is often unclear.
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie from starting with undetected data loss.
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying instead with an unknown code until the data is repaired
+- Repairing data loss
+- Automatically fixing cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected, further measures are taken to prevent data inconsistency.
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
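
To illustrate how the limbo status described above interacts with the recovery-read rules quoted earlier in the proposal (entry recoverable on AQ positives, unrecoverable on QC explicit negatives, otherwise unknown), here is a minimal sketch. It is not BookKeeper code: `Response`, `bookie_read_response`, and `recovery_read_outcome` are hypothetical names, and `UNKNOWN` stands in for the proposed EUNKNOWN code as well as any other non-success, non-negative response.

```python
from enum import Enum


class Response(Enum):
    OK = "ok"                        # positive
    NO_SUCH_ENTRY = "no_such_entry"  # explicit negative
    UNKNOWN = "unknown"              # e.g. the proposed EUNKNOWN code


def bookie_read_response(has_entry: bool, ledger_in_limbo: bool) -> Response:
    """A bookie in limbo never answers with an explicit negative."""
    if has_entry:
        return Response.OK
    return Response.UNKNOWN if ledger_in_limbo else Response.NO_SUCH_ENTRY


def recovery_read_outcome(responses, write_quorum: int, ack_quorum: int) -> str:
    """Classify a recovery read from the responses received so far."""
    quorum_coverage = (write_quorum - ack_quorum) + 1
    positives = sum(r is Response.OK for r in responses)
    negatives = sum(r is Response.NO_SUCH_ENTRY for r in responses)
    if positives >= ack_quorum:
        return "recoverable"
    if negatives >= quorum_coverage:
        return "unrecoverable"
    return "unknown"  # cannot make progress yet


# Scenario 2 without protection: B1 lost E0 and B2 never stored it.
unprotected = [
    bookie_read_response(has_entry=False, ledger_in_limbo=False),  # B1
    bookie_read_response(has_entry=False, ledger_in_limbo=False),  # B2
]

# Same scenario with B1 in limbo after an unclean shutdown.
protected = [
    bookie_read_response(has_entry=False, ledger_in_limbo=True),   # B1
    bookie_read_response(has_entry=False, ledger_in_limbo=False),  # B2
    bookie_read_response(has_entry=True, ledger_in_limbo=False),   # B3
]

unprotected_outcome = recovery_read_outcome(unprotected, write_quorum=3, ack_quorum=2)
protected_outcome = recovery_read_outcome(protected, write_quorum=3, ack_quorum=2)
```

In the unprotected case the two explicit negatives reach QC and the ledger would be closed without E0. With B1 in limbo, the read stays unknown, so recovery cannot truncate the entry and has to wait until B1 has been repaired.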

Review comment:
       You probably could, but it wouldn't support large numbers of concurrent ledgers very well. The problem is that even with SSDs, you suffer the penalty of many small writes. You can't even delay the syncs to disk, as that would increase write latency too much. So while ELPL, once further matured, could provide a way of running without the journal, it may not be advisable for all workloads.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678082834



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 

Review comment:
       This describes running without a journal, where flushing occurs on entry log rotation. But it is true that the cookie mechanism prevents coming back up with wiped disks, so I will remove that.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630027542



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       @eolivelli I'd prefer to keep the cookie stuff in this BP as we have a working implementation ready to merge. Changes now would require refactoring, and our main priority is syncing with OSS. We will of course add the flag, but we'd like to avoid further changes unless they are really necessary.





[GitHub] [bookkeeper] eolivelli commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627106583



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       Are you talking about automatically rewriting the cookie in case of a mismatch?





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627793341



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) share the following definition; only the cohorts differ:
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
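
A minimal Python sketch of the quorum thresholds and the recovery-read rule from the quoted proposal. The names (`Resp`, `quorum_coverage`, `ensemble_coverage`, `recovery_read_outcome`) are illustrative assumptions and not BookKeeper APIs; only the formulas QC = (WQ - AQ) + 1 and EC = (E - AQ) + 1 and the decision rules are taken from the text.

```python
from collections import Counter
from enum import Enum


class Resp(Enum):
    OK = "positive"
    NO_SUCH_ENTRY = "explicit negative"
    UNKNOWN = "unknown"


def quorum_coverage(wq: int, aq: int) -> int:
    """QC: smallest set that intersects every ack quorum of a writeset."""
    return (wq - aq) + 1


def ensemble_coverage(e: int, aq: int) -> int:
    """EC: smallest set that intersects every ack quorum of an ensemble."""
    return (e - aq) + 1


def recovery_read_outcome(responses, wq: int, aq: int) -> str:
    """Classify a recovery read from the responses received so far."""
    counts = Counter(responses)
    qc = quorum_coverage(wq, aq)
    if counts[Resp.OK] >= aq:
        return "recoverable"
    if counts[Resp.NO_SUCH_ENTRY] >= qc:
        return "unrecoverable"
    # QC unknowns, or every bookie answered without reaching a threshold.
    if counts[Resp.UNKNOWN] >= qc or len(responses) == wq:
        return "unknown"
    return "pending"
```

With the e3:w3:a2 configuration used in the scenarios, QC and EC are both 2, so two NoSuchEntry responses are enough for a negative recovery read. Because AQ + QC = WQ + 1, the positive and negative thresholds cannot both be reached for the same entry.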

Review comment:
       @eolivelli my guess is that it could be related to running a bookie with its data on ephemeral storage (low value or transient/recoverable data, cost saving). It simplifies maintenance in this case.
   But what happens in a private DC if someone borks fstab?
   I agree it has to be an option.





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628378654



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
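
A rough Python sketch of how the proposed rule "reply with an unknown code instead of an explicit negative when data loss may have occurred" changes Scenario 2. The function names and the `possible_data_loss` flag are hypothetical and used only for illustration; the thresholds follow the e3:w3:a2 configuration from the scenarios.

```python
AQ, WQ = 2, 3
QC = (WQ - AQ) + 1  # quorum coverage, 2 for w3:a2


def bookie_read(has_entry: bool, possible_data_loss: bool) -> str:
    """Hypothetical bookie-side read handler."""
    if has_entry:
        return "OK"
    # Proposed rule: never answer with an explicit negative while
    # data loss is suspected and not yet repaired.
    return "UNKNOWN" if possible_data_loss else "NO_SUCH_ENTRY"


def recovery_read(responses) -> str:
    """Simplified client-side decision over the responses seen so far."""
    if responses.count("OK") >= AQ:
        return "recoverable"
    if responses.count("NO_SUCH_ENTRY") >= QC:
        return "unrecoverable"
    return "undecided"  # still waiting, or cannot make progress


# Scenario 2: B1 lost E0 in a crash, B2 never received it.
# B3 holds E0, but its response has not arrived yet.
undetected = [bookie_read(False, False), bookie_read(False, False)]
detected = [bookie_read(False, True), bookie_read(False, False)]

outcome_without_detection = recovery_read(undetected)
outcome_with_detection = recovery_read(detected)
```

Without detection the two NoSuchEntry responses reach QC and the ledger is closed as empty. With B1 answering unknown, the negative threshold is not reached, so recovery cannot truncate E0 and has to wait until B1's data is repaired.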

Review comment:
       Currently readLac returns N, and you can read entry N both now and after a DC-wide power loss. 
   With the changes you can read entry N now, but after an unluckily timed DC-wide power loss the whole WQ will return EUNKNOWN. 
   
   I understand that this is to be expected in this case, but I'd like to consider some additions:
   - Consider having something like `readLacPersisted()` (or `readLac(boolean isPersisted)`) to distinguish potentially lossy tail entries from persisted ones
   - and/or make journal bypass configurable at the per-ledger-handle level
   - have ledger close() return only after all ledger entries (at least AQ) are flushed to entry logs and fsynced.
   
   Also document the durability expectations for the tail entries, with a clear explanation of the risks, so that people can make a more educated choice.
   
   Otherwise the only option is to have two bookie clusters (strong durability and journal bypass), which adds operational overhead plus more changes in the app.
   
   The use case is an app using BK both for its WAL, with strong durability requirements, and for data where open-to-close durability matters but a failure in the process is recoverable, i.e. a lost entry at the tail of the WAL may prevent the app from recovering gracefully. 
   
   @jvrao fyi





[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

ivankelly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627227206



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       @Vanlightly we should have a flag for whether automatic rewriting is allowed. We've been burned in the past by misconfiguration allowing bookies to come up where it really should have been kicked back to a human. Or maybe not a flag, but the steps to unjam yourself from a cookie mismatch should be a single command. 
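
One possible shape for such a flag, as a Python sketch only. The function `cookie_boot_decision`, the `allow_auto_fix` parameter and the returned labels are hypothetical and not part of BookKeeper or of the proposal; treating a mismatched cookie the same as a missing one is also an assumption. What the automated repair would actually do is not specified in the text quoted here, so the sketch stops at the decision.

```python
from typing import Dict, Optional


def cookie_boot_decision(zk_cookie: Optional[str],
                         disk_cookies: Dict[str, Optional[str]],
                         allow_auto_fix: bool) -> str:
    """Decide what to do at boot from the cookie in ZK and on each disk.

    A value of None stands for a missing cookie.
    """
    cookies = [zk_cookie] + list(disk_cookies.values())
    all_present = all(c is not None for c in cookies)
    all_match = all_present and len(set(cookies)) == 1
    if all_match:
        return "boot"
    # Validation failed: either hand over to the automated repair
    # or refuse to start and leave the decision to an operator.
    return "repair-then-boot" if allow_auto_fix else "fail-boot"
```

With the flag off, the behaviour matches the current description: any missing cookie stops the boot and requires human intervention.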





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626709616



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       It's related. We're protecting against silent data loss that can cause inconsistency in the protocol. The cookies already played a role but we wanted the new mechanism to play well with the cookie mechanism and improve it.
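
The boot-time cookie check described in the quoted section can be sketched as follows. This is a minimal illustration only, not BookKeeper's actual implementation: the `Cookie` fields, `CookieValidationError`, and `validate_cookies` are hypothetical names, and treating a mismatched cookie as a failure is an assumption beyond the "missing cookie" case the text states.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Cookie:
    """Hypothetical, simplified cookie: the bookie's immutable identity."""
    bookie_address: str
    instance_id: str


class CookieValidationError(Exception):
    """Raised when the boot must be refused because of a cookie problem."""


def validate_cookies(zk_cookie: Optional[Cookie],
                     disk_cookies: Dict[str, Optional[Cookie]]) -> Optional[Cookie]:
    """Return the validated cookie, or None on a first boot.

    disk_cookies maps a directory path to the cookie found there
    (None if the directory has no cookie).
    """
    copies = [zk_cookie, *disk_cookies.values()]
    if all(c is None for c in copies):
        # First boot: no cookie anywhere, so the caller would generate one.
        return None

    missing = [path for path, c in disk_cookies.items() if c is None]
    if zk_cookie is None:
        missing.append("zk")
    if missing:
        # A missing cookie implies the rest of that location's data may be gone.
        raise CookieValidationError(f"cookie missing from: {sorted(missing)}")

    # Assumption: a cookie that differs from the ZK copy is also a failure.
    if any(c != zk_cookie for c in disk_cookies.values()):
        raise CookieValidationError("cookie mismatch between ZK and disk")
    return zk_cookie
```

In this sketch a bookie restarted with one empty disk fails validation instead of serving reads with undetected data loss, which is the protection the cookie mechanism is meant to provide.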





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626704676



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).

Review comment:
       This is an internal status that is not exposed to clients. It is another field of the bookie ledger metadata (not the ZK metadata). The only exposure to clients is the EUNKNOWN response code.
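
To make the effect of the limbo status concrete, here is a rough sketch of Scenario 2 with and without it. The names `Code`, `bookie_read_response`, and `classify_recovery_read` are hypothetical and not taken from the BookKeeper code base. The classifier assumes all responses from the write set have already been collected, and it folds both "unknown" cases from the quorum rules into a single result.

```python
from enum import Enum
from typing import Iterable


class Code(Enum):
    OK = "OK"
    NO_SUCH_ENTRY = "NoSuchEntry"
    NO_SUCH_LEDGER = "NoSuchLedger"
    EUNKNOWN = "EUNKNOWN"


EXPLICIT_NEGATIVES = {Code.NO_SUCH_ENTRY, Code.NO_SUCH_LEDGER}


def bookie_read_response(local_result: Code, in_limbo: bool) -> Code:
    """What a bookie sends back, given its local lookup and the limbo flag."""
    if in_limbo and local_result in EXPLICIT_NEGATIVES:
        # A limbo ledger may have lost data, so it must not claim absence.
        return Code.EUNKNOWN
    return local_result


def classify_recovery_read(responses: Iterable[Code], wq: int, aq: int) -> str:
    """Classify a recovery read once every bookie in the write set has replied."""
    qc = (wq - aq) + 1  # Quorum Coverage, as defined in the proposal
    responses = list(responses)
    positives = sum(r is Code.OK for r in responses)
    negatives = sum(r in EXPLICIT_NEGATIVES for r in responses)
    if positives >= aq:
        return "recoverable"
    if negatives >= qc:
        return "unrecoverable"
    return "unknown"  # cannot make progress


# Scenario 2 with e3:w3:a2: B1 lost E0, B2 never stored it, B3 still has it.
without_limbo = [bookie_read_response(Code.NO_SUCH_ENTRY, in_limbo=False),
                 Code.NO_SUCH_ENTRY, Code.OK]
with_limbo = [bookie_read_response(Code.NO_SUCH_ENTRY, in_limbo=True),
              Code.NO_SUCH_ENTRY, Code.OK]
print(classify_recovery_read(without_limbo, wq=3, aq=2))
print(classify_recovery_read(with_limbo, wq=3, aq=2))
```

Without limbo, two explicit negatives reach QC and the entry is treated as unrecoverable. With B1 in limbo, only one explicit negative remains, so recovery cannot close the ledger as empty and has to wait until the data is repaired.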





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r628054942



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       Regarding your question about readLac: a limbo ledger returns EUNKNOWN, as it cannot safely answer that question.
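
As an illustration of how EUNKNOWN responses interact with the fencing-phase rule quoted above (complete at EC positives, stuck at AQ unknowns), here is a small sketch. The function names and the use of plain strings for response codes are assumptions made for the example, and "pending" is an added label for the case where more responses are still needed.

```python
from typing import List


def ensemble_coverage(ensemble_size: int, ack_quorum: int) -> int:
    """EC = (E - AQ) + 1, as defined in the proposal."""
    return (ensemble_size - ack_quorum) + 1


def classify_fencing(responses: List[str], ensemble_size: int, ack_quorum: int) -> str:
    """Classify the fencing-phase LAC read from the responses seen so far.

    Any response other than "OK" (for example "EUNKNOWN" from a limbo
    ledger, or a timeout) is counted as unknown.
    """
    ec = ensemble_coverage(ensemble_size, ack_quorum)
    positives = sum(1 for r in responses if r == "OK")
    unknowns = len(responses) - positives
    if positives >= ec:
        return "complete"
    if unknowns >= ack_quorum:
        return "unknown"  # cannot make progress
    return "pending"      # keep waiting for more responses


# e3:w3:a2 gives EC = 2.
print(classify_fencing(["OK", "OK", "EUNKNOWN"], ensemble_size=3, ack_quorum=2))
print(classify_fencing(["OK", "EUNKNOWN", "EUNKNOWN"], ensemble_size=3, ack_quorum=2))
```

With a single limbo bookie answering EUNKNOWN, fencing can still complete from the other two bookies. With two such answers, an ack quorum of bookies is in an unknown state and recovery cannot safely proceed.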





[GitHub] [bookkeeper] dlg99 commented on pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-833889979


   @jvrao You might be interested in this



[GitHub] [bookkeeper] Vanlightly commented on pull request #2706: BP-46: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-957355091


   I have updated the BP number to 46 as I used 44 also for the USE metrics BP.



[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626956932



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):

Review comment:
       IIRC, currently:
   Negative when explicit negative from a single bookie and no explicit positive response before that.
   https://github.com/apache/bookkeeper/issues/2612
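
For reference, the read rule as written in the quoted document (not the current client behaviour recalled in the comment above) could be sketched like this. `classify_read` is a hypothetical helper, response codes are plain strings for brevity, and the input is assumed to contain one response per bookie in the write set.

```python
from typing import List

EXPLICIT_NEGATIVES = {"NoSuchEntry", "NoSuchLedger"}


def classify_read(responses: List[str]) -> str:
    """Classify a regular read per the document's stated rule.

    - positive: at least one bookie returned OK
    - negative: every bookie returned an explicit negative
    - unknown:  no positive and at least one non-explicit-negative error
    """
    if any(r == "OK" for r in responses):
        return "positive"
    if responses and all(r in EXPLICIT_NEGATIVES for r in responses):
        return "negative"
    return "unknown"


print(classify_read(["NoSuchEntry", "OK", "NoSuchEntry"]))
print(classify_read(["NoSuchEntry", "NoSuchEntry", "NoSuchEntry"]))
print(classify_read(["NoSuchEntry", "EUNKNOWN", "NoSuchEntry"]))
```

Under this rule a single NoSuchEntry is never enough to report the entry as absent, which is where it differs from the behaviour described in the comment.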





[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

ivankelly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r638575113



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       > Did we consider ELPL to get-rid of journal?





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626946127



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies

Review comment:
       IIRC, currently:
   Negative when explicit negative from a single bookie and no explicit positive response before that.
   https://github.com/apache/bookkeeper/issues/2612
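
For comparison, the regular-read rule as the proposal text states it (positive on a single OK, negative only when every bookie returns an explicit negative, unknown otherwise) could be sketched roughly as below. This is a minimal illustration, not BookKeeper code: the class `ReadOutcomeSketch`, its enums, and `classify` are hypothetical names, and the sketch assumes responses from all bookies in the write set have already been collected.

```java
import java.util.List;

// Hypothetical illustration of the read rule quoted from the proposal;
// none of these types exist in BookKeeper.
final class ReadOutcomeSketch {
    enum Response { OK, NO_SUCH_ENTRY, NO_SUCH_LEDGER, OTHER_ERROR }
    enum Outcome { POSITIVE, NEGATIVE, UNKNOWN }

    private ReadOutcomeSketch() {}

    /** Classifies a regular read once every bookie in the write set has answered. */
    static Outcome classify(List<Response> responsesFromAllBookies) {
        if (responsesFromAllBookies.isEmpty()) {
            throw new IllegalArgumentException("need at least one response");
        }
        boolean sawUnknown = false;
        for (Response r : responsesFromAllBookies) {
            if (r == Response.OK) {
                return Outcome.POSITIVE; // a single positive is enough
            }
            if (r != Response.NO_SUCH_ENTRY && r != Response.NO_SUCH_LEDGER) {
                sawUnknown = true; // any other non-success response is "unknown"
            }
        }
        // Negative only if every response was an explicit negative.
        return sawUnknown ? Outcome.UNKNOWN : Outcome.NEGATIVE;
    }
}
```

Under the behaviour described in this comment (negative on the first explicit negative with no earlier positive), the loop would instead return NEGATIVE as soon as it saw a `NO_SUCH_ENTRY`, which is the difference being pointed out.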
   





[GitHub] [bookkeeper] fpj commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

fpj commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630311579



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.

Review comment:
       Right, that's a concept that's in the code, makes sense. We might want to reflect some of that in Javadocs, but that's a separate issue.
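
As a rough illustration of the two coverage thresholds and of how the proposal's recovery-read rules use QC, here is a minimal Java sketch. The class `CoverageSketch` and all of its members are hypothetical names chosen for this example and are not the classes in the BookKeeper code base; the `PENDING` state (not all responses in yet) is also an assumption added so the function is total.

```java
// Hypothetical helper for illustration only; not a BookKeeper class.
final class CoverageSketch {
    enum RecoveryRead { RECOVERABLE, UNRECOVERABLE, UNKNOWN, PENDING }

    private CoverageSketch() {}

    /** QC = (WQ - AQ) + 1, counted over the write set of a single entry. */
    static int quorumCoverage(int writeQuorum, int ackQuorum) {
        requireValid(writeQuorum, ackQuorum);
        return (writeQuorum - ackQuorum) + 1;
    }

    /** EC = (E - AQ) + 1, counted over the ensemble of the current fragment. */
    static int ensembleCoverage(int ensembleSize, int ackQuorum) {
        requireValid(ensembleSize, ackQuorum);
        return (ensembleSize - ackQuorum) + 1;
    }

    /** Applies the recovery-read rules from the proposal to response counts. */
    static RecoveryRead classify(int writeQuorum, int ackQuorum,
                                 int positives, int negatives, int unknowns) {
        int qc = quorumCoverage(writeQuorum, ackQuorum);
        if (positives >= ackQuorum) {
            return RecoveryRead.RECOVERABLE;   // AQ positive read responses
        }
        if (negatives >= qc) {
            return RecoveryRead.UNRECOVERABLE; // no AQ of bookies can hold the entry
        }
        if (unknowns >= qc) {
            return RecoveryRead.UNKNOWN;       // cannot make progress
        }
        if (positives + negatives + unknowns >= writeQuorum) {
            return RecoveryRead.UNKNOWN;       // all responses in, no threshold met
        }
        return RecoveryRead.PENDING;           // still waiting for responses
    }

    private static void requireValid(int cohortSize, int ackQuorum) {
        if (ackQuorum < 1 || ackQuorum > cohortSize) {
            throw new IllegalArgumentException("need 1 <= AQ <= cohort size");
        }
    }
}
```

The threshold works because once QC bookies satisfy a property, only WQ - QC = AQ - 1 bookies remain, which is one too few to form an ack quorum that lacks it. With WQ=3 and AQ=2, two NoSuchEntry responses are therefore enough for `classify(3, 2, 0, 2, 0)` to report the entry as unrecoverable, which is what makes an undetected loss on one bookie dangerous.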





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678096550



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.

Review comment:
       We set a bit in the index on start-up and clear it as the last step of shutdown. Based on the value on start-up, we know whether the bookie was shut down cleanly. The only way to know whether data loss has actually occurred is through the subsequent steps of the process.
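
The set-on-start-up, clear-on-shutdown idea can be sketched as follows. This is only an illustration under stated assumptions: it uses a marker file named `DIRTY` in a state directory as a stand-in for the bit in the index, and the class `CleanShutdownMarkerSketch` and its methods are hypothetical. A real implementation would also need to make the marker durable (fsync) before serving requests, which this sketch omits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the "dirty bit in the index": a marker file that
// exists for as long as the process has not shut down cleanly.
final class CleanShutdownMarkerSketch {
    private final Path dirtyMarker;

    CleanShutdownMarkerSketch(Path stateDir) {
        this.dirtyMarker = stateDir.resolve("DIRTY");
    }

    /**
     * Called first on boot. Returns true if the previous run left the marker
     * behind (unclean shutdown), then ensures the marker is set for this run.
     */
    boolean startUpAndCheckUnclean() {
        try {
            boolean unclean = Files.exists(dirtyMarker);
            if (!unclean) {
                Files.createFile(dirtyMarker);
            }
            return unclean;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called as the very last step of an orderly shutdown. */
    void shutDownCleanly() {
        try {
            Files.deleteIfExists(dirtyMarker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A `true` result from `startUpAndCheckUnclean()` only says that the shutdown was unclean; as noted above, whether any data was actually lost is determined by the later steps.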





[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

ivankelly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r638575113



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+

Review comment:
       > Did we consider ELPL to get-rid of journal?
   
   We did not. ELPL requires random writes per ledger, which hurts performance when there are many ledgers; that was not acceptable in our use case.





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r627205191



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       Yes, in case of mismatch, the cookie gets rewritten after all ledgers have been fenced and non-closed ledgers put in-limbo, as part of phase one, pre boot sequence. It allows the bookie to automatically handle its own recovery, rather than require operator intervention.
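
As a rough illustration of that ordering, here is a minimal Python sketch of phase one. It follows the boot sequence in the proposal (each non-closed ledger is marked fenced and in-limbo, and the cookie is updated only when the cookie check failed). The `LedgerState` class, the `phase_one` function, and its boolean inputs are hypothetical names used for illustration, not the bookie's actual code.

```python
from dataclasses import dataclass


@dataclass
class LedgerState:
    ledger_id: int
    closed: bool
    fenced: bool = False
    in_limbo: bool = False


def phase_one(ledgers, cookie_valid, clean_shutdown):
    """Mark ledgers when data loss is possible; return True if the cookie should be rewritten."""
    if cookie_valid and clean_shutdown:
        return False  # nothing suspicious, normal boot
    for ledger in ledgers:
        if not ledger.closed:
            ledger.fenced = True
            ledger.in_limbo = True
    # The cookie is only rewritten after the ledgers above have been marked,
    # and only when the trigger was a cookie failure.
    return not cookie_valid
```

Rewriting the cookie last means that a crash part-way through phase one leaves the cookie mismatch in place, so the next boot repeats the marking rather than skipping it.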





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r638537175



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       Regarding the fencing: the bookie only fences itself, so this has no impact on any other bookies. Later on it performs recovery, which we know is correct no matter what any other writer is doing.
   
   Regarding the ephemeral storage: yes, another writer could still make progress until the bookie initiates recovery and closes the ledger. This does not result in data loss.
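
To illustrate the local-fencing point against Scenario 1, here is a minimal Python sketch, assuming a fenced bookie rejects adds and an add is acked once AQ bookies accept it. The `add_is_acked` helper and the dictionaries are hypothetical and only model the scenario from the proposal.

```python
def add_is_acked(fenced, ack_quorum):
    """An add is acked once ack_quorum bookies accept it; fenced bookies reject adds."""
    acks = sum(1 for is_fenced in fenced.values() if not is_fenced)
    return acks >= ack_quorum


AQ = 2  # e3:w3:a2 configuration from Scenario 1

# Scenario 1 as written: B2 restarts having lost its fenced flag.
without_protection = {"B1": True, "B2": False, "B3": False}

# With the protection mechanism: B2 fences the ledger locally on boot.
with_protection = {"B1": True, "B2": True, "B3": False}
```

In the first case `add_is_acked(without_protection, AQ)` is true, so C1 believes E2 is persisted after the ledger was closed. In the second case only B3 accepts the add, so the result is false and the write cannot be acknowledged.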





[GitHub] [bookkeeper] Vanlightly commented on pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-888166341


   @eolivelli Once I am done with a couple of active projects I can finish up the code changes and submit a PR for review. 



[GitHub] [bookkeeper] eolivelli commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

eolivelli commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626690004



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, has lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.

Review comment:
       This will start a BK client inside the bookie.
   How can we ensure that the client is configured with the right parameters?
   Authentication also won't be easy to set up.
   
   We could probably leverage the same configuration as the AutoRecovery daemon.

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.

Review comment:
       Why is this related to this BP?
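
For context on the cookie note quoted above, here is a minimal sketch of the boot-time check it describes: the cookie must be present in ZK and on every disk in use, otherwise the bookie refuses to boot. The class and method names are hypothetical and are not BookKeeper's actual Cookie API. Requiring the copies to be identical is an extra assumption of the sketch; the quoted text only talks about a missing cookie.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative only: hypothetical names, not BookKeeper's Cookie API. */
final class CookieCheckSketch {

    /**
     * Check for a bookie that has booted before: the cookie must be present
     * in the metadata store (ZK) and on every configured disk.
     */
    static boolean cookiesValid(Optional<String> zkCookie, List<Optional<String>> diskCookies) {
        if (!zkCookie.isPresent()) {
            return false; // cookie missing from ZK
        }
        for (Optional<String> diskCookie : diskCookies) {
            if (!diskCookie.isPresent()) {
                return false; // empty disk: the rest of its data is presumed missing too
            }
            if (!diskCookie.get().equals(zkCookie.get())) {
                return false; // assumption of this sketch: a differing cookie also fails
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Optional<String> zk = Optional.of("bookie-1");
        List<Optional<String>> healthy =
                Arrays.asList(Optional.of("bookie-1"), Optional.of("bookie-1"));
        List<Optional<String>> oneDiskWiped =
                Arrays.asList(Optional.of("bookie-1"), Optional.<String>empty());
        System.out.println("healthy disks valid: " + cookiesValid(zk, healthy));
        System.out.println("one disk wiped valid: " + cookiesValid(zk, oneDiskWiped));
    }
}
```

The relevance to the proposal is that a failed check of this kind signals possible data loss, just like an unclean shutdown does.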

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).

Review comment:
       "All open ledgers are placed in the limbo status"
   
   Is this limbo status held in memory on the bookie?
   We are not talking about propagating this information to every BK client that is writing?
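
To make the limbo rule in the quoted hunk concrete, here is a minimal sketch of the bookie-side conversion: a read on a limbo ledger never returns an explicit negative, and returns EUNKNOWN instead. All names are hypothetical. The in-memory set is only a stand-in for wherever the flag is actually stored (the proposal's boot sequence speaks of marking ledgers as in-limbo in the index), and nothing here is pushed to clients.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: hypothetical names, not BookKeeper's read path. */
final class LimboReadSketch {

    enum Code { OK, NO_SUCH_ENTRY, NO_SUCH_LEDGER, EUNKNOWN }

    // Stand-in for the persisted limbo flag; where it lives is not modelled here.
    private final Set<Long> limboLedgers = new HashSet<>();

    void markLimbo(long ledgerId) {
        limboLedgers.add(ledgerId);
    }

    void clearLimbo(long ledgerId) {
        limboLedgers.remove(ledgerId); // done once the ledger has been repaired
    }

    /** Converts explicit negatives to EUNKNOWN while the ledger is in limbo. */
    Code respond(long ledgerId, Code localResult) {
        boolean explicitNegative =
                localResult == Code.NO_SUCH_ENTRY || localResult == Code.NO_SUCH_LEDGER;
        if (explicitNegative && limboLedgers.contains(ledgerId)) {
            return Code.EUNKNOWN;
        }
        return localResult;
    }

    public static void main(String[] args) {
        LimboReadSketch bookie = new LimboReadSketch();
        bookie.markLimbo(7L);
        System.out.println(bookie.respond(7L, Code.NO_SUCH_ENTRY));
        bookie.clearLimbo(7L);
        System.out.println(bookie.respond(7L, Code.NO_SUCH_ENTRY));
    }
}
```

In this sketch the client only sees a different response code, so it treats the bookie like any other bookie that returned a non-explicit error.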





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626708122



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Preventing explicit negative responses when data loss may have occurred, replying with an unknown code instead until the data is repaired
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster is obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:
+     - For each non-closed ledger, mark the ledger as fenced and in-limbo in the index.
+     - Update the cookie if it was a cookie failure
+4. Phase two
+   - For each ledger
+     1. If the ledger is in-limbo, open and recover the ledger.

Review comment:
       It uses the BookKeeperAdmin client like the Auditor does.
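
As a rough illustration of phase one of the quoted boot sequence, the sketch below selects the ledgers that would be marked fenced and in-limbo. It assumes the ledger metadata has already been fetched and reduced to a map from ledger id to a closed flag. All names are hypothetical, and phase two (the open-and-recover step this comment refers to) is deliberately not modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: hypothetical names, not the proposal's implementation. */
final class BootPhaseOneSketch {

    /**
     * Returns the ids of ledgers to mark as fenced and in-limbo.
     * Nothing is marked unless the cookie check failed or an unclean
     * shutdown was detected; closed ledgers are never marked.
     */
    static Set<Long> phaseOne(Map<Long, Boolean> ledgerClosed,
                              boolean cookieCheckFailed,
                              boolean uncleanShutdown) {
        Set<Long> fencedAndInLimbo = new TreeSet<>();
        if (cookieCheckFailed || uncleanShutdown) {
            for (Map.Entry<Long, Boolean> ledger : ledgerClosed.entrySet()) {
                if (!ledger.getValue()) {
                    fencedAndInLimbo.add(ledger.getKey());
                }
            }
        }
        return fencedAndInLimbo;
    }

    public static void main(String[] args) {
        Map<Long, Boolean> ledgerClosed = new LinkedHashMap<>();
        ledgerClosed.put(1L, true);   // closed ledger
        ledgerClosed.put(2L, false);  // open ledger
        ledgerClosed.put(3L, false);  // open ledger
        System.out.println("after crash: " + phaseOne(ledgerClosed, false, true));
        System.out.println("clean boot:  " + phaseOne(ledgerClosed, false, false));
    }
}
```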





[GitHub] [bookkeeper] Vanlightly commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r678083491



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)

Review comment:
       No, this is describing running without the journal where flushes only occur on entry log rotation. So unflushed data is lost.
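
To show why the lost entry matters for recovery, here is a minimal sketch of the recovery-read thresholds from the quoted text, using QC = (WQ - AQ) + 1. It assumes Scenario 2 runs with WQ=3 and AQ=2 (the scenario does not state its configuration). The names, the order of the checks and the PENDING state are assumptions of the sketch.

```java
/** Illustrative only: hypothetical names, not BookKeeper's recovery code. */
final class RecoveryReadSketch {

    enum Outcome { RECOVERABLE, UNRECOVERABLE, UNKNOWN, PENDING }

    static int quorumCoverage(int writeQuorum, int ackQuorum) {
        return (writeQuorum - ackQuorum) + 1;
    }

    static int ensembleCoverage(int ensembleSize, int ackQuorum) {
        return (ensembleSize - ackQuorum) + 1;
    }

    /** Classifies a recovery read from the responses received so far. */
    static Outcome classify(int writeQuorum, int ackQuorum,
                            int positives, int negatives, int unknowns) {
        int qc = quorumCoverage(writeQuorum, ackQuorum);
        if (positives >= ackQuorum) {
            return Outcome.RECOVERABLE;   // AQ positive read responses
        }
        if (negatives >= qc) {
            return Outcome.UNRECOVERABLE; // QC explicit negatives
        }
        if (unknowns >= qc) {
            return Outcome.UNKNOWN;       // QC unknown responses
        }
        if (positives + negatives + unknowns >= writeQuorum) {
            return Outcome.UNKNOWN;       // all responses in, no threshold met
        }
        return Outcome.PENDING;           // still waiting for responses
    }

    public static void main(String[] args) {
        // Scenario 2: B1 (entry lost) and B2 both answer NoSuchEntry.
        System.out.println(classify(3, 2, 0, 2, 0));
        // Same scenario if B1 answered with an unknown code instead.
        System.out.println(classify(3, 2, 0, 1, 1));
        // B3 then responds with the entry: still not enough to decide either way.
        System.out.println(classify(3, 2, 1, 1, 1));
    }
}
```

With two explicit negatives the sketch reports the entry as unrecoverable, which is the truncation in step 10. If B1 returns an unknown code instead, the negative threshold is not reached and recovery cannot close the ledger as empty.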





[GitHub] [bookkeeper] dlg99 commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r626964288



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.
+- Limbo ledgers that have been repaired have their limbo status cleared.
+
+### The Full Boot-Up Sequence
+
+This mechanism of limbo ledgers and self-repair needs to work hand-in hand with the cookie validation check. Combining everything together:
+
+On boot:
+1. Check for unclean shutdown and validate cookies
+2. Fetch the metadata for all ledgers in the cluster from ZK where the bookie is a member of its ensemble.
+3. Phase one:
+   - If the cookie check fails or unclean shutdown is detected:

Review comment:
       Should we keep the bookie as read-only until all ledgers are out of limbo to prevent slow bookies affecting the cluster? I assume that Phase Two creates additional IO and processes multiple ledgers in parallel.
   





[GitHub] [bookkeeper] fpj commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

fpj commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630059455



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new mechanism for data loss detection is checking for an unclean shutdown (aka a crash or abrupt termination of the bookie). When an unclean shutdown is detected further measures are taken to prevent data inconsistency. 
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers of the cluster are obtained and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests, but never respond with an explicit negative, all explicit negatives are converted to unknowns (with the use of a new code EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.

Review comment:
       Given that data is not synced to disk before acknowledgment, there is no guarantee that even a single copy will survive to enable repair, is that right? There is only a weak promise that a given entry is repairable, namely when some bookie managed to get it to disk before crashing.
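
To make the Limbo and Repair bullets concrete, here is a small sketch of how a bookie might answer reads while a ledger is in limbo. It assumes only what the proposal states: explicit negatives are converted to the new EUNKNOWN code until the ledger has been repaired. The class, its fields and the `repair` helper are hypothetical names, not BookKeeper APIs.

```python
OK = "OK"
NO_SUCH_ENTRY = "NoSuchEntry"
EUNKNOWN = "EUNKNOWN"  # new code named in the proposal


class BookieSketch:
    """Toy in-memory bookie, for illustration only."""

    def __init__(self):
        self.entries = {}   # (ledger_id, entry_id) -> payload
        self.limbo = set()  # ledger ids that may have lost data

    def read(self, ledger_id, entry_id):
        key = (ledger_id, entry_id)
        if key in self.entries:
            return OK, self.entries[key]
        # A limbo ledger must never answer with an explicit negative.
        if ledger_id in self.limbo:
            return EUNKNOWN, None
        return NO_SUCH_ENTRY, None

    def repair(self, ledger_id, entries_from_peers):
        # Copy whatever the peers still hold; entries no peer has stay missing.
        for entry_id, payload in entries_from_peers.items():
            self.entries.setdefault((ledger_id, entry_id), payload)
        self.limbo.discard(ledger_id)


bookie = BookieSketch()
bookie.limbo.add(1)                       # unclean shutdown detected on boot
before = bookie.read(1, 0)                # EUNKNOWN rather than NoSuchEntry
bookie.repair(1, {0: b"e0"})              # a peer still had entry 0
after = bookie.read(1, 0)
```

The sketch also shows the limit raised in the comment above: `repair` can only restore entries that at least one peer still holds.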





[GitHub] [bookkeeper] fpj commented on a change in pull request #2706: BP-44: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

fpj commented on a change in pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#discussion_r630040878



##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.

Review comment:
       Writeset is the set of bookies that have acknowledged? Fragment maps to a contiguous range of entry ids?
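
For readers with the same question, a short sketch of the two cohorts may help. It assumes that the writeset of an entry is the WQ bookies the entry is sent to (the proposal says recovery reads are "sent to writeset of entry"), chosen by round-robin striping over the ensemble, and that a fragment is a contiguous range of entries sharing one ensemble. Both are assumptions here, and all function names are hypothetical. The QC and EC formulas are the ones from the proposal.

```python
def write_set(ensemble, entry_id, write_quorum):
    """Bookies an entry is sent to, assuming round-robin striping."""
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def quorum_coverage(write_quorum, ack_quorum):
    # QC = (WQ - AQ) + 1, cohort is the writeset of one entry.
    return (write_quorum - ack_quorum) + 1


def ensemble_coverage(ensemble_size, ack_quorum):
    # EC = (E - AQ) + 1, cohort is the ensemble of the current fragment.
    return (ensemble_size - ack_quorum) + 1


def covers(cohort, satisfied, ack_quorum):
    """True if no ack quorum within the cohort fails the property."""
    return len(set(cohort) - set(satisfied)) < ack_quorum


ensemble = ["B1", "B2", "B3", "B4"]
ws = write_set(ensemble, entry_id=5, write_quorum=3)

# Scenario 1 (e3:w3:a2): fencing acknowledged by B1 and B2 only.
fenced_enough = covers(["B1", "B2", "B3"], ["B1", "B2"], ack_quorum=2)
```

With `e3:w3:a2`, both QC and EC evaluate to 2, which matches step 9 of Scenario 1, where two fencing acknowledgements are enough to proceed.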

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.
+
+However, running without the journal would introduce data consistency problems as the BookKeeper Replication Protocol requires that all writes are persistent for correctness. Running without the journal introduces the possibility of lost writes. In order to continue to offer strong data safety and support running without the journal, changes to the protocol are required.
+
+### A note on Response Codes
+
+The following categories are relevant:
+
+- Positive: OK
+- Explicit Negative: NoSuchEntry/NoSuchLedger
+- Unknown: Any other non-success response that is not an explicit negative.
+
+For correctness explicit negatives must be treated differently than other errors.
+
+### A note on Quorums
+
+In order to explain the protocol changes, it is useful to first consider how quorums are used for safety. We have the following relevant quorums:
+
+- Single bookie (S)
+- Ack quorum (AQ)
+- Write quorum (WQ)
+- Quorum Coverage (QC) where QC = (WQ - AQ) + 1
+- Ensemble Coverage (EC) where EC = (E - AQ) + 1
+- All bookies (ALL)
+
+Quorum Coverage (QC) and Ensemble Coverage (EC) are both defined by the following, only the cohorts differ: 
+
+- A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
+- There exists no ack quorum of bookies that do not satisfy the property within the cohort.
+
+For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.
+
+For example:
+
+- For fencing, we need to ensure that no AQ of bookies is unfenced before starting the read/write phase of recovery. This is true once EC successful fencing responses have been received.
+- For a recovery read, a read is only negative once we know that no AQ of bookies could exist that might have the entry. Doing otherwise could truncate committed entries from a ledger. A read is negative once NoSuchEntry responses reach QC.
+
+Different protocol actions require different quorums:
+
+- Add entry: AQ success responses
+- Read entry:
+  - Positive when positive response from a single bookie
+  - Negative when explicit negative from all bookies
+  - Unknown: when at least one unknown and no positive from all bookies
+- Fencing phase, LAC read (sent to ensemble of current fragment):
+  - Complete when EC positive responses
+  - Unknown (cannot make progress) when AQ unknown responses (fencing LAC reads cannot cause an explicit negative as fencing creates the ledger on the bookie if it doesn’t exist)
+- Recovery read (sent to writeset of entry):
+  - Entry recoverable: AQ positive read responses
+  - Entry Unrecoverable: QC negative read responses
+  - Unknown (cannot make progress):
+    - QC unknown responses or
+    - All responses received, but not enough for either a positive or negative
+
+
+### Impact of Undetected Data Loss on Consistency
+
+The ledger recovery process assumes that ledger entries are never arbitrarily lost. In the event of the loss of an entry, the recovery process can:
+- allow the original client to keep writing entries to a ledger that has just been fenced and closed, thus losing those entries 
+- allow the recovery client to truncate the ledger too soon, closing it with a last entry id lower than that of previously acknowledged entries - thus losing data.
+
+### Scenario 1 - Lost Fenced Status Allows Writes After Ledger Close
+
+1. 3 bookies, B1, B2 & B3
+2. 2 clients, C1 & C2
+3. 1 ledger, L1, with e3:w3:a2 configuration.
+4. C1 writes entry E1 to L1. The write hits all three bookies.
+5. C1 hangs for an indeterminate length of time. 
+6. C2 sees that C1 is unresponsive, and assumes it has failed. C2 tries to recover the ledger L1.
+7. C2 sends a fencing message for L1 to all bookies in the ensemble.
+8. The fencing message succeeds in arriving at B1 & B2 and is acknowledged by both. The message to B3 is lost. 
+9. C2 sees that at least one bookie in each possible ack quorum has acknowledged the fencing message (EC threshold reached), so continues with the read/write phase of recovery, finding that E1 is the last entry of the ledger, and committing the endpoint of the ledger in the ZK.
+10. B2 crashes and boots again with all disks cleared or unflushed operations lost. 
+11. C1 wakes up and writes entry E2 to all bookies. B2 & B3 acknowledge positively, so C1 considers E2 as persisted. B1 rejects the message as the ledger is fenced, but since ack quorum is 2, B2 & B3 are enough to consider the entry written.
+
+### Scenario 2 - Recovery Truncates Previously Acknowledged Entries
+
+1. C1 adds E0 to B1, B2, B3
+2. B1 and B3 confirm. C1 confirms the write to its client.
+3. C2 starts recovery
+4. B2 fails to respond. C1 tries to change ensemble but gets a metadata version conflict.
+5. B1 crashes and restarts, having lost E0 (undetected)
+6. C2 fences the ledger on B1, B2, B3
+7. C2 sends Read E0 to B1, B2, B3
+8. B1 responds with NoSuchEntry
+9. B2 responds with NoSuchEntry
+10. QC negative response threshold reached. C2 closes the ledger as empty, losing E0.
+
+The problem is that a bookie can:
+- lose the fenced status of a previously existing ledger
+- respond with an explicit negative even though it had previously seen an entry. 
+
+Undetected data loss could occur in the following ways:
+- Running without the journal. Bookie crashes and loses most recent entries and fence statuses that had not yet been written and synced to disk.
+- Bookie is restarted with one or more disks empty - through some kind of automation error.
+
+The first case is the main subject of this proposal as it is not covered by any existing mechanisms. The second case is already protected against by the use of cookies.
+
+### A note on cookies
+
+Cookies play an essential part in the bookkeeper replication protocol, but their purpose is often unclear. 
+
+When a bookie boots for the first time, it generates a cookie. The cookie encapsulates the identity of the bookie and should be considered immutable. This identity contains the advertised address of the bookie, the disks used for the journal, index, and ledger storage, and a unique ID. The bookie writes the cookie to ZK and each of the disks in use. On all subsequent boots, if the cookie is missing from any of these places, the bookie fails to boot.
+
+The absence of a disk's cookie implies that the rest of the disk's data is also missing. Cookie validation is performed on boot-up and prevents the boot from succeeding if the validation fails, thus preventing the bookie starting with undetected data loss. 
+
+This proposal improves the cookie mechanism by automating the resolution of a cookie validation error which currently requires human intervention to resolve.
+
+### Proposed Changes
+
+The proposed changes involve:
+- A new config that controls whether add operations go into the journal
+- Detecting possible data loss on boot
+- Prevent explicit negative responses when data loss may have occurred, instead reply with unknown code, until data is repaired.
+- Repair data loss
+- Auto fix cookies
+
+In these proposed changes, when running "without" the journal, the journal still exists, but add entry operations skip the addition to the journal. The boot-up sequence still replays the journal.
+
+Add operations can be configured to be written to the journal or not based on a new config `journalWriteData`. When set to `false`, add operations are not added to the journal.
+
+### Detecting Data Loss On Boot
+
+The new data loss detection mechanism checks for an unclean shutdown (that is, a crash or abrupt termination of the bookie). When an unclean shutdown is detected, further measures are taken to prevent data inconsistency.
+
+Cookie validation will continue to be used to detect booting with one or more missing or empty disks (that once existed and contained a cookie).
+
+### Protection Mechanism
+
+Once possible data loss has been detected the following protection mechanism is carried out during the boot:
+
+- Fencing: Ledger metadata for all ledgers in the cluster is obtained, and all those ledgers are fenced on this bookie. This prevents data loss scenario 1.
+- Limbo: All open ledgers are placed in the limbo status. Limbo ledgers can serve read requests but never respond with an explicit negative; all explicit negatives are converted to unknowns (using a new code, EUNKNOWN).
+- Recovery: All open ledgers are opened and recovered.
+- Repair: Each ledger is scanned and any missing entries are sourced from peers.

Review comment:
       Given that the data is not synced to disk on acknowledgment, there is no guarantee that even a single copy will survive to enable repair, is that right? There is only a weak guarantee that any given entry is repairable, namely when some bookie was able to get it to disk before crashing.
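
For readers following the Limbo step quoted in the hunk above, here is a minimal Java sketch of how a bookie might convert explicit negatives into an unknown response while a ledger is in limbo. It is an illustration only and not code from the pull request: the class `LimboReadSketch`, its methods, and the `NO_SUCH_ENTRY` / `NO_SUCH_LEDGER` codes are hypothetical names; only `EUNKNOWN` is named in the proposal.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are taken from the BookKeeper codebase. */
public class LimboReadSketch {

    /** Illustrative response codes. Only EUNKNOWN appears in the proposal text. */
    public enum Code { OK, NO_SUCH_ENTRY, NO_SUCH_LEDGER, EUNKNOWN }

    // Ledgers placed in limbo after an unclean shutdown was detected on boot.
    private final Set<Long> limboLedgers = new HashSet<>();

    /** Called during boot for each open ledger once possible data loss is detected. */
    public void markLimbo(long ledgerId) {
        limboLedgers.add(ledgerId);
    }

    /** Assumed to be called once the ledger has been recovered and repaired. */
    public void clearLimbo(long ledgerId) {
        limboLedgers.remove(ledgerId);
    }

    /**
     * Maps the result of a local lookup to the code sent back to the client.
     * Positive results pass through unchanged; explicit negatives for a limbo
     * ledger become EUNKNOWN, because the entry may have been lost locally.
     */
    public Code respond(long ledgerId, Code localResult) {
        boolean explicitNegative =
                localResult == Code.NO_SUCH_ENTRY || localResult == Code.NO_SUCH_LEDGER;
        if (explicitNegative && limboLedgers.contains(ledgerId)) {
            return Code.EUNKNOWN;
        }
        return localResult;
    }
}
```

The point of the mapping is that a reader never treats "this bookie does not have the entry" as authoritative while the bookie may have lost data; once the limbo status is cleared, negatives are reported as usual.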

##########
File path: site/bps/BP-44-run-without-journal.md
##########
@@ -0,0 +1,203 @@
+---
+title: "BP-44: Running without the journal"
+issue: https://github.com/apache/bookkeeper/2705
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+
+The journal allows for fast add operations that provide strong data safety guarantees. An add operation is only acked to a client once written to the journal and an fsync performed. This however means that every entry must be written twice: once to the journal and once to an entry log file.
+
+This double write increases the cost of ownership as more disks must be provisioned to service requests and makes disk provisioning more complex (separating journal from entry log writes onto separate disks). Running without the journal would halve the disk IO required (ignoring indexes) thereby reducing costs and simplifying provisioning.

Review comment:
       Running without the entry log also halves the IOs, so why keep the journal and not the entry log? It would be useful to explain this as part of the motivation.





[GitHub] [bookkeeper] Vanlightly commented on pull request #2706: BP-46: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

Vanlightly commented on pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706#issuecomment-957355091


   I have updated the BP number to 46 as I used 44 also for the USE metrics BP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [bookkeeper] dlg99 merged pull request #2706: BP-46: Running without journal proposal

Posted by GitBox <gi...@apache.org>.

dlg99 merged pull request #2706:
URL: https://github.com/apache/bookkeeper/pull/2706


   


