You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2020/04/03 20:25:00 UTC

[GitHub] [bookkeeper] ivankelly commented on a change in pull request #2281: [bookie-ledger-recovery] Fix bookie recovery stuck even with enough ack-quorum response

ivankelly commented on a change in pull request #2281: [bookie-ledger-recovery] Fix bookie recovery stuck even with enough ack-quorum response
URL: https://github.com/apache/bookkeeper/pull/2281#discussion_r403306604
 
 

 ##########
 File path: bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java
 ##########
 @@ -137,8 +140,15 @@ public synchronized void readEntryComplete(final int rc, final long ledgerId, fi
         }
 
         if (numResponsesPending == 0 && !completed) {
 
 Review comment:
   @eolivelli @sijie 
   
    This fix is the wrong fix. This problem is the same problem as #2273 , and we've also seen internally. The root cause is that quorum coverage is broken. 5e399df broke it. 5e399df was a change in the write path, so why it touched the quorum coverage in the read path is a mystery. dcdd1e88 supplemented the break which and added a test case. The test case tests exactly the wrong behaviour Pre-5e399df it worked correctly. 
   
   It seems there is confusion as to what the quorum coverage is supposed to do.
   
   The quorum coverage set is there to check if we have received a valid response from enough nodes that there cannot exist an entry which has been acknowledged to the client which has not touched at least one of these nodes. In other words, there cannot exist an entry which formed an ack quorum without touching one of the nodes we have talked to.
   
   The quorum coverage set should only have two states for a node, known or unknown. The known state means that we know definitely than an entry does or does not exist on that node. So, an OK, NoSuchLedger and NoSuchEntry response will move the node into a known state. Anything else is an unknown state. No response from node is unknown state. Node had an IOException is an unknown state.
   
   Take the test supplied with dcdd1e88:
   ```
           RoundRobinDistributionSchedule schedule = new RoundRobinDistributionSchedule(
               5, 3, 5);
           DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
           covSet.addBookie(0, BKException.Code.NoSuchLedgerExistsException);
           covSet.addBookie(1, BKException.Code.NoSuchEntryException);
           covSet.addBookie(2, BKException.Code.NoSuchLedgerExistsException);
           covSet.addBookie(3, BKException.Code.UNINITIALIZED);
           covSet.addBookie(4, BKException.Code.UNINITIALIZED);
           assertFalse(covSet.checkCovered());
   ```
   For clarify, w is 5, a is 3, e is 5. RoundRobinDistributionSchedule parameters are messed up.
   
   This test is simply wrong. The coverage set is covered. There is no way an entry can have ack quorum in any write quorum. There is no set of 3 bookies in the unknown state.
   
   Anyhow, I have a fix going through CI internally. I'll push up a PR on monday.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services