Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2019/09/23 17:40:22 UTC

[GitHub] [bookkeeper] reddycharan opened a new pull request #2166: Enhance deferLedgerLockReleaseOfFailedLedger in ReplicationWorker

reddycharan opened a new pull request #2166: Enhance deferLedgerLockReleaseOfFailedLedger in ReplicationWorker
URL: https://github.com/apache/bookkeeper/pull/2166
 
 
   
   
   Descriptions of the changes in this PR:
   
   **(Note: still working on test cases; more comments need to be added to the changes.)**
   
   **Issue:** In the past, the ReplicationWorker (RW) retry logic was enhanced to back off
   replication after a threshold number of replication failures for a ledger. This
   helps in the pathological situation where data (a ledger or entry) is unavailable.
   But it is a sub-optimal solution, because each RW can try to recover a ledger the
   threshold number of times before that RW defers ledgerLockRelease. Also, each time an
   RW tries to recover a ledger, it reads entries/fragments sequentially and writes them
   to new bookies until it hits a missing (completely unavailable) entry, and only then
   fails replication of that ledger. This happens on every retry and bloats storage;
   the extra copies then have to be detected and deleted by the over-replication
   cleanup, which runs once a day by default. As a result, the cluster can run out of
   storage space and may become a read-only (RO) cluster. It also puts quite a bit of
   load on the cluster in vain.
   
   **So the new proposal is to**
   - On each RW, remember the state in addition to the counter. The state must include the entries that the RW failed to read.
   - The counter and state must be kept around on each RW node, and exponential backoff should be used for deferLedgerLockReleaseOfFailedLedger.
   - During the next attempt by the RW, first try to read the failed entries noted in the state. These reads must succeed before proceeding with replication.
   - With this model we avoid duplicate copies on each attempt. At most, each RW will create only one copy.
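
A minimal sketch of the bookkeeping proposed above, to make the idea concrete. It is not the code in this PR: the class `ReplicationFailureTracker`, its method names, and the base/max defer parameters are all hypothetical, and the entry read is abstracted behind a `LongPredicate` rather than a real BookKeeper client call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.LongPredicate;

/** Hypothetical per-RW bookkeeping; not part of the BookKeeper API. */
class ReplicationFailureTracker {
    /** Failure counter plus the ids of the entries the RW could not read. */
    static final class FailedLedgerState {
        int failureCount;
        final Set<Long> unreadableEntries = new TreeSet<>();
    }

    private final Map<Long, FailedLedgerState> states = new HashMap<>();
    private final long baseDeferMillis; // assumed tunable, not a real config key
    private final long maxDeferMillis;  // assumed upper bound on the deferral

    ReplicationFailureTracker(long baseDeferMillis, long maxDeferMillis) {
        this.baseDeferMillis = baseDeferMillis;
        this.maxDeferMillis = maxDeferMillis;
    }

    /** Remember one failed replication attempt and the entries that were unreadable. */
    void recordFailure(long ledgerId, Set<Long> failedEntryIds) {
        FailedLedgerState s = states.computeIfAbsent(ledgerId, id -> new FailedLedgerState());
        s.failureCount++;
        s.unreadableEntries.addAll(failedEntryIds);
    }

    /** Exponential backoff: base * 2^(failures - 1), capped at maxDeferMillis. */
    long lockReleaseDeferMillis(long ledgerId) {
        FailedLedgerState s = states.get(ledgerId);
        if (s == null) {
            return 0L; // no recorded failure, nothing to defer
        }
        int shift = Math.min(s.failureCount - 1, 30);
        // Compare before shifting so the multiplication cannot overflow.
        return baseDeferMillis > (maxDeferMillis >> shift)
                ? maxDeferMillis
                : baseDeferMillis << shift;
    }

    /** Probe the previously failed entries; replicate only if all are readable now. */
    boolean canProceed(long ledgerId, LongPredicate entryReadable) {
        FailedLedgerState s = states.get(ledgerId);
        if (s == null) {
            return true;
        }
        for (long entryId : s.unreadableEntries) {
            if (!entryReadable.test(entryId)) {
                return false; // still missing: skip replication, write no new copy
            }
        }
        return true;
    }

    /** Forget the state once the ledger has been replicated successfully. */
    void clear(long ledgerId) {
        states.remove(ledgerId);
    }
}
```

In this sketch an RW would call `canProceed` before starting to copy fragments, so a ledger whose entries are still unavailable costs only a few probe reads instead of another partial copy on new bookies.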
