Posted to dev@bookkeeper.apache.org by "Sijie Guo (JIRA)" <ji...@apache.org> on 2016/12/17 01:44:58 UTC

[jira] [Resolved] (BOOKKEEPER-946) Provide an option to delay auto recovery of lost bookies

     [ https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sijie Guo resolved BOOKKEEPER-946.
----------------------------------
    Resolution: Fixed

Issue resolved by merging pull request 82
[https://github.com/apache/bookkeeper/pull/82]

{noformat}
commit 669ab4ac32bcbf6b3d883a07ed942d36d25b8a6e
Author:     Rithin <ri...@salesforce.com>
AuthorDate: Fri Dec 16 17:44:24 2016 -0800
Commit:     Sijie Guo <si...@apache.org>
CommitDate: Fri Dec 16 17:44:24 2016 -0800

    BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies

    Fixing a bug in the test AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies which
    fails sometimes with:
    AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies:367->_testDelayedAuditOfLostBookies:345 audit of lost bookie isn't delayed

    Author: Rithin <ri...@salesforce.com>

    Reviewers: Enrico Olivelli <eo...@gmail.com>, Sijie Guo <si...@apache.org>

    Closes #82 from rithin-shetty/audit_delay_fix

{noformat}
            

> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
>                 Key: BOOKKEEPER-946
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.5.0
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>             Fix For: 4.5.0
>
>         Attachments: org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt, org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even if it loses its ZooKeeper (zk) connection intermittently, the auditor detects it as a lost bookie and starts under-replication detection, and the replication workers on other bookie nodes start replicating the under-replicated ledgers. All of this stops once the bookie comes back up, but by then a few ledgers will already have been replicated. Since we have multiple copies of the data, it is probably not necessary to start recovery as soon as a bookie goes down. We can probably wait for an hour or so and then start recovery. This should cover cases like planned upgrades, intermittent network connectivity loss, etc. The amount of time to wait can be an option, and the default would be to not wait at all (i.e. retain the current behavior).
> Of course, if more than one bookie goes down within a short interval, we could decide to start auto recovery without waiting.
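
The decision rule described in the issue could be sketched roughly as follows. This is only an illustration and not the code merged in pull request 82: the class name LostBookieAuditPolicy, the method delayBeforeAuditSeconds, and the use of a plain count of currently lost bookies to stand in for "more than one bookie goes down within a short interval" are all assumptions made for the example.

```java
/**
 * Hypothetical helper (not part of BookKeeper): decides how long the
 * auditor should wait before auditing after bookies are detected as lost.
 */
final class LostBookieAuditPolicy {

    // Configured wait in seconds; 0 retains the current behavior (no wait).
    private final long recoveryDelaySeconds;

    LostBookieAuditPolicy(long recoveryDelaySeconds) {
        if (recoveryDelaySeconds < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.recoveryDelaySeconds = recoveryDelaySeconds;
    }

    /**
     * Returns the number of seconds to wait before starting the audit.
     *
     * @param lostBookieCount bookies currently considered lost (assumed >= 1)
     */
    long delayBeforeAuditSeconds(int lostBookieCount) {
        if (lostBookieCount < 1) {
            throw new IllegalArgumentException("no lost bookies to audit");
        }
        // More than one bookie is down: start auto recovery without waiting.
        if (lostBookieCount > 1) {
            return 0L;
        }
        // Exactly one bookie is down: give it time to come back first.
        return recoveryDelaySeconds;
    }
}
```

With a one-hour delay, new LostBookieAuditPolicy(3600).delayBeforeAuditSeconds(1) returns 3600, while a count of 2 returns 0. A caller would presumably schedule the audit after the returned delay and cancel it if the bookie comes back in the meantime; that scheduling is left out of the sketch.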



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)