Posted to dev@bookkeeper.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/09/14 23:02:21 UTC

[jira] [Commented] (BOOKKEEPER-946) Provide an option to delay auto recovery of lost bookies

    [ https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491691#comment-15491691 ] 

ASF GitHub Bot commented on BOOKKEEPER-946:
-------------------------------------------

GitHub user rithin-shetty opened a pull request:

    https://github.com/apache/bookkeeper/pull/58

    BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies

    If auto recovery is enabled and a bookie goes down for an upgrade, or even if it loses its ZooKeeper
    connection intermittently, the auditor detects it as a lost bookie and starts under-replication
    detection, and the replication workers on other bookie nodes start replicating the under-replicated
    ledgers. All of this stops once the bookie comes back up, but by then a few ledgers will already
    have been replicated. Since we have multiple copies of the data, it is probably not necessary to
    start recovery as soon as a bookie goes down. We can wait for an hour or so and then start recovery.
    This should cover cases like planned upgrades, intermittent network connectivity loss, etc.
    
    This change:
        1) Provides a bookie option 'lostBookieRecoveryDelay' in seconds which, when set to a nonzero
           value, delays the start of recovery by that number of seconds. By default, this option is
           set to 0, which means the audit is not delayed.
        2) If another bookie goes down during this interval, recovery is started immediately and the
           audit scheduled for the future is canceled.
        3) Adds counters to track how many audits were delayed (#1) and how many scheduled audits were
           canceled due to multiple bookie failures (#2).
        4) Adds three JUnit tests to verify the new feature.
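
The behavior described in points 1 to 3 can be sketched roughly as follows. This is a minimal illustration and not the code in the pull request: the class name, method names, and counter fields are hypothetical, and only the option name 'lostBookieRecoveryDelay' and the described semantics come from the description above. It assumes a single scheduler thread and stands in for the real audit with a counter increment.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a delayed audit; not the actual BookKeeper Auditor.
class DelayedAuditSketch {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    // Mirrors the 'lostBookieRecoveryDelay' option (seconds); 0 means no delay.
    private final long lostBookieRecoveryDelaySecs;
    // Audit scheduled for the future, if any. Guarded by "this".
    private ScheduledFuture<?> pendingAudit;

    // Illustrative counters; the names are assumptions.
    final AtomicInteger auditsRun = new AtomicInteger();
    final AtomicInteger auditsDelayed = new AtomicInteger();
    final AtomicInteger delayedAuditsCancelled = new AtomicInteger();

    DelayedAuditSketch(long lostBookieRecoveryDelaySecs) {
        this.lostBookieRecoveryDelaySecs = lostBookieRecoveryDelaySecs;
    }

    // Called whenever a bookie is detected as lost.
    synchronized void onBookieLost() {
        if (lostBookieRecoveryDelaySecs == 0) {
            runAudit(); // default: audit immediately
            return;
        }
        if (pendingAudit != null && !pendingAudit.isDone()) {
            // Another bookie went down within the delay window:
            // cancel the scheduled audit and start recovery right away.
            if (pendingAudit.cancel(false)) {
                delayedAuditsCancelled.incrementAndGet();
            }
            pendingAudit = null;
            runAudit();
            return;
        }
        // First loss: postpone the audit by the configured delay.
        auditsDelayed.incrementAndGet();
        pendingAudit = executor.schedule(
                this::runAudit, lostBookieRecoveryDelaySecs, TimeUnit.SECONDS);
    }

    // Stand-in for under-replication detection.
    private void runAudit() {
        auditsRun.incrementAndGet();
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        DelayedAuditSketch auditor = new DelayedAuditSketch(3600);
        auditor.onBookieLost(); // delayed
        auditor.onBookieLost(); // second loss: cancel and audit now
        System.out.println("run=" + auditor.auditsRun.get()
                + " delayed=" + auditor.auditsDelayed.get()
                + " cancelled=" + auditor.delayedAuditsCancelled.get());
        auditor.shutdown();
    }
}
```

In this sketch a bookie that comes back before the delay expires is not handled explicitly; the scheduled audit would simply run and find nothing under-replicated.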

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rithin-shetty/bookkeeper audit_delay

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/bookkeeper/pull/58.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #58
    
----
commit 1294c2f8cb80b66493a8bd314a999aae757d2e94
Author: Rithin <ri...@salesforce.com>
Date:   2016-09-14T22:31:39Z

    BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies
    
    If auto recovery is enabled and a bookie goes down for an upgrade, or even if it loses its ZooKeeper
    connection intermittently, the auditor detects it as a lost bookie and starts under-replication
    detection, and the replication workers on other bookie nodes start replicating the under-replicated
    ledgers. All of this stops once the bookie comes back up, but by then a few ledgers will already
    have been replicated. Since we have multiple copies of the data, it is probably not necessary to
    start recovery as soon as a bookie goes down. We can wait for an hour or so and then start recovery.
    This should cover cases like planned upgrades, intermittent network connectivity loss, etc.
    
    This change:
        1) Provides a bookie option 'lostBookieRecoveryDelay' in seconds which, when set to a nonzero
           value, delays the start of recovery by that number of seconds. By default, this option is
           set to 0, which means the audit is not delayed.
        2) If another bookie goes down during this interval, recovery is started immediately and the
           audit scheduled for the future is canceled.
        3) Adds counters to track how many audits were delayed (#1) and how many scheduled audits were
           canceled due to multiple bookie failures (#2).
        4) Adds three JUnit tests to verify the new feature.

----


> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
>                 Key: BOOKKEEPER-946
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.5.0
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>            Priority: Minor
>             Fix For: 4.5.0
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even if it loses its ZooKeeper connection intermittently, the auditor detects it as a lost bookie and starts under-replication detection, and the replication workers on other bookie nodes start replicating the under-replicated ledgers. All of this stops once the bookie comes back up, but by then a few ledgers will already have been replicated. Since we have multiple copies of the data, it is probably not necessary to start recovery as soon as a bookie goes down. We can probably wait for an hour or so and then start recovery. This should cover cases like planned upgrades, intermittent network connectivity loss, etc. The amount of time to wait can be an option, and the default would be to not wait at all (i.e. retain the current behavior).
> Of course, if more than one bookie goes down within a short interval, we could decide to start auto recovery without waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)