Posted to dev@bookkeeper.apache.org by "Ivan Kelly (JIRA)" <ji...@apache.org> on 2012/05/30 17:31:23 UTC

[jira] [Created] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Ivan Kelly created BOOKKEEPER-278:
-------------------------------------

             Summary: Ability to disable auto recovery temporarily
                 Key: BOOKKEEPER-278
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
             Project: Bookkeeper
          Issue Type: Sub-task
            Reporter: Ivan Kelly
             Fix For: 4.2.0


Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as recovery gets kicked off. Therefore we need a way to temporarily disable it.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458532#comment-13458532 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

[~rakeshr] sorry for taking so long to get back to you on this one.


Consider the following.
There are three bookies, A B C, with ledgers on all three, 1 2 3 4 5.

# The disable Znode is set.
# A is taken down for upgrade.
# The auditor sees that bookie A is down.
# The auditor builds a list of the ledgers on bookie A, which are 1, 2, 3, 4 & 5
# The auditor starts marking these ledgers as underreplicated and enters the WAITING state.

Now the upgrade of all bookies continues as expected. Once it has finished:

# The disable znode is unset.
# The auditor exits the WAITING state and marks 1, 2, 3, 4 & 5 as underreplicated.

Now this isn't a huge problem, as the replication worker will see that there are in fact no fragments unavailable, but it does induce an extra check implicitly.

It would be better for the auditor to check whether auto recovery is enabled after seeing a bookie drop, and only build the index and mark the ledgers if it is enabled.
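
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: AuditorSketch, onBookieLost and underreplicatedLedgers are hypothetical names, and the enabled check is passed in as a plain BooleanSupplier where the real auditor would look at the disable znode.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the auditor; not the real Auditor class. */
final class AuditorSketch {
    // Assumed index: bookie name -> ids of the ledgers stored on it.
    private final Map<String, Set<Long>> ledgersByBookie;
    // Assumed hook: returns false while the "disable" flag is set.
    private final BooleanSupplier replicationEnabled;
    private final Set<Long> underreplicated = new TreeSet<>();

    AuditorSketch(Map<String, Set<Long>> ledgersByBookie, BooleanSupplier replicationEnabled) {
        this.ledgersByBookie = ledgersByBookie;
        this.replicationEnabled = replicationEnabled;
    }

    /** Called when a bookie is seen to drop. Returns true if any ledger was marked. */
    boolean onBookieLost(String bookie) {
        // Check the flag first: while disabled, neither look up the
        // bookie's ledgers nor mark anything as underreplicated.
        if (!replicationEnabled.getAsBoolean()) {
            return false;
        }
        Set<Long> ledgers = ledgersByBookie.getOrDefault(bookie, Collections.emptySet());
        underreplicated.addAll(ledgers);
        return !ledgers.isEmpty();
    }

    Set<Long> underreplicatedLedgers() {
        return Collections.unmodifiableSet(underreplicated);
    }
}
```

With the flag checked up front, taking bookie A down during an upgrade leaves nothing queued to be marked once the flag is cleared.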
                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395878#comment-13395878 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

It also needs to be possible to disable it temporarily during runtime. For example, if you have 5 bookies, and want to upgrade them without taking down the whole system, you need to upgrade one at a time, so that there will always be enough bookies up for clients to form an ensemble. In this case, you don't want autorereplication to occur when you take the bookie down to upgrade it.
                

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469330#comment-13469330 ] 

Hudson commented on BOOKKEEPER-278:
-----------------------------------

Integrated in bookkeeper-trunk #737 (See [https://builds.apache.org/job/bookkeeper-trunk/737/])
    BOOKKEEPER-278: Ability to disable auto recovery temporarily (rakeshr via ivank) (Revision 1393983)

     Result = UNSTABLE
ivank : 
Files : 
* /zookeeper/bookkeeper/trunk/CHANGES.txt
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/LedgerUnderreplicationManager.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/meta/ZkLedgerUnderreplicationManager.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationEnableCb.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/ZkUtils.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/replication/AuditorLedgerCheckerTest.java
* /zookeeper/bookkeeper/trunk/bookkeeper-server/src/test/java/org/apache/bookkeeper/replication/TestLedgerUnderreplicationManager.java

                

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454727#comment-13454727 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Hi Ivan, could you please go through my comments above? I would like to know your opinion on this.

Thanks,
Rakesh
                

[jira] [Updated] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh R updated BOOKKEEPER-278:
--------------------------------

    Attachment: BOOKKEEPER-278.2.patch
    

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395852#comment-13395852 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

I'm thinking of providing a new configuration item, 'enableAutoRecovery=true', in the bk_server.conf file.
By default this feature will be disabled.

This will also be added to ServerConfiguration.java as follows:

{code}
    /**
     * Is Auto recovery enabled
     * 
     * @return true if auto recovery is enabled
     */
    public boolean isAutoRecoveryEnabled() {
        return getBoolean(ENABLE_AUTORECOVERY, false);
    }

    /**
     * Turn on/off Auto recovery
     * 
     * @param enabled
     *            Whether auto recovery enabled or not.
     * @return server configuration
     */
    public ServerConfiguration setAutoRecoveryEnabled(boolean enabled) {
        setProperty(ENABLE_AUTORECOVERY, Boolean.toString(enabled));
        return this;
    }
{code}
                

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400404#comment-13400404 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

bq.I think it would be better to globally disable it for the duration of a rolling upgrade

OK, I got it. It's a simple approach: disable globally. I hope we will not have any use case for disabling rereplication of a single bookie?

                

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400378#comment-13400378 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

I think it would be better to globally disable it for the duration of a rolling upgrade. The administrator sets a flag (by running a command etc), which creates a /ledgers/rereplication/disable znode. Before running any rereplication, the recovery worker checks if this znode exists. If it does, it waits until the znode disappears before doing any work.
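
The "check, then wait until it disappears" behaviour can be sketched with plain JDK primitives. This is only an illustration under stated assumptions: DisableFlagSketch and its methods are hypothetical names, the flag lives in memory rather than in ZooKeeper, and a real worker would presumably use a watch on the disable znode instead of a Condition.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical in-memory stand-in for the "disable" znode. */
final class DisableFlagSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition cleared = lock.newCondition();
    private boolean disabled;

    /** Corresponds to the administrator creating the disable znode. */
    void disable() {
        lock.lock();
        try {
            disabled = true;
        } finally {
            lock.unlock();
        }
    }

    /** Corresponds to the administrator deleting the disable znode. */
    void enable() {
        lock.lock();
        try {
            disabled = false;
            cleared.signalAll(); // wake every waiting worker
        } finally {
            lock.unlock();
        }
    }

    /**
     * Called by a worker before it does any rereplication.
     * Returns true once rereplication is allowed, or false if the
     * timeout expired (or the thread was interrupted) while still disabled.
     */
    boolean awaitEnabled(long timeout, TimeUnit unit) {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (disabled) {
                if (nanos <= 0L) {
                    return false;
                }
                nanos = cleared.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

A worker loop would call awaitEnabled before picking up each underreplicated ledger, so work already in progress finishes but no new work starts while the flag is set.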
                

        

[jira] [Updated] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh R updated BOOKKEEPER-278:
--------------------------------

    Attachment: BOOKKEEPER-278.patch
    

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460557#comment-13460557 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

New patch looks good. isEnabledLedgerReplication should be split into two calls though. 
{code}
boolean isLedgerReplicationEnabled();
void notifyLedgerReplicationEnabled(GenericCallback<Void> cb);
{code}
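
As a rough illustration of how the two calls could divide the work, here is a self-contained sketch. Everything other than the two method names is an assumption: EnabledCallbackSketch stands in for GenericCallback<Void>, InMemoryReplicationSwitch is a hypothetical in-memory implementation, and firing the callback immediately when already enabled is a guess at the intended semantics.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for GenericCallback<Void>. */
interface EnabledCallbackSketch {
    void operationComplete(int rc);
}

/** Hypothetical in-memory implementation of the two proposed calls. */
final class InMemoryReplicationSwitch {
    private boolean enabled = true;
    private final List<EnabledCallbackSketch> waiting = new ArrayList<>();

    /** Non-blocking query: is ledger replication enabled right now? */
    synchronized boolean isLedgerReplicationEnabled() {
        return enabled;
    }

    /** Registers a one-shot callback that fires once replication is enabled. */
    void notifyLedgerReplicationEnabled(EnabledCallbackSketch cb) {
        synchronized (this) {
            if (!enabled) {
                waiting.add(cb);
                return;
            }
        }
        cb.operationComplete(0); // already enabled: fire immediately (assumed)
    }

    synchronized void disable() {
        enabled = false;
    }

    void enable() {
        List<EnabledCallbackSketch> toFire;
        synchronized (this) {
            enabled = true;
            toFire = new ArrayList<>(waiting);
            waiting.clear();
        }
        // Invoke callbacks outside the lock so they may call back in safely.
        for (EnabledCallbackSketch cb : toFire) {
            cb.operationComplete(0);
        }
    }
}
```

Splitting the query from the notification lets the auditor do a cheap check after a bookie drops, while only the code that actually needs to wait registers a callback.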
                

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400389#comment-13400389 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-278:
------------------------------------------------

What about the currently running replication tasks? The admin should wait for some time so that all currently initiated replications can complete.
                

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459490#comment-13459490 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-278:
------------------------------------------------

Yes, Ivan. I got your point now.

The auditor should not even get as far as markLedgerUnderreplicated once it is disabled. Otherwise it will publish many ledgers as underreplicated even though their bookies have since started correctly, because of the calls waiting at markLedgerUnderreplicated.

{quote}
It would be better for the auditor to check whether auto recovery is enabled after seeing a bookie drop, and only build the index and mark the ledgers if it is enabled.
{quote}
If it is disabled, will it just ignore the notification, or wait?
I think there might be a small race here if we silently ignore bookie failure notifications while in the disabled state.
For example:
- The admin adds the disable znode.
- Some bookies are restarted.
- Before the admin re-enables auto recovery, some restarted or fresh bookies fail. These are now real failures.
- The admin then enables it.

If we ignore the bookie failure notifications received in the disabled state, we may miss these valid notifications, which need to be processed once rereplication is enabled, right?

Do you think we have to rebuild the complete index once it is enabled? Or can we simply queue the calls and process them once it is enabled? We need to check this case.
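
One way to picture the "queue the calls and process them once enabled" option is the sketch below. It is only one possible reading of the idea, not what the patch does: DeferredFailureSketch and its methods are hypothetical, and the isStillDown predicate is an assumed liveness check used to drop bookies that were merely restarted for the upgrade.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical queue of bookie failures seen while replication is disabled. */
final class DeferredFailureSketch {
    private final Set<String> pending = new LinkedHashSet<>();
    private final List<String> processed = new ArrayList<>();
    private boolean enabled = true;

    synchronized void disable() {
        enabled = false;
    }

    /** Failure notification: handle now if enabled, otherwise remember it. */
    synchronized void onBookieLost(String bookie) {
        if (enabled) {
            processed.add(bookie);
        } else {
            pending.add(bookie);
        }
    }

    /**
     * Re-enable and replay the remembered failures, keeping only bookies
     * that are still down according to the supplied (assumed) liveness check.
     */
    synchronized void enable(Predicate<String> isStillDown) {
        enabled = true;
        for (String bookie : pending) {
            if (isStillDown.test(bookie)) {
                processed.add(bookie);
            }
        }
        pending.clear();
    }

    /** Bookies whose failure was actually acted on, in order. */
    synchronized List<String> processedFailures() {
        return new ArrayList<>(processed);
    }
}
```

In the example from this comment, a bookie that came back after its upgrade would be filtered out on enable, while one that failed for real during the disabled window would still be processed.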
                

[jira] [Assigned] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh R reassigned BOOKKEEPER-278:
-----------------------------------

    Assignee: Rakesh R
    

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458564#comment-13458564 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

Not quite. What is happening in the patch, and in the way Rakesh described, is that markLedgerUnderreplicated is blocking, but there will be a number of calls to it queued up once it is unblocked.
                

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460440#comment-13460440 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Hi Ivan, 

I've refactored the patch. Now the auditor checks the enable/disable znode after seeing a bookie drop, and only builds the index and marks the ledgers if it is enabled.

Could you please review the latest patch?

Thanks,
Rakesh
                

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447605#comment-13447605 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Thanks a lot Ivan, Uma for the suggestions.

I've prepared an initial patch to discuss the API design. Script changes are pending and have to be done by exposing a 'disable' command.
I'm thinking the 'disable' command can be written in the admin class [BOOKKEEPER-319], which is to manage the whole replica process.

Could you please review the proposed changes?
                

[jira] [Updated] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh R updated BOOKKEEPER-278:
--------------------------------

    Attachment: BOOKKEEPER-278.1.patch
    

[jira] [Updated] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated BOOKKEEPER-278:
-------------------------------------------

    Component/s: bookkeeper-auto-recovery
    

        

[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Uma Maheswara Rao G (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458537#comment-13458537 ] 

Uma Maheswara Rao G commented on BOOKKEEPER-278:
------------------------------------------------

Hi Ivan,

From Rakesh's comments above, I think it is doing what you are expecting, right?

{quote}
# When the Auditor receives any failure notification, it will get the lost bookie/ledgers and, on seeing the "disable" znode during "markLedgerUnderreplicated", add a znode watcher and enter the WAITING state.
{quote}

from the above order,
{quote}
1.The disable Znode is set.
2.A is taken down for upgrade.
{quote}
first it added the disable Znode, so, markLedgerUnderreplicated call from Auditor will check for disable Znode existence as per Rakesh comments.
So, it seems to me that, you both are in the same lines here right?
I have not looked at patch yet, So, you pointed something wrong in the patch which are leading for this comments?
@Rakesh, Ivan, Please correct me, if I understood wrongly about your comments.

                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450732#comment-13450732 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

This patch won't actually disable the rereplication; it'll just delay it. With this, the admin would disable the replication workers, but the auditor would continue to see bookies dropping off and would mark all their ledgers as underreplicated. Then, once reenabled, the replication workers would go about rereplicating everything. They probably wouldn't actually rereplicate much, because the check at the start wouldn't show many missing segments, but still, it's not a side effect we want. It would be better to disable at the auditor level.

On a side note, checking the ledgers after a rolling upgrade is a good idea, but I get the impression it wasn't intentional here.
                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400133#comment-13400133 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

@Ivan
Yeah, it's a good scenario.

I'm thinking of providing an admin command that passes the info (bookie:IP) to the /auditor znode through ZooKeeper, so that the current auditor can ignore that bookie and not consider it a failed bookie.
When the bookie starts back up, it will remove its bookie:IP from the /auditor node.
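
The idea above could be sketched roughly as follows. This is only an illustration in plain Java: the class MaintenanceRegistry and its method names are hypothetical, and an in-memory set stands in for the bookie:IP entries that would be stored under the /auditor znode.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the bookie:IP entries kept under /auditor. */
class MaintenanceRegistry {
    // Bookies the admin has announced as "down for upgrade".
    private final Set<String> underMaintenance = ConcurrentHashMap.newKeySet();

    /** Analogue of the admin command: announce a planned shutdown. */
    void markForUpgrade(String bookieAddr) {
        underMaintenance.add(bookieAddr);
    }

    /** Analogue of the bookie cleaning up its own entry when it starts back up. */
    void bookieRestarted(String bookieAddr) {
        underMaintenance.remove(bookieAddr);
    }

    /**
     * Bookies the auditor should treat as failed: known but not available,
     * excluding those announced as under maintenance.
     */
    Set<String> failedBookies(Set<String> known, Set<String> available) {
        Set<String> failed = new TreeSet<>(known);
        failed.removeAll(available);
        failed.removeAll(underMaintenance);
        return failed;
    }
}
```

With known bookies A, B, C and only B, C available, failedBookies returns an empty set while A is marked for upgrade, and reports A again once its entry has been removed and it is still missing.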
                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461682#comment-13461682 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Thanks Ivan for the review. I've split the logic accordingly. Could you please look at the latest patch?
                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.1.patch, BOOKKEEPER-278.2.patch, BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459582#comment-13459582 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Thanks Ivan and Uma for your time and responses. Could you please go through the following and let me know your opinion?

@Ivan
bq.is that the markLedgerUnderreplicated is blocking
Yup, it's a blocking call, and the latch waits indefinitely if it sees a 'disable' znode.

bq.but there will be a number call calls to it queued up once it is unblocked.
I assume you are pointing at the multiple bookie failure notifications that queue up in the 'bookieNotifications' queue.

As we know, the Auditor receives bookie failure notifications only through the getChildren() watcher. When the Auditor enters the waiting state, it is blocked in markLedgerUnderreplicated(), so the run() method will not finish until the 'enable' notification is received. Since the Auditor has registered only one getChildren() zk watcher before entering the waiting state, it will receive at most one bookie failure notification and will not see further failures (because the watcher has already fired and is not re-registered). After enabling, it fetches the available bookies anyway, recalculates the lost bookies, and continues the cycle. Am I missing anything?

It's a good scenario. I will add one more test case: "behaviour of multiple bookie failures in disable mode".

bq.It would be better for the auditor to check is auto recovery is enabled after seeing a bookie drop, and only build the index, mark the ledgers, if it is enabled.
I agree to place the disable check just before processing a bookie failure. In that case, once the Auditor has started generating the index, it will finish publishing the ledgers for that cycle. Only on the next bookie failure notification will it enter the waiting state. Does this sound good to you?
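
A minimal sketch of that ordering, assuming a simplified auditor: the check happens before any index is built, so a failure seen while disabled marks nothing. The class AuditorSketch and its members are hypothetical, the ledger index is a plain in-memory map, and for brevity the sketch skips the failure instead of blocking until re-enabled.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified auditor: checks the enabled flag before doing any work. */
class AuditorSketch {
    private volatile boolean recoveryEnabled = true;
    private final Map<String, Set<Long>> ledgersByBookie; // stand-in for the bookie-to-ledger index
    private final Set<Long> underreplicated = new TreeSet<>();

    AuditorSketch(Map<String, Set<Long>> ledgersByBookie) {
        this.ledgersByBookie = ledgersByBookie;
    }

    void setRecoveryEnabled(boolean enabled) {
        this.recoveryEnabled = enabled;
    }

    /** Returns true if the failure was processed, false if it was skipped. */
    synchronized boolean onBookieFailure(String bookie) {
        // Check first: while disabled, neither build the index nor mark ledgers.
        if (!recoveryEnabled) {
            return false;
        }
        Set<Long> lost = ledgersByBookie.getOrDefault(bookie, Collections.<Long>emptySet());
        underreplicated.addAll(lost);
        return true;
    }

    synchronized Set<Long> underreplicated() {
        return new TreeSet<>(underreplicated);
    }
}
```

Because the flag is read only at the start of onBookieFailure, a cycle that has already begun still marks all of its ledgers, which matches the behaviour described above.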

                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Rakesh R (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451787#comment-13451787 ] 

Rakesh R commented on BOOKKEEPER-278:
-------------------------------------

Hi Ivan, Thanks for the reviews.

bq.It would be better to disable at the auditor level.

I'm a bit confused by your comments. The patch does the disabling at the Auditor level as well as the RW level. Could you please give more details?

The proposed patch delays the replication processes (Auditor and RW) by making them wait on a CountDownLatch. The logic I've followed is:

# The admin calls disable, which creates the 'disable' znode under the /underreplication root node.
# When the Auditor receives a failure notification, it gets the lost bookie/ledgers, and if it sees the "disable" znode during "markLedgerUnderreplicated" it adds a znode watcher and enters the WAITING state.
# Likewise, if the RW sees the "disable" znode when it calls 'getLedgerToRereplicate', it adds a znode watcher and enters the WAITING state.

The CountDownLatch makes a blocking wait, which suspends both the Auditor and the RW processes. Once re-enabled, both the Auditor and the RW continue with the previously populated data.


Entering the waiting state:
{code}
if (null != zkc.exists(basePath + '/' + DISABLE_NODE, w)) {
  LOG.info("Automatic ledger re-replication is disabled "
       + "by Administrator!. So waiting until its enabled.");
  changedLatch.await();
}
{code}


Leaving the wait, which happens only after the "disable" node is deleted:
{code}
if (e.getType() == Watcher.Event.EventType.NodeDeleted) {
  changedLatch.countDown();
}
{code}
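
For readers who want to try the wait/release pattern outside ZooKeeper, here is a minimal self-contained sketch. It is not the patch: the class RereplicationGate and its methods are hypothetical, an in-memory latch stands in for the 'disable' znode and its NodeDeleted watcher, and a timeout is added so that a caller cannot hang forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-memory analogue of the 'disable' znode and the latch waiting on it. */
class RereplicationGate {
    // Non-null while disabled; counted down when re-enabled.
    private CountDownLatch disabledLatch = null;

    /** Analogue of creating the 'disable' znode. */
    synchronized void disable() {
        if (disabledLatch == null) {
            disabledLatch = new CountDownLatch(1);
        }
    }

    /** Analogue of deleting the 'disable' znode: releases every waiter. */
    synchronized void enable() {
        if (disabledLatch != null) {
            disabledLatch.countDown();
            disabledLatch = null;
        }
    }

    synchronized boolean isDisabled() {
        return disabledLatch != null;
    }

    /**
     * Blocks while disabled. Returns true once enabled, or false if the
     * timeout elapsed or the thread was interrupted first.
     */
    boolean awaitEnabled(long timeoutMillis) {
        CountDownLatch latch;
        synchronized (this) {
            latch = disabledLatch; // take a snapshot so enable() cannot race with the null check
        }
        if (latch == null) {
            return true;
        }
        try {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller such as the Auditor or RW would invoke awaitEnabled before doing work. Note that, as in the patch, this only delays the work; it does not discard it.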

                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>          Components: bookkeeper-auto-recovery
>    Affects Versions: 4.0.0
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>         Attachments: BOOKKEEPER-278.patch
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.


[jira] [Commented] (BOOKKEEPER-278) Ability to disable auto recovery temporarily

Posted by "Ivan Kelly (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/BOOKKEEPER-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400390#comment-13400390 ] 

Ivan Kelly commented on BOOKKEEPER-278:
---------------------------------------

The disable mechanism is there to prevent spurious rereplication from being triggered when a bookie is taken down for upgrade. Valid rereplication tasks still need to run, which is fine; we can let them continue. They may fail if the bookie being rereplicated is taken down for upgrade, but that's ok because the admin should run a health check on the cluster after upgrading all bookies anyhow.
                
> Ability to disable auto recovery temporarily
> --------------------------------------------
>
>                 Key: BOOKKEEPER-278
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-278
>             Project: Bookkeeper
>          Issue Type: Sub-task
>            Reporter: Ivan Kelly
>            Assignee: Rakesh R
>             Fix For: 4.2.0
>
>
> Administrators will need to do rolling upgrades of bookies. If auto recovery is enabled during a rolling upgrade, there will be a lot of thrashing of ledgers as they recovery gets kicked off. Therefore we need a way to temporarily disable it.
