Posted to issues@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2011/01/05 18:38:45 UTC

[jira] Created: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

If re-transition to OPENING during log replay fails, server aborts.  Instead, should just cancel region open.
-------------------------------------------------------------------------------------------------------------

                 Key: HBASE-3419
                 URL: https://issues.apache.org/jira/browse/HBASE-3419
             Project: HBase
          Issue Type: Bug
          Components: regionserver, zookeeper
    Affects Versions: 0.90.0, 0.92.0
            Reporter: Jonathan Gray
            Priority: Critical
             Fix For: 0.90.1, 0.92.0


The {{Progressable}} used on region open tickles the ZK OPENING node to prevent the master from timing out the region open operation.  Currently, if the tickle fails for any reason, it aborts the RegionServer.  However, it can be "normal" for an RS to have a region open operation aborted by the master, so we should handle this as we do in other places, by reverting the open.

We had a cluster trip over some other issue (for some reason, the tickle was not happening within 30 seconds, so the master was timing out every time).  Because of the abort on BadVersion, every single RS eventually aborted itself, taking down the cluster.
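
To illustrate the proposed behavior, here is a minimal sketch in which a failed version-checked update cancels the single region open instead of aborting the whole server. It is not the patch itself: {{VersionedNode}}, {{OpenResult}} and {{RegionOpenSketch}} are hypothetical names, and the in-memory node only stands in for the version-checked ZK OPENING node.

```java
/** Hypothetical in-memory stand-in for a ZK node whose writes are version-checked. */
final class VersionedNode {
    private int version = 0;

    /** Succeeds only if the caller's expected version matches the current one. */
    synchronized boolean compareAndBump(int expectedVersion) {
        if (expectedVersion != version) {
            return false; // analogous to a BadVersion failure
        }
        version++;
        return true;
    }

    synchronized int version() {
        return version;
    }
}

/** Hypothetical outcome of a region open attempt. */
enum OpenResult { OPENED, CANCELLED }

final class RegionOpenSketch {
    /**
     * Runs `steps` units of open work, tickling the node after each one.
     * If `masterInterferesAtStep` is in range, the node is changed behind our
     * back at that step, which simulates the master timing out the open.
     */
    static OpenResult open(VersionedNode node, int steps, int masterInterferesAtStep) {
        int expected = node.version();
        for (int step = 0; step < steps; step++) {
            if (step == masterInterferesAtStep) {
                node.compareAndBump(node.version()); // someone else modified the node
            }
            if (!node.compareAndBump(expected)) {
                // Tickle failed: give up on this region only; the server keeps running.
                return OpenResult.CANCELLED;
            }
            expected++;
        }
        return OpenResult.OPENED;
    }
}
```

With no interference the open completes; if the node changes underneath the RS, the open is reported as cancelled and nothing else is torn down.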

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3419:
---------------------------------

    Attachment: HBASE-3419-v2.patch

The v1 patch was squashed together with another patch.  v2 contains only the changes for this issue.


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977983#action_12977983 ] 

stack commented on HBASE-3419:
------------------------------

Chatting about this up on IRC: the tickle does not happen if we are skipping edits.  That's wrong.  We should tickle even if we skip edits.

Regarding making the Progressable a Chore, I'd say not exactly.  Progressable is about whether or not progress is being made.  We don't want the tickle to happen if we are stuck on HDFS.  After chatting w/ Jon: the tickle should happen not after N edits but after P milliseconds AS LONG AS we're making progress.

Also, killing the regionserver if we fail to replay recovered.edits in time is wrong.  Instead we should fail the region open.
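
A rough sketch of "tickle after P milliseconds, but only while making progress" could look like the following. The class name {{ProgressGatedTickler}} and its methods are made up for illustration and are not from the patch; the clock is injected so the behavior can be checked without sleeping.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper: fires the tickle at most once per period, and only
 * when the caller reports progress. If the caller is stuck (for example on
 * HDFS) it never calls madeProgress(), so no tickle is sent.
 */
final class ProgressGatedTickler {
    private final long periodMillis;
    private final LongSupplier clock;
    private final Runnable tickle;
    private long lastTickleMillis;

    ProgressGatedTickler(long periodMillis, LongSupplier clock, Runnable tickle) {
        this.periodMillis = periodMillis;
        this.clock = clock;
        this.tickle = tickle;
        this.lastTickleMillis = clock.getAsLong();
    }

    /** Call after every unit of work, whether the edit was applied or skipped. */
    void madeProgress() {
        long now = clock.getAsLong();
        if (now - lastTickleMillis >= periodMillis) {
            tickle.run();
            lastTickleMillis = now;
        }
    }
}
```

Because the check is driven by the replay loop rather than by a separate timer thread, a stalled replay stops tickling on its own, which is the difference from a plain Chore.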


[jira] Assigned: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray reassigned HBASE-3419:
------------------------------------

    Assignee: Jonathan Gray

Working on implementing what stack outlined above.


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978553#action_12978553 ] 

stack commented on HBASE-3419:
------------------------------

+1 on patch.

Jon says he's running it too.


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991701#comment-12991701 ] 

Todd Lipcon commented on HBASE-3419:
------------------------------------

patch doesn't apply anymore - jgray is going to check it out


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991755#comment-12991755 ] 

Todd Lipcon commented on HBASE-3419:
------------------------------------

+1 for trunk patch


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991724#comment-12991724 ] 

Todd Lipcon commented on HBASE-3419:
------------------------------------

+1, looks good to me.


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977870#action_12977870 ] 

Jonathan Gray commented on HBASE-3419:
--------------------------------------

Currently the "tickle" happens on a number-of-replayed-edits interval (skipped edits are not counted).  This is probably not the best idea since edits can be wildly different sizes (in this case, an all-increment cluster with very high numbers of small edits).

The tickle is really about time, not number of edits.  Maybe use a Chore instead, set at 1/2 the master timeout?  Or some other way of doing it based on time instead of edits?


[jira] Resolved: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray resolved HBASE-3419.
----------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.92.0)
     Hadoop Flags: [Reviewed]

Committed to branch and trunk.  Following new convention of marking against branch for fix version and listing under 0.90.1 in CHANGES.

Thanks for poking me and for the reviews, Todd!


[jira] Updated: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3419:
---------------------------------

    Attachment: HBASE-3419-v1.patch

As outlined.

Had to add a new {{CancelableProgressable}} interface because we needed to be able to tell the caller to cancel the operation.
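
As a sketch of the idea (the exact signature in the patch may differ, so treat the method shape below as an assumption), the progress callback returns a boolean that the caller checks, and the replay loop stops as soon as it is told to cancel. {{ReplaySketch}} and its -1 return convention are hypothetical.

```java
import java.util.List;

/** Assumed shape: a progress callback whose return value can ask the caller to stop. */
interface CancelableProgressable {
    /** @return true to keep going, false if the operation should be cancelled. */
    boolean progress();
}

final class ReplaySketch {
    /**
     * Replays edits, reporting progress after each one.
     *
     * @return the number of edits applied, or -1 if the reporter asked to cancel
     */
    static int replay(List<String> edits, CancelableProgressable reporter) {
        int applied = 0;
        for (String edit : edits) {
            applied++; // stand-in for actually applying the edit
            if (!reporter.progress()) {
                return -1; // caller should fail the region open, not the server
            }
        }
        return applied;
    }
}
```

A plain void-returning progress callback has no channel for this, which is why a separate interface is needed.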


[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991799#comment-12991799 ] 

Hudson commented on HBASE-3419:
-------------------------------

Integrated in HBase-TRUNK #1737 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1737/])
    HBASE-3419 If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.



[jira] Updated: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3419:
---------------------------------

    Attachment: HBASE-3419-v3.patch

Rebased on branch.


[jira] Updated: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3419:
---------------------------------

    Attachment: HBASE-3419-v3-TRUNK.patch

Rebased on trunk
