Posted to issues@hbase.apache.org by "Jonathan Gray (JIRA)" <ji...@apache.org> on 2011/01/05 18:40:47 UTC

[jira] Commented: (HBASE-3419) If re-transition to OPENING during log replay fails, server aborts. Instead, should just cancel region open.

    [ https://issues.apache.org/jira/browse/HBASE-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977870#action_12977870 ] 

Jonathan Gray commented on HBASE-3419:
--------------------------------------

Currently the "tickle" happens once per fixed number of replayed edits (skipped edits are not counted).  This is probably not the best idea, since edits can vary wildly in size (in this case, an all-increment cluster with very high numbers of small edits).

The tickle is really about time, not the number of edits.  Maybe use a Chore instead, set at 1/2 the master timeout?  Or some other way of doing it based on time instead of edits?
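
A minimal sketch of the "some other way" option, to make the idea concrete: the replay loop reports progress on every edit, but the tickle only fires once half the master timeout has elapsed. TimedTickler, its constructor arguments, and the BooleanSupplier standing in for the ZK re-transition to OPENING are all hypothetical names for illustration, not existing HBase classes or APIs.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical helper (not an HBase class): rate-limits tickles by elapsed time. */
class TimedTickler {
    private final long intervalMs;        // half of the master's region-open timeout
    private final LongSupplier clockMs;   // injectable clock, e.g. System::currentTimeMillis
    private final BooleanSupplier tickle; // stand-in for the re-transition to OPENING in ZK
    private long lastTickleMs;

    TimedTickler(long masterTimeoutMs, LongSupplier clockMs, BooleanSupplier tickle) {
        this.intervalMs = masterTimeoutMs / 2;
        this.clockMs = clockMs;
        this.tickle = tickle;
        this.lastTickleMs = clockMs.getAsLong();
    }

    /**
     * Intended to be called once per edit, whether replayed or skipped.
     * Returns false only when a tickle was attempted and failed, so the
     * caller can cancel the region open instead of aborting the server.
     */
    boolean progress() {
        long now = clockMs.getAsLong();
        if (now - lastTickleMs < intervalMs) {
            return true; // interval not yet elapsed; no ZK traffic
        }
        lastTickleMs = now;
        return tickle.getAsBoolean();
    }
}
```

Assuming a 30 second master timeout, this would tickle at most once per 15 seconds of replay, regardless of how many edits pass through or how large they are. A Chore on its own thread would also keep tickling while a single edit is slow to apply, which this inline version cannot do.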

> If re-transition to OPENING during log replay fails, server aborts.  Instead, should just cancel region open.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3419
>                 URL: https://issues.apache.org/jira/browse/HBASE-3419
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, zookeeper
>    Affects Versions: 0.90.0, 0.92.0
>            Reporter: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.90.1, 0.92.0
>
>
> The {{Progressable}} used on region open tickles the ZK OPENING node to prevent the master from timing out a region open operation.  If the tickle fails for some reason, it currently aborts the RegionServer.  However, it could be "normal" for an RS to have a region open operation aborted by the master, so this should be handled as it is in other places, by reverting the open.
> We had a cluster trip over some other issue (for some reason, the tickle was not happening in < 30 seconds, so the master was timing out every time).  Because of the abort on BadVersion, every single RS eventually aborted itself, taking down the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.