Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2018/02/26 18:40:00 UTC

[jira] [Commented] (HBASE-20087) Periodically attempt redeploy of regions in FAILED_OPEN state

    [ https://issues.apache.org/jira/browse/HBASE-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377350#comment-16377350 ] 

Andrew Purtell commented on HBASE-20087:
----------------------------------------

I attached a simple port of the RSGroups hack to the master, implemented as a chore separate from the AssignmentManager (AM).

I don't think this is the best approach though. 

Looking at the AM code, changing the default for "hbase.assignment.maximum.attempts" to Integer.MAX_VALUE would partly achieve the aim. Introducing new logic to revisit all of the assignments tracked in the failedOpenTracker when a new server comes online would take care of the rest.
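
To make the second idea concrete, here is a minimal sketch of revisiting tracked failures when a server joins. It is a toy model under stated assumptions, not the AssignmentManager code: the class ServerJoinRetrySketch, its methods, and the tryOpen callback are all hypothetical names, and the map of per-region attempt counts only stands in for the failedOpenTracker mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiPredicate;

// Hypothetical model for illustration only; none of this is HBase API.
class ServerJoinRetrySketch {
    // Region name -> number of failed open attempts recorded so far.
    private final Map<String, AtomicInteger> failedOpens = new ConcurrentHashMap<>();
    // Stand-in for the configured maximum number of assignment attempts.
    private final int maxAttempts;

    ServerJoinRetrySketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Records a failed open. Returns true while another automatic attempt is allowed.
    boolean recordFailedOpen(String region) {
        int attempts = failedOpens
            .computeIfAbsent(region, r -> new AtomicInteger())
            .incrementAndGet();
        return attempts < maxAttempts;
    }

    // Called when a server comes online: retry every tracked region against it.
    // tryOpen(region, server) is an assumed callback that returns true if the open succeeds.
    List<String> onServerAdded(String server, BiPredicate<String, String> tryOpen) {
        List<String> opened = new ArrayList<>();
        // Iterate over a copy of the keys so entries can be removed safely.
        for (String region : new ArrayList<>(failedOpens.keySet())) {
            if (tryOpen.test(region, server)) {
                failedOpens.remove(region);
                opened.add(region);
            } else {
                recordFailedOpen(region);
            }
        }
        return opened;
    }

    // Number of failed attempts currently tracked for a region (0 if none).
    int attempts(String region) {
        AtomicInteger n = failedOpens.get(region);
        return n == null ? 0 : n.get();
    }
}
```

With maxAttempts set to Integer.MAX_VALUE, recordFailedOpen effectively never gives up, which mirrors the proposed change to the default. The onServerAdded hook covers regions that had already stopped retrying before a suitable server appeared.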

> Periodically attempt redeploy of regions in FAILED_OPEN state
> -------------------------------------------------------------
>
>                 Key: HBASE-20087
>                 URL: https://issues.apache.org/jira/browse/HBASE-20087
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, Region Assignment
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 1.5.0
>
>         Attachments: 0001-W-4723090-Port-the-RIT-FAILED_OPEN-state-hack-from-R.patch
>
>
> Because RSGroups can cause permanent RIT with regions in FAILED_OPEN state, we added logic to the master portion of the RSGroups extension to enumerate RITs and retry assignment of regions in FAILED_OPEN state.
> However, this strategy can be applied generally to reduce the need for operator involvement in cluster operations. Currently an operator has to resolve FAILED_OPEN assignments manually, but there is little risk in automatically retrying them after a while. If the reason the assignment failed has not cleared, the assignment will just fail again. If it has been resolved, operators need do nothing more for the cluster to fully heal.
>  
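
The periodic retry described in the issue can be sketched as a single pass over the regions in FAILED_OPEN state. This is a simplified illustration, not the attached patch: FailedOpenRetrySketch, its State enum, and the tryAssign predicate are hypothetical stand-ins for the master's real region state tracking and assignment calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified model; none of these types are HBase classes.
class FailedOpenRetrySketch {
    enum State { OPEN, FAILED_OPEN }

    // Region name -> current state, kept in insertion order.
    private final Map<String, State> regions = new LinkedHashMap<>();
    // Assumed stand-in for an assignment attempt: returns true if the region opens.
    private final Predicate<String> tryAssign;

    FailedOpenRetrySketch(Predicate<String> tryAssign) {
        this.tryAssign = tryAssign;
    }

    void markFailedOpen(String region) {
        regions.put(region, State.FAILED_OPEN);
    }

    State stateOf(String region) {
        return regions.get(region);
    }

    // One pass of the chore: retry each FAILED_OPEN region once.
    // Regions whose failure cause has not cleared simply stay FAILED_OPEN.
    List<String> retryFailedOpen() {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, State> e : regions.entrySet()) {
            if (e.getValue() == State.FAILED_OPEN && tryAssign.test(e.getKey())) {
                e.setValue(State.OPEN);
                recovered.add(e.getKey());
            }
        }
        return recovered;
    }
}
```

A caller could run retryFailedOpen() on a timer, for example through java.util.concurrent.ScheduledExecutorService.scheduleWithFixedDelay. A failed retry changes nothing, so repeating the pass is low risk, which is the argument the description makes.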


