Posted to issues@hbase.apache.org by "Ping (JIRA)" <ji...@apache.org> on 2014/01/22 04:34:20 UTC

[jira] [Commented] (HBASE-9740) A corrupt HFile could cause endless attempts to assign the region without a chance of success

    [ https://issues.apache.org/jira/browse/HBASE-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878210#comment-13878210 ] 

Ping commented on HBASE-9740:
-----------------------------

Hi @Aditya Kishore,
   We ran into a similar scenario with a lost HFile. It blocked all balancing and table creation and could not be repaired online, which is serious in our production environment. I wrote a patch that covers every case in which a region cannot be opened, so that we avoid this problem. When I was about to file a bug for it, I found that this issue describes what I want to resolve, so I am submitting the patch here. Please review it and give suggestions.
   The solution is to keep a counter of how many times a region's assignment has failed within one assignment round. When the count goes beyond the threshold, we call regionOffline() to set the region to OFFLINE, remove it from the in-memory RIT (regions in transition), and log an error for it.
   This patch handles every scenario in which assignment always fails.
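
   To make the idea concrete, here is a minimal standalone sketch of the counting logic only. It is not the attached patch and not HBase code: the class AssignFailureTracker, its methods, and the offlineAction callback are hypothetical names for illustration, and the callback stands in for whatever the Master does in regionOffline(). It also assumes "beyond the threshold" means the region is given up on once the number of failures reaches the configured maximum.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper (not an HBase class): counts failed assignments per region. */
final class AssignFailureTracker {
    private final int maxFailures;
    // Stand-in for the Master-side regionOffline() handling; an assumption of this sketch.
    private final Consumer<String> offlineAction;
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    AssignFailureTracker(int maxFailures, Consumer<String> offlineAction) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
        this.offlineAction = offlineAction;
    }

    /**
     * Records one failed assignment attempt for the region.
     * Returns true if the threshold was reached and the region was taken offline.
     */
    boolean recordFailure(String regionName) {
        int count = failures
                .computeIfAbsent(regionName, k -> new AtomicInteger())
                .incrementAndGet();
        if (count < maxFailures) {
            return false; // still worth retrying on another region server
        }
        // Threshold reached: stop retrying, forget the counter, log one error,
        // and hand the region to the offline action.
        failures.remove(regionName);
        System.err.println("Region " + regionName + " failed to open " + count
                + " times; marking it OFFLINE");
        offlineAction.accept(regionName);
        return true;
    }

    /** Clears the counter once the region opens successfully. */
    void recordSuccess(String regionName) {
        failures.remove(regionName);
    }
}
```

   With maxFailures set to 3, the first two calls to recordFailure("r1") return false and the third returns true after invoking the callback once, so the retry loop ends instead of cycling through the region servers forever.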

> A corrupt HFile could cause endless attempts to assign the region without a chance of success
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9740
>                 URL: https://issues.apache.org/jira/browse/HBASE-9740
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Aditya Kishore
>            Assignee: Aditya Kishore
>
> As described in HBASE-9737, a corrupt HFile in a region could lead to an assignment storm in the cluster since the Master will keep trying to assign the region to each region server one after another and obviously none will succeed.
> The region server, upon detecting such a scenario, should mark the region as "RS_ZK_REGION_FAILED_ERROR" (or something to that effect) in ZooKeeper, which should signal the Master to stop assigning the region until the error has been resolved (via an HBase shell command, probably "assign"?).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)