Posted to issues@hbase.apache.org by "Jimmy Xiang (JIRA)" <ji...@apache.org> on 2013/09/29 23:26:25 UTC
[jira] [Comment Edited] (HBASE-9514) Prevent region from assigning before log splitting is done

    [ https://issues.apache.org/jira/browse/HBASE-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781502#comment-13781502 ] 

Jimmy Xiang edited comment on HBASE-9514 at 9/29/13 9:25 PM:
-------------------------------------------------------------

Here is the list of changes:
1. Fixed a bug in AM#assign (line ~2645): when bulk assign fails, each region should be assigned again; otherwise, the regions will be stuck in transition.
2. Fixed a bug in AM#unassign (line ~2461): if the region is offline, assign it again (moved to the finally block, so all scenarios are covered).
3. In RegionStates, if the last hosting region server is online, get the server's info to confirm it has the expected start code (may be too conservative; I haven't seen this case in my tests yet).
4. In AM, when forcing a region state to offline, if a new plan is forced, check meta to make sure the last assignment has not changed (may be too conservative; I haven't seen this case in my tests yet).
5. Enhanced bulk assign a little so that if a region is already assigned, there is no need to force assign it.
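
To illustrate change 1, here is a minimal sketch of the fallback idea, not the actual AssignmentManager code. The class name, the method name, and the use of plain strings for regions are all made up for illustration; the two predicates stand in for the real bulk and single-region assign calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names exist in HBase. */
final class BulkAssignFallbackSketch {

    /**
     * Tries one bulk assignment first. If the bulk call reports failure,
     * every region is assigned again one by one so that none is left
     * stuck in transition. Returns the regions that still failed.
     */
    static List<String> assignAll(List<String> regions,
                                  Predicate<List<String>> bulkAssign,
                                  Predicate<String> singleAssign) {
        List<String> stillFailed = new ArrayList<>();
        if (regions.isEmpty() || bulkAssign.test(regions)) {
            return stillFailed; // nothing to do, or the bulk assign succeeded
        }
        // Bulk assign failed: fall back to assigning each region again.
        for (String region : regions) {
            if (!singleAssign.test(region)) {
                stillFailed.add(region);
            }
        }
        return stillFailed;
    }

    public static void main(String[] args) {
        List<String> regions = Arrays.asList("r1", "r2", "r3");
        // Simulate a failed bulk assign and a per-region assign that fails for r2 only.
        List<String> stuck = assignAll(regions, batch -> false, r -> !r.equals("r2"));
        System.out.println("still unassigned: " + stuck);
    }
}
```

The point of the sketch is only that a failed bulk call must not be the end of the story for the regions in that batch; how the real code retries or reports the remaining failures is not shown here.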

I have a new patch in testing now (v5.1 attached). The new patch has the following changes:
1. Added a CM (Chaos Monkey) action to log the cluster status every 90 seconds, so we know the details of regions in transition.
2. Added an hbck check after a verification failure, so we know whether the cluster is consistent, i.e., whether any region is lost/unassigned.
3. Added another verify with CM disabled after a verification failure, so we know whether we really have data loss.

It seems that there is no data loss now, since the verify in 3. passes while the test still fails.
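
The way I read the results of the two extra checks can be sketched roughly as below. This is only an illustration of the reasoning, assuming three boolean outcomes; the class, the enum, and its category names are made up and are not part of the patch.

```java
/** Hypothetical sketch of how the extra checks are interpreted; not code from the patch. */
final class VerificationDiagnosisSketch {

    enum Diagnosis {
        PASSED,                  // the original verification succeeded
        DATA_LOSS_SUSPECTED,     // the re-verify with CM disabled also failed
        NO_DATA_LOSS_INCONSISTENT, // data is there, but hbck reports a problem
        NO_DATA_LOSS_CONSISTENT  // data is there and hbck is clean
    }

    /**
     * @param verifyOk            result of the original verification
     * @param hbckConsistent      result of the hbck check run after a failure
     * @param reverifyWithoutCmOk result of the second verify with CM disabled
     */
    static Diagnosis diagnose(boolean verifyOk,
                              boolean hbckConsistent,
                              boolean reverifyWithoutCmOk) {
        if (verifyOk) {
            return Diagnosis.PASSED;
        }
        if (!reverifyWithoutCmOk) {
            return Diagnosis.DATA_LOSS_SUSPECTED;
        }
        return hbckConsistent
            ? Diagnosis.NO_DATA_LOSS_CONSISTENT
            : Diagnosis.NO_DATA_LOSS_INCONSISTENT;
    }

    public static void main(String[] args) {
        // The situation described above: the test fails, but the re-verify passes.
        System.out.println(diagnose(false, true, true));
    }
}
```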


was (Author: jxiang):
Here is the list of changes:
1. Fixed a bug in AM#assign (line ~2645): when bulk assign fails, each region should be assigned again; otherwise, the regions will be stuck in transition.
2. Fixed a bug in AM#unassign (line ~2461): if the region is offline, assign it again (moved to the finally block, so all scenarios are covered).
3. In RegionStates, if the last hosting region server is online, get the server's info to confirm it has the expected start code (may be too conservative; I haven't seen this case in my tests yet).
4. In AM, when forcing a region state to offline, if a new plan is forced, check meta to make sure the last assignment has not changed (may be too conservative; I haven't seen this case in my tests yet).
5. Enhanced bulk assign a little so that if a region is already assigned, there is no need to force assign it.

I have a new patch in testing now. The new patch has the following changes:
1. Added a CM (Chaos Monkey) action to log the cluster status every 90 seconds, so we know the details of regions in transition.
2. Added an hbck check after a verification failure, so we know whether the cluster is consistent, i.e., whether any region is lost/unassigned.
3. Added another verify with CM disabled after a verification failure, so we know whether we really have data loss.

It seems that there is no data loss now, since the verify in 3. passes while the test still fails.

> Prevent region from assigning before log splitting is done
> ----------------------------------------------------------
>
>                 Key: HBASE-9514
>                 URL: https://issues.apache.org/jira/browse/HBASE-9514
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>            Priority: Blocker
>             Fix For: 0.96.0
>
>         Attachments: trunk-9514_v1.patch, trunk-9514_v2.patch, trunk-9514_v3.patch, trunk-9514_v5.1.patch, trunk-9514_v5.patch
>
>
> If a region is assigned before log splitting is done by the server shutdown handler, the edits belonging to this region in the hlogs of the dead server will be lost.
> Generally this is not an issue if users don't assign/unassign a region from hbase shell or via hbase admin. These commands are marked for experts only in the hbase shell help too.  However, chaos monkey doesn't care.
> If we can prevent such regions from being assigned at a bad time, it would make things a little safer.
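
The guard described in the issue can be pictured with a small sketch like the one below. It is an illustration under simplifying assumptions, not the patch itself: the class and method names are made up, regions and servers are plain strings, and the server strings only imitate the hostname,port,startcode form of an HBase server name.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of an "assign only after log splitting" guard. */
final class LogSplitGuardSketch {

    // Last server known to host each region.
    private final Map<String, String> lastHost = new ConcurrentHashMap<>();
    // Dead servers whose logs have not been fully split yet.
    private final Set<String> splitPending = ConcurrentHashMap.newKeySet();

    void recordAssignment(String region, String server) {
        lastHost.put(region, server);
    }

    /** Called when a server is detected dead, before its logs are split. */
    void serverDied(String server) {
        splitPending.add(server);
    }

    /** Called once log splitting for the dead server has finished. */
    void logSplitDone(String server) {
        splitPending.remove(server);
    }

    /** A region may be assigned only if its last host has no pending log split. */
    boolean canAssign(String region) {
        String server = lastHost.get(region);
        return server == null || !splitPending.contains(server);
    }

    public static void main(String[] args) {
        LogSplitGuardSketch guard = new LogSplitGuardSketch();
        String rs = "rs1.example.org,60020,1380000000000"; // illustrative value only
        guard.recordAssignment("region-a", rs);
        guard.serverDied(rs);
        System.out.println("assignable during split: " + guard.canAssign("region-a"));
        guard.logSplitDone(rs);
        System.out.println("assignable after split: " + guard.canAssign("region-a"));
    }
}
```

Because the start code is part of the server string here, a restarted server on the same host and port counts as a different server, which is the same concern behind confirming the expected start code.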


