Posted to user@hbase.apache.org by bijieshan <bi...@huawei.com> on 2011/05/20 09:33:42 UTC

ROOT region appeared in two regionservers' onlineRegions at the same time

This can happen, with low probability, under the following sequence of steps
(assume the cluster nodes are named RS1/RS2/HM, and there are more than 10,000 regions in the cluster):

1. The ROOT region was opened on RS1.
2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
3. The ServerShutdownHandler process started.
4. HMaster was restarted; during finishInitialization, the ROOT region location was unset and the region was assigned to RS2.
5. The ROOT region was opened successfully on RS2.
6. A while later, the ROOT region location was unset again by RS1's ServerShutdownHandler, and the region was reassigned. RS1 had been restarted before that point, so there are two possibilities:
 Case a:
   The ROOT region was assigned to RS1.
   Nothing seemed to be affected, but the ROOT region was still online on RS2.

 Case b:
   The ROOT region was assigned to RS2.
   The ROOT region could not be opened until it was reassigned to another regionserver, because it already showed as online on this regionserver.

This can be seen in the logs:

1. The ROOT region was opened twice:
2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212

2. Regionserver 162-2-16-6 aborted, so the region was reassigned to 162-2-77-0, but it was already online on that server:
10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server

This can leave the ROOT region offline for a long time, even though it only happens in a special scenario. I have checked the code, and it looks like a small bug.

Thanks!

Regards,
Jieshan Bean 




 

Re: ROOT region appeared in two regionservers' onlineRegions at the same time

Posted by Stack <st...@duboce.net>.
That sounds reasonable, Jieshan.  Would you mind filing an issue
referring to this mail thread?  If you have a patch, that'd be
excellent.
St.Ack

2011/5/23 bijieshan <bi...@huawei.com>:
> There are 2 callers of assignRoot():
>
> 1.
> HMaster# assignRootAndMeta:
>
>    if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>      this.assignmentManager.assignRoot();
>      this.catalogTracker.waitForRoot();
>      assigned++;
>    }
>
> 2.
> ServerShutdownHandler# process:
>
>      if (isCarryingRoot()) { // -ROOT-
>        try {
>           this.services.getAssignmentManager().assignRoot();
>        } catch (KeeperException e) {
>           this.server.abort("In server shutdown processing, assigning root", e);
>           throw new IOException("Aborting", e);
>        }
>      }
>
> I think every time we call assignRoot(), we should verify the ROOT region's location first, because before this assignment the ROOT region could already have been assigned from somewhere else.
> Looking forward to any replies.
>
> Thanks!
>
> Regards,
> Jieshan Bean
>
>
> -----Original Message-----
> From: bijieshan [mailto:bijieshan@huawei.com]
> Sent: 20 May 2011 15:34
> To: user@hbase.apache.org
> Cc: Chenjian
> Subject: ROOT region appeared in two regionservers' onlineRegions at the same time
>
> This can happen, with low probability, under the following sequence of steps
> (assume the cluster nodes are named RS1/RS2/HM, and there are more than 10,000 regions in the cluster):
>
> 1. The ROOT region was opened on RS1.
> 2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
> 3. The ServerShutdownHandler process started.
> 4. HMaster was restarted; during finishInitialization, the ROOT region location was unset and the region was assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. A while later, the ROOT region location was unset again by RS1's ServerShutdownHandler, and the region was reassigned. RS1 had been restarted before that point, so there are two possibilities:
>  Case a:
>   The ROOT region was assigned to RS1.
>   Nothing seemed to be affected, but the ROOT region was still online on RS2.
>
>  Case b:
>   The ROOT region was assigned to RS2.
>   The ROOT region could not be opened until it was reassigned to another regionserver, because it already showed as online on this regionserver.
>
> This can be seen in the logs:
>
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
>
> 2. Regionserver 162-2-16-6 aborted, so the region was reassigned to 162-2-77-0, but it was already online on that server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
> 10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server
>
> This can leave the ROOT region offline for a long time, even though it only happens in a special scenario. I have checked the code, and it looks like a small bug.
>
> Thanks!
>
> Regards,
> Jieshan Bean
>
>
>
>
>
>

Re: ROOT region appeared in two regionservers' onlineRegions at the same time

Posted by bijieshan <bi...@huawei.com>.
There are 2 callers of assignRoot():

1.
HMaster# assignRootAndMeta:

    if (!catalogTracker.verifyRootRegionLocation(timeout)) {
      this.assignmentManager.assignRoot();
      this.catalogTracker.waitForRoot();
      assigned++;
    }

2.
ServerShutdownHandler# process: 
    
      if (isCarryingRoot()) { // -ROOT-      
        try {        
           this.services.getAssignmentManager().assignRoot();
        } catch (KeeperException e) {
           this.server.abort("In server shutdown processing, assigning root", e);
           throw new IOException("Aborting", e);
        }
      }    

I think every time we call assignRoot(), we should verify the ROOT region's location first, because before this assignment the ROOT region could already have been assigned from somewhere else.
Looking forward to any replies.
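
For example, the check in ServerShutdownHandler#process could look roughly like the sketch below. This is only a sketch of the idea, not a tested patch: I'm assuming the handler can reach the CatalogTracker (here via this.server.getCatalogTracker()), and "timeout" is just a placeholder for whatever verification timeout is appropriate.

      if (isCarryingRoot()) { // -ROOT-
        try {
          // Only force a new assignment when the current ROOT location
          // cannot be verified; it may already have been reassigned
          // elsewhere (e.g. by the master during finishInitialization).
          if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
            this.services.getAssignmentManager().assignRoot();
          }
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted verifying -ROOT- location", ie);
        } catch (KeeperException e) {
          this.server.abort("In server shutdown processing, assigning root", e);
          throw new IOException("Aborting", e);
        }
      }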

Thanks!

Regards,
Jieshan Bean


-----Original Message-----
From: bijieshan [mailto:bijieshan@huawei.com]
Sent: 20 May 2011 15:34
To: user@hbase.apache.org
Cc: Chenjian
Subject: ROOT region appeared in two regionservers' onlineRegions at the same time

This can happen, with low probability, under the following sequence of steps
(assume the cluster nodes are named RS1/RS2/HM, and there are more than 10,000 regions in the cluster):

1. The ROOT region was opened on RS1.
2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
3. The ServerShutdownHandler process started.
4. HMaster was restarted; during finishInitialization, the ROOT region location was unset and the region was assigned to RS2.
5. The ROOT region was opened successfully on RS2.
6. A while later, the ROOT region location was unset again by RS1's ServerShutdownHandler, and the region was reassigned. RS1 had been restarted before that point, so there are two possibilities:
 Case a:
   The ROOT region was assigned to RS1.
   Nothing seemed to be affected, but the ROOT region was still online on RS2.

 Case b:
   The ROOT region was assigned to RS2.
   The ROOT region could not be opened until it was reassigned to another regionserver, because it already showed as online on this regionserver.

This can be seen in the logs:

1. The ROOT region was opened twice:
2011-05-17 10:32:59,188 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
2011-05-17 10:33:01,536 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212

2. Regionserver 162-2-16-6 aborted, so the region was reassigned to 162-2-77-0, but it was already online on that server:
10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: -ROOT-,,0.70236052
10:49:30,920 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of -ROOT-,,0.70236052
10:49:30,920 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of -ROOT-,,0.70236052 but already online on this server

This can leave the ROOT region offline for a long time, even though it only happens in a special scenario. I have checked the code, and it looks like a small bug.

Thanks!

Regards,
Jieshan Bean 




 

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
Hi, Ted Yu:
Thanks for your reply.
The logs are indeed from the same regionserver. That was careless of me, sorry; the first log entry is for a different region.
There is no doubt about the comment in OpenRegionHandler#openRegion. When the open fails, the Master will assign the region again, but the ZK node is in an unexpected state,
so the re-assignment will not succeed.
I'm pasting the logs again:

Here are the logs from one regionserver:
2011-05-20 15:49:19,503 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
java.io.IOException: Exception occured while connecting to the server
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
        at $Proxy6.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

2011-05-20 15:49:19,503 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51

Here are the Master logs:
------The TimeoutMonitor found the timed-out region and reassigned it, but the reassignment failed every time.
2011-05-20 16:18:27,728 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51. state=PENDING_OPEN, ts=1305879327726
2011-05-20 16:18:27,728 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:18:27,728 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51. state=PENDING_OPEN, ts=1305879327726
2011-05-20 16:18:27,728 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51. so generated a random one; hri=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers
2011-05-20 16:18:27,728 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51. so generated a random one; hri=ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers
2011-05-20 16:18:27,728 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51. to 157-5-111-12,20020,1305877626108

Regards,
Jieshan Bean


Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by Ted Yu <yu...@gmail.com>.
Hi, Jieshan:
You pasted logs from two region servers, right?

>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.
regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer
exists, encodedName=d7555a12586e6c788ca55017224b5a51

Was d7555a12586e6c788ca55017224b5a51 the same region as the one in the first log
snippet where the IOE occurred?

In the IOE catch block of OpenRegionHandler#openRegion, I see:
      // We failed open.  Let our znode expire in regions-in-transition and
      // Master will assign elsewhere.  Presumes nothing to close.

Did the node d7555a12586e6c788ca55017224b5a51 never expire?

Thanks

On Thu, May 26, 2011 at 12:15 AM, bijieshan <bi...@huawei.com> wrote:

> As a result, the region could not be opened anymore, because it had fallen into
> a loop of open attempts that failed every time. The Balancer would skip its run
> because the region was still in RIT, so the regions looked unbalanced
> across the regionservers.
>
> I describe the problem step by step as follows:
>
> 1. HMaster sent an open-region message to RS1.
> 2. RS1 received the message and started to open the region. Before opening,
> it updated the ZK node state from offline to opening.
> 3. An IOException happened during openRegion, so the open failed.
> 4. The ZK node state was still opening.
> 5. The HMaster TimeoutMonitor found the region-open timeout, so it sent the
> open message again, possibly to RS2.
> 6. RS2 executed the open; while updating the ZK node state, it found an
> unexpected state, so it failed again.
> 7. Steps 5 to 6 repeat in a loop.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>
>  /************************************************************************/
>      /*********IOException happened, region is
> null***************************/
>
>  /************************************************************************/
>      region = openRegion();
>
>  /************************************************************************/
>      /*********(region == null) is true, so return
> directly*******************/
>
>  /************************************************************************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>
>  /************************************************************************/
>      /*********IOException happened here..
> ***********************************/
>
>  /************************************************************************/
>        region = HRegion.openHRegion(this.regionInfo,
> this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at
> com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempting to transition node
> d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> transition from OFFLINE to OPENING for
> region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was
> hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
I have filed an issue, and I'll submit a patch soon (I still need to do some testing on it).
Issue Address: https://issues.apache.org/jira/browse/HBASE-3937

It is indeed related to HBASE-3789.
I'm still looking into this issue.
I'll add any further discussion to the JIRA comments.

Thanks for looking into this problem, Stack.

Jieshan Bean

--------------

Thanks for digging in, Jean.  Your diagnosis below looks right to me --
the bit about the master trying to reset the znode to OFFLINE before reassigning.  It
will help if a regionserver has set it OPENING in the meantime.  How
do you propose to handle the case where we fail setting it to OFFLINE
because RS1 has already set it OPENING?  Will you just drop the
transaction?  Mind filing an issue and, if possible, a patch?  You
might want to check out J-D's work in this area too: HBASE-3789 [1] (In
particular, the comment where he describes his fix: [2]).  There is
some overlap, but I do not think he has addressed what you see
below.

Yours,
St.Ack

1. https://issues.apache.org/jira/browse/HBASE-3789
2. https://issues.apache.org/jira/browse/HBASE-3789?focusedCommentId=13039368&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13039368

On Sun, May 29, 2011 at 6:48 PM, bijieshan <bi...@huawei.com> wrote:
> After tracing through the logs and the code, I found the problem.
> Maybe I didn't describe the problem correctly before; the title is also confusing.
>
> Let me try again to describe the scenario that creates the problem:
>
> 1. HMaster assigned region A to RS1, so the RegionState was set to PENDING_OPEN.
> 2. Because there were too many open requests, the open processing on RS1 was blocked.
> 3. Some time later, the TimeoutMonitor found that the assignment of A had timed out. Because the RegionState was PENDING_OPEN, it went into the following handler code (which just puts the region into a waiting-to-assign set):
>
>   case PENDING_OPEN:
>      LOG.info("Region has been PENDING_OPEN for too " +
>          "long, reassigning region=" +
>          regionInfo.getRegionNameAsString());
>      assigns.put(regionState.getRegion(), Boolean.TRUE);
>      break;
> So we can see that in this case we assume the ZK node state is OFFLINE. In normal processing that is indeed fine.
>
> 4. But before the real assignment happened, RS1 processed its pending requests, which interfered with the new assignment because it updated the ZK node state from OFFLINE to OPENING.
>
> 5. The new assignment started and sent the region to open on RS2. While opening, RS2 tried to transition the ZK node state from OFFLINE to OPENING, but because the current state was already OPENING, the operation failed.
> So the region could never be opened successfully.
>
> So I think, to avoid this problem, in the PENDING_OPEN case of the TimeoutMonitor we should transition the ZK node state to OFFLINE first.
>
> Thanks!
>
> Jieshan Bean
>
> ------------------------
>
> Hi,
> During that time, too many regions were being assigned.
> I have read the related code, but the problem still has me scratching my head. The fact is that the region could not open because the ZK state was not the expected one.
>
> 2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>
> So the question is: what condition could cause these inconsistent states?
>
> Here is a segment of the HMaster logs around that time (there are many log entries like this):
>
> 15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1. to 157-5-111-13,20020,1305877624933
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=157-5-111-14,20020,1305877627727, region=5910a81f573f8e9e255db473e9407ab4
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. state=PENDING_OPEN, ts=1305877600490
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers
>
> Regards,
> Jieshan Bean
>
>
>
> --------------
>
> I was asking about what was going on in the master during that time, I
> really would like to see it. It should be some time after that
> exception:
>
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
>
> About resetting the znode, as you can see in TimeoutMonitor we don't
> really care if it was reset or not as it should take care of doing it.
> The issue here is getting at the root of the problem.
>
> J-D
>
>

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by Stack <st...@duboce.net>.
Thanks for digging in, Jean.  Your diagnosis below looks right to me --
the bit about the master trying to reset the znode to OFFLINE before reassigning.  It
will help if a regionserver has set it OPENING in the meantime.  How
do you propose to handle the case where we fail setting it to OFFLINE
because RS1 has already set it OPENING?  Will you just drop the
transaction?  Mind filing an issue and, if possible, a patch?  You
might want to check out J-D's work in this area too: HBASE-3789 [1] (In
particular, the comment where he describes his fix: [2]).  There is
some overlap, but I do not think he has addressed what you see
below.

Yours,
St.Ack

1. https://issues.apache.org/jira/browse/HBASE-3789
2. https://issues.apache.org/jira/browse/HBASE-3789?focusedCommentId=13039368&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13039368

On Sun, May 29, 2011 at 6:48 PM, bijieshan <bi...@huawei.com> wrote:
> After tracing through the logs and the code, I found the problem.
> Maybe I didn't describe the problem correctly before; the title is also confusing.
>
> Let me try again to describe the scenario that creates the problem:
>
> 1. HMaster assigned region A to RS1, so the RegionState was set to PENDING_OPEN.
> 2. Because there were too many open requests, the open processing on RS1 was blocked.
> 3. Some time later, the TimeoutMonitor found that the assignment of A had timed out. Because the RegionState was PENDING_OPEN, it went into the following handler code (which just puts the region into a waiting-to-assign set):
>
>   case PENDING_OPEN:
>      LOG.info("Region has been PENDING_OPEN for too " +
>          "long, reassigning region=" +
>          regionInfo.getRegionNameAsString());
>      assigns.put(regionState.getRegion(), Boolean.TRUE);
>      break;
> So we can see that in this case we assume the ZK node state is OFFLINE. In normal processing that is indeed fine.
>
> 4. But before the real assignment happened, RS1 processed its pending requests, which interfered with the new assignment because it updated the ZK node state from OFFLINE to OPENING.
>
> 5. The new assignment started and sent the region to open on RS2. While opening, RS2 tried to transition the ZK node state from OFFLINE to OPENING, but because the current state was already OPENING, the operation failed.
> So the region could never be opened successfully.
>
> So I think, to avoid this problem, in the PENDING_OPEN case of the TimeoutMonitor we should transition the ZK node state to OFFLINE first.
>
> Thanks!
>
> Jieshan Bean
>
> ------------------------
>
> Hi,
> During that time, too many regions were being assigned.
> I have read the related code, but the problem still has me scratching my head. The fact is that the region could not open because the ZK state was not the expected one.
>
> 2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>
> So the question is: what condition could cause these inconsistent states?
>
> Here is a segment of the HMaster logs around that time (there are many log entries like this):
>
> 15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1. to 157-5-111-13,20020,1305877624933
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=157-5-111-14,20020,1305877627727, region=5910a81f573f8e9e255db473e9407ab4
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. state=PENDING_OPEN, ts=1305877600490
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
> 2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers
>
> Regards,
> Jieshan Bean
>
>
>
> --------------
>
> I was asking about what was going on in the master during that time, I
> really would like to see it. It should be some time after that
> exception:
>
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
>
> About resetting the znode, as you can see in TimeoutMonitor we don't
> really care if it was reset or not as it should take care of doing it.
> The issue here is getting at the root of the problem.
>
> J-D
>
>

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
After tracing through the logs and the code, I found the problem.
Maybe I didn't describe the problem correctly before; the title is also confusing.

Let me try again to describe the scenario that creates the problem:

1. HMaster assigned region A to RS1, so the RegionState was set to PENDING_OPEN.
2. Because there were too many open requests, the open processing on RS1 was blocked.
3. Some time later, the TimeoutMonitor found that the assignment of A had timed out. Because the RegionState was PENDING_OPEN, it went into the following handler code (which just puts the region into a waiting-to-assign set):

   case PENDING_OPEN:
      LOG.info("Region has been PENDING_OPEN for too " +
          "long, reassigning region=" +
          regionInfo.getRegionNameAsString());
      assigns.put(regionState.getRegion(), Boolean.TRUE);
      break; 
So we can see that in this case we assume the ZK node state is OFFLINE. In normal processing that is indeed fine.

4. But before the real assignment happened, RS1 processed its pending requests, which interfered with the new assignment because it updated the ZK node state from OFFLINE to OPENING.

5. The new assignment started and sent the region to open on RS2. While opening, RS2 tried to transition the ZK node state from OFFLINE to OPENING, but because the current state was already OPENING, the operation failed.
So the region could never be opened successfully.

So I think, to avoid this problem, in the PENDING_OPEN case of the TimeoutMonitor we should transition the ZK node state to OFFLINE first.
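
Roughly, I mean something like the following. It is only a rough, untested sketch of the idea: I'm assuming ZKAssign.createOrForceNodeOffline is an appropriate helper for forcing the node back to OFFLINE, and I'm reusing the names (watcher, master, regionInfo, regionState, assigns) from the surrounding TimeoutMonitor code.

   case PENDING_OPEN:
      LOG.info("Region has been PENDING_OPEN for too " +
          "long, reassigning region=" +
          regionInfo.getRegionNameAsString());
      // Force the unassigned znode back to OFFLINE before queuing the
      // reassignment, so a late OPENING set by the old regionserver
      // does not block the new open.
      try {
        ZKAssign.createOrForceNodeOffline(watcher, regionInfo,
          master.getServerName());
      } catch (KeeperException ke) {
        LOG.error("Unexpected ZK exception forcing " +
          regionInfo.getEncodedName() + " back to OFFLINE", ke);
        break;
      }
      assigns.put(regionState.getRegion(), Boolean.TRUE);
      break;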

Thanks!

Jieshan Bean 

------------------------

Hi,
During that time, too many regions were being assigned.
I have read the related code, but the problem still has me scratching my head. The fact is that the region could not open because the ZK state was not the expected one.

2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161

So the question is: what condition could cause these inconsistent states?

Here is a segment of the HMaster logs around that time (there are many log entries like this):

15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1. to 157-5-111-13,20020,1305877624933
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=157-5-111-14,20020,1305877627727, region=5910a81f573f8e9e255db473e9407ab4
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. state=PENDING_OPEN, ts=1305877600490
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers

Regards,
Jieshan Bean



--------------

I was asking about what was going on in the master during that time, I
really would like to see it. It should be some time after that
exception:

2011-05-20 15:49:48,122 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.

About resetting the znode, as you can see in TimeoutMonitor we don't
really care if it was reset or not as it should take care of doing it.
The issue here is getting at the root of the problem.

J-D


Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
Hi,
During that time, too many regions were being assigned.
I have read the related code, but the problem still has me scratching my head. The fact is that the region could not open because the ZK state was not the expected one.

2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161

So the question is: what condition could cause these inconsistent states?

Here is a segment of the HMaster logs around that time (there are many log entries like this):

15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1. to 157-5-111-13,20020,1305877624933
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=157-5-111-14,20020,1305877627727, region=5910a81f573f8e9e255db473e9407ab4
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. state=PENDING_OPEN, ts=1305877600490
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
2011-05-20 15:49:47,867 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb., src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null) available servers

Regards,
Jieshan Bean



--------------

I was asking about what was going on in the master during that time, I
really would like to see it. It should be some time after that
exception:

2011-05-20 15:49:48,122 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.

About resetting the znode, as you can see in TimeoutMonitor we don't
really care if it was reset or not as it should take care of doing it.
The issue here is getting at the root of the problem.

J-D


Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I was asking about what was going on in the master during that time, I
really would like to see it. It should be some time after that
exception:

2011-05-20 15:49:48,122 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.

About resetting the znode, as you can see in TimeoutMonitor we don't
really care if it was reset or not as it should take care of doing it.
The issue here is getting at the root of the problem.

J-D

On Thu, May 26, 2011 at 6:37 PM, bijieshan <bi...@huawei.com> wrote:
> Thanks J-D.
> When a region times out in OPENING, the TimeoutMonitor will find it and assign it again. There's no doubt about that.
> But the handling of the ZK node state goes wrong. To put it another way: when an IOE happens in openRegion, the ZK node state should be returned to the offline state, but the current code seems to do nothing and just returns directly.
> So the problem is not in the TimeoutMonitor handling process.
>
> Maybe that's not correct; please share your viewpoint on it, thanks.
>
> Please see my comments on the code:
>
> OpenRegionHandler#process:
>
> /*****Before the opening, transition the ZK node state from offline to opening first***********/
> /*****So if the open later fails, this node state should be reset to offline*****************/
> if (!transitionZookeeperOfflineToOpening(encodedName)) {
>      LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>      return;
> }
> /******The comment seems correct, but the failure is in fact not handled**********************/
> // Open region.  After a successful open, failures in subsequent
> // processing needs to do a close as part of cleanup.
> /*********when an IOE happens in openRegion(), region is null************/
>      region = openRegion();
> /*********So it returns here without resetting the ZK state************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> Jieshan Bean
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
> Sent: 27 May 2011 1:02
> To: user@hbase.apache.org
> Subject: Re: HRegion.openHRegion IOException caused an endless loop of opening--opening failed
>
> What's going on in the master? When a region times out in OPENING this
> is what the TimeoutMonitor does:
>
>                  // Attempt to transition node into OFFLINE
>                  try {
>                    data = new RegionTransitionData(
>                      EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
>                      master.getServerName());
>                    if (ZKUtil.setData(watcher, node, data.getBytes(),
>                        stat.getVersion())) {
>                      // Node is now OFFLINE, let's trigger another assignment
>                      ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
>                      LOG.info("Successfully transitioned region=" +
>                          regionInfo.getRegionNameAsString() + " into OFFLINE" +
>                          " and forcing a new assignment");
>                      assigns.put(regionState.getRegion(), Boolean.TRUE);
>                    }
>
> So the node should be moved into OFFLINE, looks like it's not happening?
>
> J-D
>
> On Thu, May 26, 2011 at 12:15 AM, bijieshan <bi...@huawei.com> wrote:
>> As a result, the region could not be opened anymore, because it had fallen into a loop of open attempts that failed every time. The Balancer would skip its run because the region was still in RIT, so the regions looked unbalanced across the regionservers.
>>
>> I describe the problem step by step as follows:
>>
>> 1. HMaster sent an open-region message to RS1.
>> 2. RS1 received the message and started to open the region. Before opening, it updated the ZK node state from offline to opening.
>> 3. An IOException happened during openRegion, so the open failed.
>> 4. The ZK node state was still opening.
>> 5. The HMaster TimeoutMonitor found the region-open timeout, so it sent the open message again, possibly to RS2.
>> 6. RS2 executed the open; while updating the ZK node state, it found an unexpected state, so it failed again.
>> 7. Steps 5 to 6 repeat in a loop.
>>
>> And from the code:
>>
>> OpenRegionHandler#process
>>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>>          encodedName);
>>        return;
>>      }
>>      /************************************************************************/
>>      /*********IOException happened, region is null***************************/
>>      /************************************************************************/
>>      region = openRegion();
>>      /************************************************************************/
>>      /*********(region == null) is true, so return directly*******************/
>>      /************************************************************************/
>>      if (region == null) return;
>>      boolean failed = true;
>>      if (tickleOpening("post_region_open")) {
>>        if (updateMeta(region)) failed = false;
>>      }
>>
>> OpenRegionHandler#openRegion
>>    HRegion region = null;
>>    try {
>>      /************************************************************************/
>>      /*********IOException happened here.. ***********************************/
>>      /************************************************************************/
>>        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>>        new CancelableProgressable() {
>>          public boolean progress() {
>>            return tickleOpening("open_region_progress");
>>          }
>>        });
>>    } catch (IOException e) {
>>      LOG.error("Failed open of region=" +
>>        this.regionInfo.getRegionNameAsString(), e);
>>    }
>>
>>    return region;
>>
>> Here's the logs:
>> 2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
>> java.io.IOException: Exception occured while connecting to the server
>>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>>        at $Proxy6.getFileInfo(Unknown Source)
>>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>>        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>>        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>        at java.lang.Thread.run(Thread.java:662)
>> 2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
>> 2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>>
>

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
Thanks J-D.
When a region times out in OPENING, the TimeoutMonitor will find it and assign it again. There's no doubt about that.
But the handling of the ZK node state goes wrong. To put it another way: when an IOE happens in openRegion, the ZK node state should be returned to the offline state, but the current code seems to do nothing and just returns directly.
So the problem is not in the TimeoutMonitor handling process.

Maybe that's not correct; please share your viewpoint on it, thanks.

Please see my comments on the code:

OpenRegionHandler#process:

/*****Before the opening, transition the ZK node state from offline to opening first***********/
/*****So if the open later fails, this node state should be reset to offline*****************/
if (!transitionZookeeperOfflineToOpening(encodedName)) {
      LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
          encodedName);
      return;
}
/******The comment seems correct, but the failure is in fact not handled**********************/
// Open region.  After a successful open, failures in subsequent
// processing needs to do a close as part of cleanup.
/*********when an IOE happens in openRegion(), region is null************/
      region = openRegion();
/*********So it returns here without resetting the ZK state************/
      if (region == null) return; 
      boolean failed = true;
      if (tickleOpening("post_region_open")) {
        if (updateMeta(region)) failed = false;
      } 
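
In other words, after openRegion() returns null I would expect the handler to put the znode back before returning, roughly as sketched below. This is only an illustration, not a patch; I'm assuming ZKAssign.createOrForceNodeOffline is an appropriate helper for resetting the node and that the handler can reach ZooKeeper through this.server.getZooKeeper().

      region = openRegion();
      if (region == null) {
        // The open failed: try to move the unassigned znode back to
        // OFFLINE (we set it to OPENING above), otherwise the master's
        // next assignment will find it stuck in RS_ZK_REGION_OPENING.
        try {
          ZKAssign.createOrForceNodeOffline(this.server.getZooKeeper(),
            this.regionInfo, this.server.getServerName());
        } catch (KeeperException ke) {
          LOG.error("Failed resetting znode to OFFLINE after failed open of " +
            this.regionInfo.getRegionNameAsString(), ke);
        }
        return;
      }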

Jieshan Bean

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
Sent: 27 May 2011 1:02
To: user@hbase.apache.org
Subject: Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

What's going on in the master? When a region times out in OPENING this
is what the TimeoutMonitor does:

                  // Attempt to transition node into OFFLINE
                  try {
                    data = new RegionTransitionData(
                      EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
                      master.getServerName());
                    if (ZKUtil.setData(watcher, node, data.getBytes(),
                        stat.getVersion())) {
                      // Node is now OFFLINE, let's trigger another assignment
                      ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
                      LOG.info("Successfully transitioned region=" +
                          regionInfo.getRegionNameAsString() + " into OFFLINE" +
                          " and forcing a new assignment");
                      assigns.put(regionState.getRegion(), Boolean.TRUE);
                    }

So the node should be moved into OFFLINE, looks like it's not happening?

J-D

On Thu, May 26, 2011 at 12:15 AM, bijieshan <bi...@huawei.com> wrote:
> As a result, the region could not be opened anymore, because it had fallen into a loop of open attempts that failed every time. The Balancer would skip its run because the region was still in RIT, so the regions looked unbalanced across the regionservers.
>
> I describe the problem step by step as follows:
>
> 1. HMaster sent an open-region message to RS1.
> 2. RS1 received the message and started to open the region. Before opening, it updated the ZK node state from offline to opening.
> 3. An IOException happened during openRegion, so the open failed.
> 4. The ZK node state was still opening.
> 5. The HMaster TimeoutMonitor found the region-open timeout, so it sent the open message again, possibly to RS2.
> 6. RS2 executed the open; while updating the ZK node state, it found an unexpected state, so it failed again.
> 7. Steps 5 to 6 repeat in a loop.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>      /************************************************************************/
>      /*********IOException happened, region is null***************************/
>      /************************************************************************/
>      region = openRegion();
>      /************************************************************************/
>      /*********(region == null) is true, so return directly*******************/
>      /************************************************************************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>      /************************************************************************/
>      /*********IOException happened here.. ***********************************/
>      /************************************************************************/
>        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>

Re: HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by Jean-Daniel Cryans <jd...@apache.org>.
What's going on in the master? When a region times out in OPENING this
is what the TimeoutMonitor does:

                  // Attempt to transition node into OFFLINE
                  try {
                    data = new RegionTransitionData(
                      EventType.M_ZK_REGION_OFFLINE, regionInfo.getRegionName(),
                      master.getServerName());
                    if (ZKUtil.setData(watcher, node, data.getBytes(),
                        stat.getVersion())) {
                      // Node is now OFFLINE, let's trigger another assignment
                      ZKUtil.getDataAndWatch(watcher, node); // re-set the watch
                      LOG.info("Successfully transitioned region=" +
                          regionInfo.getRegionNameAsString() + " into OFFLINE" +
                          " and forcing a new assignment");
                      assigns.put(regionState.getRegion(), Boolean.TRUE);
                    }

So the node should be moved into OFFLINE, looks like it's not happening?

J-D

On Thu, May 26, 2011 at 12:15 AM, bijieshan <bi...@huawei.com> wrote:
> This caused the region to become unopenable: it fell into a loop of open attempts, each of which failed. The Balancer skipped balancing because the region remained in RIT, so the regions looked unbalanced across the regionservers.
>
> I describe the problem step by step as follows:
>
> 1. HMaster sends an open-region message to RS1.
> 2. RS1 receives the message and starts to open the region. Before opening, it updates the ZK node state from OFFLINE to OPENING.
> 3. An IOException happens inside openRegion(), so the open fails.
> 4. The ZK node state is left in OPENING.
> 5. The HMaster TimeoutMonitor notices the open has timed out and sends the open message again, possibly to RS2.
> 6. RS2 tries to open the region, but when it updates the ZK node state it finds an unexpected state, so it fails again.
> 7. Steps 5 to 6 repeat in a loop.
>
> And from the code:
>
> OpenRegionHandler#process
>      if (!transitionZookeeperOfflineToOpening(encodedName)) {
>        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>          encodedName);
>        return;
>      }
>      /************************************************************************/
>      /*********IOException happened, region is null***************************/
>      /************************************************************************/
>      region = openRegion();
>      /************************************************************************/
>      /*********(region == null) is true, so return directly*******************/
>      /************************************************************************/
>      if (region == null) return;
>      boolean failed = true;
>      if (tickleOpening("post_region_open")) {
>        if (updateMeta(region)) failed = false;
>      }
>
> OpenRegionHandler#openRegion
>    HRegion region = null;
>    try {
>      /************************************************************************/
>      /*********IOException happened here.. ***********************************/
>      /************************************************************************/
>        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
>        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
>        new CancelableProgressable() {
>          public boolean progress() {
>            return tickleOpening("open_region_progress");
>          }
>        });
>    } catch (IOException e) {
>      LOG.error("Failed open of region=" +
>        this.regionInfo.getRegionNameAsString(), e);
>    }
>
>    return region;
>
> Here's the logs:
> 2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
> java.io.IOException: Exception occured while connecting to the server
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
>        at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
>        at $Proxy6.getFileInfo(Unknown Source)
>        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
>        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
>        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
>        at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
>        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
>        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
>        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
> 2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
> 2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
>

HRegion.openHRegion IOException caused an endless loop of opening—opening failed

Posted by bijieshan <bi...@huawei.com>.
This caused the region to become unopenable: it fell into a loop of open attempts, each of which failed. The Balancer skipped balancing because the region remained in RIT, so the regions looked unbalanced across the regionservers.

I describe the problem step by step as follows:

1. HMaster sends an open-region message to RS1.
2. RS1 receives the message and starts to open the region. Before opening, it updates the ZK node state from OFFLINE to OPENING.
3. An IOException happens inside openRegion(), so the open fails.
4. The ZK node state is left in OPENING.
5. The HMaster TimeoutMonitor notices the open has timed out and sends the open message again, possibly to RS2.
6. RS2 tries to open the region, but when it updates the ZK node state it finds an unexpected state, so it fails again.
7. Steps 5 to 6 repeat in a loop.

And from the code:

OpenRegionHandler#process
      if (!transitionZookeeperOfflineToOpening(encodedName)) {
        LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
          encodedName);
        return;
      }
      /************************************************************************/
      /*********IOException happened, region is null***************************/
      /************************************************************************/
      region = openRegion(); 
      /************************************************************************/
      /*********(region == null) is true, so return directly*******************/    
      /************************************************************************/  
      if (region == null) return;
      boolean failed = true;
      if (tickleOpening("post_region_open")) {
        if (updateMeta(region)) failed = false;
      }

OpenRegionHandler#openRegion
    HRegion region = null;
    try {
      /************************************************************************/
      /*********IOException happened here.. ***********************************/    
      /************************************************************************/
        region = HRegion.openHRegion(this.regionInfo, this.rsServices.getWAL(),
        this.server.getConfiguration(), this.rsServices.getFlushRequester(),
        new CancelableProgressable() {
          public boolean progress() {
            return tickleOpening("open_region_progress");
          }
        });
    } catch (IOException e) {
      LOG.error("Failed open of region=" +
        this.regionInfo.getRegionNameAsString(), e);
    }
    
    return region;
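
To make step 6 concrete: the second regionserver fails because the OFFLINE-to-OPENING transition first checks the current state of the znode. Roughly (a simplified paraphrase of the ZKAssign transition check, not the literal HBase code):

    // Paraphrase of the precondition inside ZKAssign's OFFLINE -> OPENING
    // transition (simplified; not the literal source):
    RegionTransitionData existing = ZKAssign.getData(watcher, encodedName);
    if (existing.getEventType() != EventType.M_ZK_REGION_OFFLINE) {
      // The node is still RS_ZK_REGION_OPENING from the first, failed open,
      // which is exactly the "node existed but was in the state
      // RS_ZK_REGION_OPENING" warning that repeats in the logs below.
      return;   // transition refused, so OpenRegionHandler gives up
    }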

Here's the logs:
2011-05-20 15:49:48,122 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
java.io.IOException: Exception occured while connecting to the server
	at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.retryOperation(RPCRetryAndSwitchInvoker.java:162)
	at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.handleFailure(RPCRetryAndSwitchInvoker.java:118)
	at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:95)
	at $Proxy6.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:889)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:724)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:812)
	at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:409)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:338)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2551)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2537)
	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:272)
	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:99)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
2011-05-20 16:21:27,731 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:21:27,731 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:21:27,732 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,737 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:30:27,738 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:30:27,738 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,747 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.
2011-05-20 16:48:27,747 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempting to transition node d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x3300c164fe0002c Attempt to transition the unassigned node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:48:27,748 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Region was hijacked? It no longer exists, encodedName=d7555a12586e6c788ca55017224b5a51
2011-05-20 16:51:27,748 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,001570,1305873689710.d7555a12586e6c788ca55017224b5a51.