You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/04/27 22:37:03 UTC

[jira] [Commented] (HBASE-3669) Region in PENDING_OPEN keeps being bounced between RS and master

    [ https://issues.apache.org/jira/browse/HBASE-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025987#comment-13025987 ] 

Jean-Daniel Cryans commented on HBASE-3669:
-------------------------------------------

I am witnessing a reproducible bug that looks like this one on the same cluster that gave me the original one, except I simply get it by booting up and the regions are assigning "correctly". In short, I see hundreds of regions timing out while PENDING_OPEN and going through that list and then sending the assigns takes so much time that it's possible for region servers to finally open the region and do the whole reporting process up to deleting the znode before the sending the corresponding assign. For example:

{quote}
2011-04-27 19:12:35,219 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2d627b72230ba8 Async create of unassigned node for 74189964eede07b304ee604043066bb9 with OFFLINE state
2011-04-27 19:12:36,240 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. state=OFFLINE, ts=1303931555219, server=sv2borg185,60020,1303931540964
2011-04-27 19:12:37,953 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. state=OFFLINE, ts=1303931555219
2011-04-27 19:13:13,079 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. state=PENDING_OPEN, ts=1303931557954
2011-04-27 19:13:13,079 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9.
2011-04-27 19:13:17,617 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x2d627b72230ba8 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/74189964eede07b304ee604043066bb9
2011-04-27 19:13:17,648 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x2d627b72230ba8 Retrieved 114 byte(s) of data from znode /hbase/unassigned/74189964eede07b304ee604043066bb9 and set watcher; region=badge_clicks,2��:�X,1288208776331.74189964eede07b304ee604043066bb9., server=sv2borg185,60020,1303931540964, state=RS_ZK_REGION_OPENED
2011-04-27 19:13:17,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=sv2borg185,60020,1303931540964, region=74189964eede07b304ee604043066bb9, which is more than 15 seconds late
2011-04-27 19:14:06,183 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 74189964eede07b304ee604043066bb9; deleting unassigned node
2011-04-27 19:14:06,183 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2d627b72230ba8 Deleting existing unassigned node for 74189964eede07b304ee604043066bb9 that is in expected state RS_ZK_REGION_OPENED
2011-04-27 19:14:06,207 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x2d627b72230ba8 Retrieved 114 byte(s) of data from znode /hbase/unassigned/74189964eede07b304ee604043066bb9; data=region=badge_clicks,2��:�X,1288208776331.74189964eede07b304ee604043066bb9., server=sv2borg185,60020,1303931540964, state=RS_ZK_REGION_OPENED
2011-04-27 19:14:06,315 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x2d627b72230ba8 Received ZooKeeper Event, type=NodeDeleted, state=SyncConnected, path=/hbase/unassigned/74189964eede07b304ee604043066bb9
2011-04-27 19:14:06,315 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2d627b72230ba8 Successfully deleted unassigned node for region 74189964eede07b304ee604043066bb9 in expected state RS_ZK_REGION_OPENED
2011-04-27 19:14:06,330 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. on sv2borg185,60020,1303931540964
2011-04-27 19:14:10,429 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. so generated a random one; hri=badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9., src=, dest=sv2borg172,60020,1303931541060; 19 (online=19, exclude=null) available servers
2011-04-27 19:14:10,429 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region badge_clicks,\x012\xB7\xA3:\x00\x05\x88X,1288208776331.74189964eede07b304ee604043066bb9. to sv2borg172,60020,1303931541060
{quote}

The last two lines are from the same thread that said the region timed out at 19:13:13,079, which is seriously confusing. After that it just fails forever to go out of transition since the region servers see the unassigned znode doesn't exist and refuse to open the region (and can't tell the master).

The end result is that the regions are assigned, hbck reports 0 problem, but the master is stuck with regions in transition and can't balance (and bonus points for printing log lines like mad).

> Region in PENDING_OPEN keeps being bounced between RS and master
> ----------------------------------------------------------------
>
>                 Key: HBASE-3669
>                 URL: https://issues.apache.org/jira/browse/HBASE-3669
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.1
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.90.3, 0.92.0
>
>         Attachments: HBASE-3669-debug-v1.patch
>
>
> After going crazy killing region servers after HBASE-3668, most of the cluster recovered except for 3 regions that kept being refused by the region servers.
> One the master I would see:
> {code}
> 2011-03-17 22:23:14,828 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21. state=PENDING_OPEN, ts=1300400554826
> 2011-03-17 22:23:14,828 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21.
> 2011-03-17 22:23:14,828 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21. state=PENDING_OPEN, ts=1300400554826
> 2011-03-17 22:23:14,828 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21. so generated a random one; hri=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21., src=, dest=sv2borg171,60020,1300399357135; 17 (online=17, exclude=null) available servers
> 2011-03-17 22:23:14,828 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21. to sv2borg171,60020,1300399357135
> {code}
> Then on the region server:
> {code}
> 2011-03-17 22:23:14,829 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x22d627c142707d2 Attempting to transition node f11849557c64c4efdbe0498f3fe97a21 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x22d627c142707d2 Retrieved 166 byte(s) of data from znode /hbase/unassigned/f11849557c64c4efdbe0498f3fe97a21; data=region=supr_rss_items,ea0a3ac6c8779dab:872333599:ed1a7ad00f076fd98fcd3adcd98b62c6,1285707378709.f11849557c64c4efdbe0498f3fe97a21., server=sv2borg180,60020,1300384550966, state=RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x22d627c142707d2 Attempt to transition the unassigned node for f11849557c64c4efdbe0498f3fe97a21 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was in the state RS_ZK_REGION_OPENING
> 2011-03-17 22:23:14,832 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for region=f11849557c64c4efdbe0498f3fe97a21
> {code}
> I'm not sure I fully understand what was going on... the master was suppose to OFFLINE the znode but then that's not what the region server was seeing? In any case, I was able to recover by doing a force unassign for each region and then assign.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira