You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/06/04 15:42:00 UTC

[jira] [Commented] (HBASE-20680) Master hung during initialization waiting on hbase:meta to be assigned which never does

    [ https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500394#comment-16500394 ] 

Ted Yu commented on HBASE-20680:
--------------------------------

From regionserver-ctr-e138-1518143905142-340983-03-000014.hwx.site.log which restarted at Fri Jun  1 10:49:46 :
{code}
2018-06-01 10:49:52,588 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: callId: 168 service: AdminService methodName: GetRegionInfo size: 47 connection: 172.27.86.210:47868 deadline: 1527850282584, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
2018-06-01 10:49:52,693 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: callId: 169 service: AdminService methodName: GetRegionInfo size: 47 connection: 172.27.86.210:47868 deadline: 1527850282693, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527850188745
{code}
Such log repeated until:
{code}
2018-06-01 16:24:45,308 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=0,queue=0,port=16020] ipc.RpcServer: callId: 197326 service: AdminService methodName: GetRegionInfo size: 48          connection: 172.27.86.210:47868 deadline: 1527870375308, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on ctr-e138-1518143905142-340983-03-    000014.hwx.site,16020,1527850188745
{code}

> Master hung during initialization waiting on hbase:meta to be assigned which never does
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-20680
>                 URL: https://issues.apache.org/jira/browse/HBASE-20680
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>         Attachments: 20680-logs.tar.gz
>
>
> When running IntegrationTestRSGroups, the test became hung waiting on the master to be initialized.
> The hbase cluster was launched without RSGroup config. The test script adds required RSGroup configs to hbase-site.xml and restarts the cluster.
> It seems that, at one point while the master was trying to assign meta, the destination regionserver was in the middle of going down. This has now left HBase in a state where it starts the regionserver recovery procedures, but never actually gets hbase:meta assigned.
> {code}
> 2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740}]
> 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] wal.WALProcedureStore: hsync completed for hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,                 region=1588230740 checking lock on 1588230740
> 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 for rit=OFFLINE, location=ctr-  e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, table=hbase:meta, region=1588230740 last loc=null
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,                region=1588230740; rit=OFFLINE, location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; forceNewPlan=false, retain=true target svr=null
> {code}
> At Fri Jun  1 10:48:04, master was restarted.
> The new master picked up pid=41:
> {code}
> 2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,                region=1588230740; rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
> {code}
> There was no further log for pid=41 after above.
> Later when master initiated another meta recovery procedure (pid=42), the second procedure seems to be locked out by the former:
> {code}
> 2018-06-01 10:49:34,292 INFO  [PEWorker-2] procedure.MasterProcedureScheduler: pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta,                 region=1588230740, target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 checking lock on 1588230740
> 2018-06-01 10:49:34,293 DEBUG [PEWorker-2] assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, namespaceLocks={}, tableLocks={},                                 regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)