Posted to dev@hbase.apache.org by "Sanjeet Nishad (Jira)" <ji...@apache.org> on 2020/12/10 13:57:00 UTC

[jira] [Created] (HBASE-25381) RegionServer ignored a procedure (closeRegionProcedure) due to duplicate pid which lead region to stuck in RIT.

Sanjeet Nishad created HBASE-25381:
--------------------------------------

             Summary: RegionServer ignored a procedure (closeRegionProcedure) due to duplicate pid which lead region to stuck in RIT.
                 Key: HBASE-25381
                 URL: https://issues.apache.org/jira/browse/HBASE-25381
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.2.3
            Reporter: Sanjeet Nishad


Analysis:
1. After an HMaster failover, the master's in-memory proc-id was reset.
2. On a new DisableTable client request, the master dispatched a CloseRegionProcedure to the RS and suspended the proc.
3. The RS ignored this CloseRegionProcedure request without doing anything, because it had already executed a procedure with the same id.

Since no UnAssignRegionHandler was created at step 3, the RS did not send any reportRegionStateTransition to the HM. On the HMaster side the procedure remained in the suspended state, because we wake the suspended procedure on reportRegionStateTransition. So the region stays stuck in RIT until we restart the HM or the RS.
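
To make the sequence in the analysis easier to follow, here is a minimal sketch of the collision. It is a simplified model and not the HBase source: MasterModel, RegionServerModel and their methods are hypothetical names, and the counter start values (13 before failover, 11 after) are assumptions chosen only to match the pids in the logs of this report.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model for illustration only; none of these classes are HBase source.
final class DuplicatePidSketch {

    /** Hypothetical stand-in for the master's in-memory proc-id counter. */
    static final class MasterModel {
        private long nextPid;

        MasterModel(long firstPid) {
            this.nextPid = firstPid;
        }

        long allocatePid() {
            return nextPid++;
        }
    }

    /** Hypothetical stand-in for the RS-side "already executed" check. */
    static final class RegionServerModel {
        private final Set<Long> executedPids = new HashSet<>();

        /**
         * Returns true if a handler would be scheduled (and a state
         * transition reported back later), false if the request is ignored.
         */
        boolean submitRegionProcedure(long pid) {
            // Set.add returns false when the pid was seen before.
            return executedPids.add(pid);
        }
    }

    public static void main(String[] args) {
        RegionServerModel rs = new RegionServerModel();

        // Before failover (assumed start value): the open of hbase:namespace
        // reaches the RS with pid=13.
        MasterModel oldMaster = new MasterModel(13L);
        long openPid = oldMaster.allocatePid();
        System.out.println("open  pid=" + openPid
            + " accepted=" + rs.submitRegionProcedure(openPid));

        // After failover (assumed reset value): the counter starts lower, so
        // DisableTable -> TransitRegionState -> CloseRegion get 11, 12, 13.
        MasterModel newMaster = new MasterModel(11L);
        long disablePid = newMaster.allocatePid();
        long transitPid = newMaster.allocatePid();
        long closePid = newMaster.allocatePid();
        boolean accepted = rs.submitRegionProcedure(closePid);
        System.out.println("close pid=" + closePid + " (ppid=" + transitPid
            + ", root=" + disablePid + ") accepted=" + accepted);

        // When accepted is false no handler runs, nothing is reported back,
        // and the suspended parent procedure on the master is never woken.
    }
}
```

In this model the second submission with pid=13 is rejected. That corresponds to the "already executed, just ignore it" warning on the RS, after which the master-side procedure has nothing to wake it.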

 

Observed the following log on the RS side while trying to disable table 't2':
{code:java}
2020-12-08 10:18:23,216 | WARN | RpcServer.priority.RWQ.Fifo.read.handler=164,queue=2,port=21302 | Received procedure pid=13, which already executed, just ignore it | org.apache.hadoop.hbase.regionserver.HRegionServer.submitRegionProcedure(HRegionServer.java:4146){code}
This pid=13 had already been used by the RS for opening hbase:namespace:
{code:java}
2020-12-08 10:11:40,793 | INFO | RS_OPEN_PRIORITY_REGION-regionserver/a.b.c.d:efg-0 | Post open deploy tasks for hbase:namespace,,1607152197100.cffc166aa75ee4ddf8a210ca02da1ea1., pid=13, masterSystemTime=1607393499851 | org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2422){code}
So the region of table 't2' was stuck in RIT because the CloseRegionProcedure was stuck on the master side indefinitely:
{code:java}
2020-12-08 10:18:23,039 | INFO | PEWorker-15 | Updated tableName=t2, state=DISABLING in hbase:meta | org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1770)
2020-12-08 10:18:23,040 | INFO | PEWorker-15 | Set t2 to state=DISABLING | org.apache.hadoop.hbase.master.procedure.DisableTableProcedure.setTableStateToDisabling(DisableTableProcedure.java:296)
2020-12-08 10:18:23,042 | INFO | PEWorker-15 | Initialized subprocedures=[{pid=12, ppid=11, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=t2, region=213e5f89d48161a93b226ba2717b14fd, UNASSIGN}] | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1704)
2020-12-08 10:18:23,045 | INFO | PEWorker-2 | Took xlock for pid=12, ppid=11, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=t2, region=213e5f89d48161a93b226ba2717b14fd, UNASSIGN | org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler.waitRegions(MasterProcedureScheduler.java:737)
2020-12-08 10:18:23,047 | INFO | PEWorker-2 | pid=12 updating hbase:meta row=213e5f89d48161a93b226ba2717b14fd, regionState=CLOSING, regionLocation=100-112-24-246,21302,1607392837508 | org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateUserRegionLocation(RegionStateStore.java:217)
2020-12-08 10:18:23,055 | INFO | PEWorker-2 | Initialized subprocedures=[{pid=13, ppid=12, state=RUNNABLE; CloseRegionProcedure 213e5f89d48161a93b226ba2717b14fd, server=100-112-24-246,21302,1607392837508}] | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1704){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)