Posted to dev@hbase.apache.org by "Josh Elser (Jira)" <ji...@apache.org> on 2019/09/10 19:39:00 UTC
[jira] [Created] (HBASE-23011) AP stuck in retry loop if underlying table no longer exists

Josh Elser created HBASE-23011:
----------------------------------

             Summary: AP stuck in retry loop if underlying table no longer exists
                 Key: HBASE-23011
                 URL: https://issues.apache.org/jira/browse/HBASE-23011
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.1.6, 2.0.6
            Reporter: Josh Elser
            Assignee: Josh Elser


Looking at a user's issue with [~wchevreuil]... While the details of how exactly we got into this situation are murky, I'm noticing that an AP (AssignProcedure) can get stuck resubmitting itself over and over if, somehow, the table that owns the region being assigned gets deleted.
{noformat}
2019-08-25 23:33:54,588 WARN  [PEWorker-11] assignment.RegionTransitionProcedure: Failed transition, suspend 1secs pid=1100250, ppid=1100195, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=<tablename>, region=<regionid>; rit=OFFLINE, location=null; waiting on rectified condition fixed by other Procedure or operator intervention
org.apache.hadoop.hbase.master.TableStateManager$TableStateNotFoundException: monitoring:test1
	at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:215)
	at org.apache.hadoop.hbase.master.assignment.AssignProcedure.assign(AssignProcedure.java:195)
	at org.apache.hadoop.hbase.master.assignment.AssignProcedure.startTransition(AssignProcedure.java:206)
	at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:364)
	at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:98)
	at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
 {noformat}
The stack trace looks similar to the above.

The problem appears to be that we don't catch the {{TableStateNotFoundException}} coming out of {{TableStateManager#getTableState(TableName)}}. This keeps the AP in a fail/resubmit loop (until, presumably, someone comes along with an `HBCK2 bypass`). This is only a problem in branch-2.0 and branch-2.1. {{TransitRegionStateProcedure}} in branch-2.2+ doesn't have the same issue (at least on the surface).
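
To illustrate the loop, here is a minimal, self-contained sketch. It does not use HBase code: {{AssignRetrySketch}}, its nested exception, the {{Outcome}} enum, and the {{run}} method are hypothetical stand-ins, and the retry loop is bounded only so the simulation terminates (the real procedure keeps resubmitting). It shows one possible shape of a workaround, assuming that a missing table state should end the assignment instead of being treated as a transient failure; the attached patch may do something different.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical simulation; none of these types are the real HBase classes.
final class AssignRetrySketch {

    /** Stand-in for TableStateManager.TableStateNotFoundException. */
    static class TableStateNotFoundException extends Exception {
        TableStateNotFoundException(String table) {
            super(table);
        }
    }

    enum Outcome { ASSIGNED, ABORTED_TABLE_MISSING, GAVE_UP_AFTER_RETRIES }

    /** Stand-in for the table state lookup: throws if the table is unknown. */
    static void getTableState(Set<String> existingTables, String table)
            throws TableStateNotFoundException {
        if (!existingTables.contains(table)) {
            throw new TableStateNotFoundException(table);
        }
    }

    /**
     * Simulates the assign attempt. When catchMissingTable is false, the
     * exception is treated like any other failure and the attempt is retried.
     * When it is true, a missing table ends the procedure immediately.
     */
    static Outcome run(Set<String> existingTables, String table,
                       boolean catchMissingTable, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                getTableState(existingTables, table);
                return Outcome.ASSIGNED;
            } catch (TableStateNotFoundException e) {
                if (catchMissingTable) {
                    // Assumed workaround: the table is gone, so stop retrying.
                    return Outcome.ABORTED_TABLE_MISSING;
                }
                // Otherwise fall through and "resubmit" on the next iteration.
            }
        }
        // Bounded here only so the simulation ends.
        return Outcome.GAVE_UP_AFTER_RETRIES;
    }

    public static void main(String[] args) {
        Set<String> noTables = Collections.emptySet();
        System.out.println(run(noTables, "monitoring:test1", false, 5));
        System.out.println(run(noTables, "monitoring:test1", true, 5));
    }
}
```

In this sketch, the run without the catch exhausts its attempts, while the run with the catch stops on the first attempt.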

As mentioned earlier, it's not clear how we got this SCP(1100195)->AP(1100250) scheduled while the table itself is actually deleted. Some quick attempts to reproduce this locally weren't successful, so I'm not sure I can write a meaningful test. I need to look more closely at that, but I will attach a patch which I think will work around the issue.


