You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2019/09/30 07:57:00 UTC

[jira] [Resolved] (HBASE-23079) RegionRemoteProcedureBase should override setTimeoutFailure

     [ https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang resolved HBASE-23079.
-------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Pushed to branch-2.2+.

Thanks [~zghao] for reviewing, and thanks [~Xiaolin Ha] and [~elserj] for analyzing the problem.

> RegionRemoteProcedureBase should override setTimeoutFailure
> -----------------------------------------------------------
>
>                 Key: HBASE-23079
>                 URL: https://issues.apache.org/jira/browse/HBASE-23079
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>            Reporter: Xiaolin Ha
>            Assignee: Duo Zhang
>            Priority: Blocker
>             Fix For: 3.0.0, 2.3.0, 2.2.2
>
>
> In our test environment, we found that, when OpenRegionProcedure failed by timeout, though it doesn't support rollback, ProcedureExcutor will delete the OPR node in the store in its own rollback steps and make the RIT stay there looping rollback exception. ServerCrashProcedure will detect this RIT but cannot deal with it, because no node in the store and it will encounters NPE, as a result, SCP aborts.
>  
> Logs are as follows,
> {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15] org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true found RIT pid=56043, ppid=55625, state=ROLLEDBACK; TransitRegionStateProcedure table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN, location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae
> 2019-08-29,07:27:35,711 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.NullPointerException
>  at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
>  at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
>  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
>  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
>  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
>  at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
>  at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
>  at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
>  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,714 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,716 ERROR [PEWorker-15] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_SPLIT_LOGS
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> {quote}
>  
> -------------
> {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
> hbase.4500.log:2019-08-29,01:40:31,103 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:31,107 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, ppid=56659, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, timestamp=1567014012995
> hbase.4500.log:2019-08-29,01:40:31,186 INFO [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, location=null, table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459 to restore RIT
> hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT rollback...pid=56661, ppid=56659, state=FAILED, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 19.6860sec; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459, ASSIGN
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)