You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2019/12/19 05:31:00 UTC

[jira] [Commented] (HBASE-23593) Stalled SCP Assigns

    [ https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999751#comment-16999751 ] 

Duo Zhang commented on HBASE-23593:
-----------------------------------

Have you checked the log at RegionServer side?

> Stalled SCP Assigns
> -------------------
>
>                 Key: HBASE-23593
>                 URL: https://issues.apache.org/jira/browse/HBASE-23593
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.2.3
>            Reporter: Michael Stack
>            Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem but from time to time I'll see the SCP stuck waiting on an Assign to finish. The assign seems stuck at the queuing of the OpenRegionProcedure. We've stored the procedure but then not a peek thereafter. Later we'll see complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts which purges the old server from dead server and processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it from the dead servers list
> {code}
>  
> ....even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: Archived processed log hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 to hdfs://nameservice1/hbase/oldWALs/s011.example.                            org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought early purge of deadserver was a problem but I don't think so after study.
> WALS split took two minutes to split and server was removed from dead servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log files in [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119, table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region assign, I see the SCP proc making an assign for this region which then makes a subtask to OpenRegionProcedure. This is where it gets stuck. No progress after this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=421761, ppid=402475, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th rollback step
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)