Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2020/06/12 16:38:00 UTC

[jira] [Commented] (HBASE-24545) Add backoff to SCP check on WAL split completion

    [ https://issues.apache.org/jira/browse/HBASE-24545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134366#comment-17134366 ] 

Michael Stack commented on HBASE-24545:
---------------------------------------

To illustrate the problem described, here is where a single thread was hanging out:
{code}
"KeepAlivePEWorker-158" #909 daemon prio=5 os_prio=0 tid=0x0000000001fb5000 nid=0x29e in Object.wait() [0x00007f73fda29000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
        - locked <0x00007f7c64048020> (a org.apache.zookeeper.ClientCnxn$Packet)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2587)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:283)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:502)
        at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
        at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:333)
        - locked <0x00007f76381dc690> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:350)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:335)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:272)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:312)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:197)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1669)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1416)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:79)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1986)
{code}

> Add backoff to SCP check on WAL split completion
> ------------------------------------------------
>
>                 Key: HBASE-24545
>                 URL: https://issues.apache.org/jira/browse/HBASE-24545
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> Crashed cluster with lots of backed-up WALs. On startup, hundreds of servers need recovering; each has a running SCP. Taking a thread dump during recovery, I noticed 160 threads, each in an SCP waiting on split WAL completion. Each thread was scanning the zk splitWAL directory every 100ms. The dir had thousands of entries in it, so each check was pulling down MBs from zk, multiplied by 160 threads (max configured PE threads (16) * 10 for the KeepAlive factor, which sets the max PE worker pool size to 10 * configured PE threads).
> If there are lots of remaining WALs to split, have the SCP back off on its wait so it checks less frequently.
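
For illustration only, the backoff asked for in the description could look roughly like the sketch below. This is not the HBASE-24545 patch: the class `SplitWaitBackoff`, its methods, and the doubling-with-cap policy are all assumptions made for the example. The only figure taken from the issue is the 100ms starting interval.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not HBase code: poll a "remaining tasks" check,
// waiting longer between checks the longer the work stays unfinished.
class SplitWaitBackoff {

    // Wait before the next check: baseMillis doubled per attempt, capped at maxMillis.
    static long waitMillis(long baseMillis, long maxMillis, int attempt) {
        // Bound the shift so the long cannot overflow for small base values.
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(maxMillis, baseMillis << shift);
    }

    // Polls remainingTasks until it reports zero; returns the number of checks made.
    static int waitForCompletion(IntSupplier remainingTasks, long baseMillis, long maxMillis) {
        int checks = 0;
        while (true) {
            checks++;
            if (remainingTasks.getAsInt() <= 0) {
                return checks;
            }
            try {
                Thread.sleep(waitMillis(baseMillis, maxMillis, checks - 1));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for split completion", e);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the expensive listing: reports 3, 2, 1, then 0 remaining tasks.
        int[] remaining = {3};
        int checks = waitForCompletion(() -> remaining[0]--, 100, 5000);
        System.out.println("checks made: " + checks);
    }
}
```

Here the `IntSupplier` stands in for whatever expensive check reports the remaining split tasks, such as the zk directory listing seen in the stack trace. With a 100ms base and a 5000ms cap, the wait grows 100, 200, 400, ... up to 5000ms, so a thread stuck behind a long split settles at one listing every five seconds instead of ten per second.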


