Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2018/12/06 09:58:00 UTC

[jira] [Comment Edited] (HBASE-21559) The RestoreSnapshotFromClientTestBase related UT are flaky

    [ https://issues.apache.org/jira/browse/HBASE-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711214#comment-16711214 ] 

Zheng Hu edited comment on HBASE-21559 at 12/6/18 9:57 AM:
-----------------------------------------------------------

Yeah, it's a deadlock. The SplitTableRegionProcedure grabs the table lock and then waits for the SnapshotManager object lock, while the SnapshotManager thread holds the SnapshotManager object lock and waits for the table lock?

The SplitTableRegionProcedure stack: 
{code}
Thread 527 (PEWorker-1):
  State: BLOCKED
  Blocked count: 10
  Waited count: 89
  Blocked on org.apache.hadoop.hbase.master.snapshot.SnapshotManager@51c5c8d5
  Blocked by 412 (RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=53736)
  Stack:
    org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isTakingSnapshot(SnapshotManager.java:423)
    org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.prepareSplitRegion(SplitTableRegionProcedure.java:470)
    org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:244)
    org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:97)
    org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1723)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
    org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
{code}

And the SnapshotManager trace: 
{code}
Thread 412 (RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=53736):
  State: TIMED_WAITING
  Blocked count: 60
  Waited count: 359
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.tryAcquire(LockManager.java:162)
    org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.acquire(LockManager.java:123)
    org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.prepare(TakeSnapshotHandler.java:141)
    org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:60)
    org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:46)
    org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:524)
    org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotEnabledTable(SnapshotManager.java:510)
    org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshotInternal(SnapshotManager.java:633)
    org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshot(SnapshotManager.java:570)
    org.apache.hadoop.hbase.master.MasterRpcServices.snapshot(MasterRpcServices.java:1502)
    org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}
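
To illustrate the lock-order inversion that the two traces suggest, here is a minimal, self-contained sketch. It is not HBase code: the class LockOrderInversionSketch, the Manager class, and the ReentrantLock standing in for the table lock are all hypothetical. One thread takes the "table lock" and then calls a synchronized method on the manager, while the other is already inside a synchronized method on the same manager and asks for the table lock. The sketch uses a bounded tryLock so that it terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not HBase code: two threads take the same two locks in opposite order.
class LockOrderInversionSketch {
    // Stand-in for the table lock (assumption: modeled as a plain ReentrantLock).
    static final ReentrantLock tableLock = new ReentrantLock();

    // Stand-in for SnapshotManager: both methods synchronize on the same object monitor.
    static class Manager {
        synchronized boolean isTakingSnapshot() {
            return false;
        }

        synchronized boolean takeSnapshot(CountDownLatch monitorHeld, CountDownLatch tableHeld)
                throws InterruptedException {
            monitorHeld.countDown();   // this thread now owns the manager monitor
            tableHeld.await();         // wait until the other thread owns the table lock
            // Bounded wait so the sketch terminates; an unbounded wait would never return.
            boolean acquired = tableLock.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                tableLock.unlock();
            }
            return acquired;
        }
    }

    // Returns whether the "snapshot" thread managed to get the table lock.
    static boolean run() {
        Manager manager = new Manager();
        CountDownLatch monitorHeld = new CountDownLatch(1);
        CountDownLatch tableHeld = new CountDownLatch(1);

        // Plays the role of the split procedure: table lock first, manager monitor second.
        Thread split = new Thread(() -> {
            tableLock.lock();
            try {
                monitorHeld.await();
                tableHeld.countDown();
                manager.isTakingSnapshot(); // BLOCKED until takeSnapshot() returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                tableLock.unlock();
            }
        }, "split-worker");
        split.start();

        try {
            // Plays the role of the snapshot RPC handler: manager monitor first, table lock second.
            boolean acquired = manager.takeSnapshot(monitorHeld, tableHeld);
            split.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("snapshot thread got table lock: " + run()); // prints false
    }
}
```

In this sketch the snapshot side can never get the table lock while it holds the manager monitor, because the split side cannot release the table lock until it has entered the monitor. Taking both locks in the same order on both paths, or not holding the monitor while waiting for the table lock, would remove the cycle.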


> The RestoreSnapshotFromClientTestBase related UT are flaky
> ----------------------------------------------------------
>
>                 Key: HBASE-21559
>                 URL: https://issues.apache.org/jira/browse/HBASE-21559
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Major
>             Fix For: 3.0.0, 2.1.2, 2.0.4, 2.0.5
>
>         Attachments: TEST-org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions.xml, org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions-output.txt, org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions.txt
>
>
> The related UTs are:
> * TestRestoreSnapshotFromClientAfterSplittingRegions
> * TestRestoreSnapshotFromClientWithRegionReplicas
> * TestMobRestoreSnapshotFromClientAfterSplittingRegions
> I guess the main problem is a deadlock between SplitTableRegionProcedure and SnapshotProcedure.
> Attached logs from the failed UT. 


