Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2018/12/06 09:58:00 UTC
[jira] [Comment Edited] (HBASE-21559) The RestoreSnapshotFromClientTestBase related UT are flaky
[ https://issues.apache.org/jira/browse/HBASE-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711214#comment-16711214 ]
Zheng Hu edited comment on HBASE-21559 at 12/6/18 9:57 AM:
-----------------------------------------------------------
Yeah, it's a deadlock. The SplitTableRegionProcedure holds the table lock and is waiting for the SnapshotManager object lock, while the SnapshotManager thread holds the SnapshotManager object lock and is waiting for the table lock?
The SplitTableRegionProcedure stack:
{code}
Thread 527 (PEWorker-1):
State: BLOCKED
Blocked count: 10
Waited count: 89
Blocked on org.apache.hadoop.hbase.master.snapshot.SnapshotManager@51c5c8d5
Blocked by 412 (RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=53736)
Stack:
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isTakingSnapshot(SnapshotManager.java:423)
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.prepareSplitRegion(SplitTableRegionProcedure.java:470)
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:244)
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:97)
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1723)
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1462)
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2039)
{code}
And the SnapshotManager trace:
{code}
Thread 412 (RpcServer.default.FPBQ.Fifo.handler=3,queue=0,port=53736):
State: TIMED_WAITING
Blocked count: 60
Waited count: 359
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.tryAcquire(LockManager.java:162)
org.apache.hadoop.hbase.master.locking.LockManager$MasterLock.acquire(LockManager.java:123)
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.prepare(TakeSnapshotHandler.java:141)
org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:60)
org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:46)
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:524)
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotEnabledTable(SnapshotManager.java:510)
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshotInternal(SnapshotManager.java:633)
org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshot(SnapshotManager.java:570)
org.apache.hadoop.hbase.master.MasterRpcServices.snapshot(MasterRpcServices.java:1502)
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}
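The two traces show a lock-order inversion: each thread holds one lock and waits for the one the other thread holds. Below is a minimal sketch of that pattern, not HBase code. The class name LockOrderInversionDemo, the helper holdThenTry, and the two ReentrantLock objects standing in for the table lock and the SnapshotManager monitor are all assumptions made for illustration. It uses timed tryLock so the demo ends instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the two threads in the traces; not HBase code. */
final class LockOrderInversionDemo {

  /**
   * Runs two threads that take the same two locks in opposite order.
   * Returns {splitGotManagerLock, snapshotGotTableLock}.
   */
  static boolean[] run(long timeoutMillis) {
    ReentrantLock tableLock = new ReentrantLock();   // assumed stand-in for the table lock
    ReentrantLock managerLock = new ReentrantLock(); // assumed stand-in for the SnapshotManager monitor
    CountDownLatch bothHoldFirst = new CountDownLatch(2);
    CountDownLatch bothTried = new CountDownLatch(2);
    boolean[] acquired = new boolean[2];

    // "split" takes table -> manager, "snapshot" takes manager -> table.
    Thread split = new Thread(
        () -> acquired[0] = holdThenTry(tableLock, managerLock, bothHoldFirst, bothTried, timeoutMillis),
        "split-procedure");
    Thread snapshot = new Thread(
        () -> acquired[1] = holdThenTry(managerLock, tableLock, bothHoldFirst, bothTried, timeoutMillis),
        "snapshot-handler");
    split.start();
    snapshot.start();
    try {
      split.join();
      snapshot.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return acquired;
  }

  /** Holds {@code first}, then makes a timed attempt on {@code second}. */
  private static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
      CountDownLatch bothHoldFirst, CountDownLatch bothTried, long timeoutMillis) {
    first.lock();
    try {
      // Wait until both threads hold their first lock, so the inversion is guaranteed.
      bothHoldFirst.countDown();
      bothHoldFirst.await();
      boolean got = second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
      if (got) {
        second.unlock();
      }
      // Keep holding the first lock until the other thread has also finished its attempt.
      bothTried.countDown();
      bothTried.await();
      return got;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      first.unlock();
    }
  }

  public static void main(String[] args) {
    boolean[] r = run(200);
    System.out.println("split got manager lock: " + r[0]);
    System.out.println("snapshot got table lock: " + r[1]);
  }
}
```

Both attempts time out, so both flags come back false. With an untimed lock() or a synchronized block in place of tryLock, the first thread would stay BLOCKED as in the PEWorker-1 trace above.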
was (Author: openinx):
Yeah, it's a deadlock. The SplitTableRegionProcedure holds the table lock and is waiting for the SnapshotManager object lock, while the SnapshotManager thread holds the SnapshotManager object lock and is waiting for the table lock?
> The RestoreSnapshotFromClientTestBase related UT are flaky
> ----------------------------------------------------------
>
> Key: HBASE-21559
> URL: https://issues.apache.org/jira/browse/HBASE-21559
> Project: HBase
> Issue Type: Bug
> Reporter: Zheng Hu
> Assignee: Zheng Hu
> Priority: Major
> Fix For: 3.0.0, 2.1.2, 2.0.4, 2.0.5
>
> Attachments: TEST-org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions.xml, org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions-output.txt, org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClientAfterSplittingRegions.txt
>
>
> The related UTs are:
> * TestRestoreSnapshotFromClientAfterSplittingRegions
> * TestRestoreSnapshotFromClientWithRegionReplicas
> * TestMobRestoreSnapshotFromClientAfterSplittingRegions
> I guess the main problem is a deadlock between SplitTableRegionProcedure and SnapshotProcedure.
> Logs from the failed UT are attached.