You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Swapnil Bawaskar (JIRA)" <ji...@apache.org> on 2018/02/01 22:52:40 UTC
[jira] [Closed] (GEODE-3940) Backup can hang while trying to get a
lock
[ https://issues.apache.org/jira/browse/GEODE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Swapnil Bawaskar closed GEODE-3940.
-----------------------------------
> Backup can hang while trying to get a lock
> -------------------------------------------
>
> Key: GEODE-3940
> URL: https://issues.apache.org/jira/browse/GEODE-3940
> Project: Geode
> Issue Type: Bug
> Components: persistence
> Reporter: Lynn Gallinat
> Assignee: Darrel Schneider
> Priority: Major
> Fix For: 1.4.0
>
>
> {noformat}
> Backup can hang when createKrf cannot get the compactor lock.
> "Pooled Message Processor 2" #196 daemon prio=10 os_prio=0 tid=0x00007fef9801e000 nid=0x3cf3 waiting on condition [0x00007ff094cd5000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f1a72c60> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
> at org.apache.geode.internal.cache.Oplog.lockCompactor(Oplog.java:6046) <============ trying to get compactor lock
> at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4157) <============ in createKrf
> at org.apache.geode.internal.cache.Oplog.finishKrf(Oplog.java:7746)
> at org.apache.geode.internal.cache.BackupManager.backupOplog(BackupManager.java:580)
> at org.apache.geode.internal.cache.BackupManager.completeBackup(BackupManager.java:270)
> at org.apache.geode.internal.cache.BackupManager.doBackup(BackupManager.java:139)
> at org.apache.geode.admin.internal.FinishBackupRequest.createResponse(FinishBackupRequest.java:102)
> at org.apache.geode.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:39)
> at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
> at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:923)
> at java.lang.Thread.run(Thread.java:748)
> The compactor thread already has the compactor lock.
> It is waiting for a DiskEntry lock.
> "Idle OplogCompactor" #379 daemon prio=10 os_prio=0 tid=0x00007ff020026000 nid=0x6916 waiting for monitor entry [0x00007ff08c7ce000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.geode.internal.cache.Oplog.writeOneKeyEntryForKRF(Oplog.java:3924)
> - waiting to lock <0x00000000f1c55c70> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2) <====== waiting for DiskEntry lock
> at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4201) <======= already in createKrf
> - locked <0x00000000f1a72c30> (a java.util.concurrent.atomic.AtomicBoolean)
> at org.apache.geode.internal.cache.Oplog$2.run(Oplog.java:3875)
> at org.apache.geode.internal.cache.DiskStoreImpl$5.run(DiskStoreImpl.java:4386)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> The DiskEntry lock is held by this thread and is waiting for replies:
> "PartitionedRegion Message Processor26" #238 daemon prio=10 os_prio=0 tid=0x00007fef9c01f800 nid=0x3dd0 waiting on condition [0x00007ff08ebe9000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f1e42200> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
> at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:718)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:795)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:771)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:858)
> at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:761)
> at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:658)
> at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:264)
> at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:665)
> at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2839)
> - locked <0x00000000f1c55c70> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2) <======= has the DiskEntryLock
> at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
> at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1222)
> at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1205)
> at org.apache.geode.internal.cache.PartitionedRegionDataView.putEntryOnRemote(PartitionedRegionDataView.java:99)
> at org.apache.geode.internal.cache.partitioned.PutMessage.operateOnPartitionedRegion(PutMessage.java:744)
> at org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:333)
> at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
> at org.apache.geode.distributed.internal.DistributionManager$8$1.run(DistributionManager.java:1069)
> at java.lang.Thread.run(Thread.java:748)
> The member replying can't because it's waiting for the backup lock.
> "P2P message reader for rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8(gemfire3_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14506:14506)<ec><v1>:1028 shared ordered uid=6 port=40404" #152 daemon prio=10 os_prio=0 tid=0x00007ff91800d800 nid=0x3bf3 waiting on condition [0x00007ffa29d4a000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f0b63ad0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
> at org.apache.geode.internal.cache.BackupLock.lock(BackupLock.java:95)
> at org.apache.geode.internal.cache.Oplog.basicModify(Oplog.java:4612)
> at org.apache.geode.internal.cache.Oplog.modify(Oplog.java:4450)
> at org.apache.geode.internal.cache.PersistentOplogSet.modify(PersistentOplogSet.java:188)
> at org.apache.geode.internal.cache.DiskStoreImpl.put(DiskStoreImpl.java:730)
> at org.apache.geode.internal.cache.DiskRegion.put(DiskRegion.java:351)
> at org.apache.geode.internal.cache.entries.DiskEntry$Helper.writeBytesToDisk(DiskEntry.java:828)
> at org.apache.geode.internal.cache.entries.DiskEntry$Helper.basicUpdate(DiskEntry.java:932)
> at org.apache.geode.internal.cache.entries.DiskEntry$Helper.update(DiskEntry.java:859)
> - locked <0x00000000f1610f70> (a org.apache.geode.internal.cache.DiskId$PersistenceWithIntOffset)
> at org.apache.geode.internal.cache.entries.AbstractDiskRegionEntry.setValue(AbstractDiskRegionEntry.java:40)
> at org.apache.geode.internal.cache.entries.AbstractRegionEntry.setValueWithTombstoneCheck(AbstractRegionEntry.java:307)
> at org.apache.geode.internal.cache.EntryEventImpl.setNewValueInRegion(EntryEventImpl.java:1651)
> at org.apache.geode.internal.cache.EntryEventImpl.putExistingEntry(EntryEventImpl.java:1527)
> at org.apache.geode.internal.cache.AbstractRegionMap.updateEntry(AbstractRegionMap.java:2993)
> at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2816)
> - locked <0x00000000f1610f28> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
> at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
> at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
> at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5584)
> at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:165)
> at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:272)
> at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:243)
> at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1190)
> at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1091)
> at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
> at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
> at org.apache.geode.distributed.internal.DistributionManager.scheduleIncomingMessage(DistributionManager.java:3552)
> at org.apache.geode.distributed.internal.DistributionManager.handleIncomingDMsg(DistributionManager.java:3186)
> at org.apache.geode.distributed.internal.DistributionManager$MyListener.messageReceived(DistributionManager.java:4361)
> at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.dispatchMessage(GMSMembershipManager.java:1127)
> at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.handleOrDeferMessage(GMSMembershipManager.java:1045)
> at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager$MyDCReceiver.messageReceived(GMSMembershipManager.java:408)
> at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:714)
> at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:874)
> at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3966)
> at org.apache.geode.internal.tcp.Connection.processNIOBuffer(Connection.java:3552)
> at org.apache.geode.internal.tcp.Connection.runNioReader(Connection.java:1828)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1689)
> at java.lang.Thread.run(Thread.java:748)
> Note that the replying member is also the member that initiated the backup:
> "vm_4_thr_43_client5_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14524" #480 daemon prio=5 os_prio=0 tid=0x00007ff99c0b2800 nid=0x6911 waiting on condition [0x00007ffa208cf000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000000f214ef08> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:77)
> at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:694)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
> at org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
> at org.apache.geode.admin.internal.FinishBackupRequest.send(FinishBackupRequest.java:80)
> at org.apache.geode.admin.internal.BackupDataStoreHelper.backupAllMembers(BackupDataStoreHelper.java:47)
> at org.apache.geode.internal.cache.BackupUtil.backupAllMembers(BackupUtil.java:50)
> at org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2315)
> at org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2310)
> at org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2299)
> at util.PersistenceUtil.doOnlineBackup(PersistenceUtil.java:627)
> at parReg.ParRegTest.doConcOpsAndVerify(ParRegTest.java:1822)
> at parReg.ParRegTest.HydraTask_doConcOpsAndVerify(ParRegTest.java:1011)
> at sun.reflect.GeneratedMethodAccessor401.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at hydra.MethExecutor.execute(MethExecutor.java:182)
> at hydra.MethExecutor.execute(MethExecutor.java:150)
> at hydra.TestTask.execute(TestTask.java:191)
> at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212)
> The replying member is still holding the backup lock because it waits for all other members to reply to the FinishBackupRequest message
> before it processes its backup and releases the lock:
> From FinishBackupRequest.java
> public static Map<DistributedMember, Set<PersistentID>> send(DM dm, Set recipients,
> File targetDir, File baselineDir, boolean abort) {
> FinishBackupRequest request = new FinishBackupRequest(targetDir, baselineDir, abort);
> request.setRecipients(recipients);
> FinishBackupReplyProcessor replyProcessor = new FinishBackupReplyProcessor(dm, recipients);
> request.msgId = replyProcessor.getProcessorId();
> dm.putOutgoing(request);
> try {
> replyProcessor.waitForReplies(); <======= wait for all other members (but they won't finish until this member does its work; this is where we are stuck)
> } catch (ReplyException e) {
> if (!(e.getCause() instanceof CancelException)) {
> throw e;
> }
> } catch (InterruptedException e) {
> e.printStackTrace();
> }
> AdminResponse response = request.createResponse((DistributionManager) dm); <====== this and the following lines do this member's work
> response.setSender(dm.getDistributionManagerId());
> replyProcessor.process(response);
> return replyProcessor.results;
> }
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)