Posted to issues@geode.apache.org by "Alberto Gomez (Jira)" <ji...@apache.org> on 2022/09/06 13:16:00 UTC

[jira] [Updated] (GEODE-10403) Distributed deadlock when stopping gateway sender

     [ https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alberto Gomez updated GEODE-10403:
----------------------------------
    Fix Version/s: 1.16.0

> Distributed deadlock when stopping gateway sender
> -------------------------------------------------
>
>                 Key: GEODE-10403
>                 URL: https://issues.apache.org/jira/browse/GEODE-10403
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>             Fix For: 1.16.0
>
>
> A distributed deadlock was found while testing a Geode system with WAN replication. It occurs when the gateway sender is stopped while a fair number of operations are being sent to the servers.
> The deadlock manifests as the gateway sender stop command hanging forever and as all normal Geode operations from clients (gets, puts, ...) going unanswered.
> The situation is triggered by the gateway sender stop command, which first takes the lifecycle lock and then, at a given point, tries to retrieve the size of the gateway sender queue. This operation, which requires communication with the other peers, never finishes, probably because the response from one of the peers is never received.
> Another thread is blocked trying to acquire the lifecycle lock in AbstractGatewaySender.distribute().
> Finally, many threads handling Geode operations (get, put, ...) are blocked in the DistributedCacheOperation._distribute() call, waiting for a response from another peer.
> Thread dump section for the blocked gateway sender stop command, in the call that gets the size of the queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x00007f92bc1c2000 nid=0x2154 waiting on condition  [0x00007f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031ca2be50> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)}}
> {{        at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)}}
> {{        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section for the blocked AbstractGatewaySender.distribute() call trying to acquire the lifecycle lock:
> {{"P2P message reader for 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on condition  [0x00007f93e59d0000]}}
> {{   java.lang.Thread.State: WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00000000ed9cb9f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)}}
> {{        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)}}
> {{        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)}}
> {{        at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)}}
> {{        at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown Source)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)}}
> {{        at org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)}}
> {{        at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)}}
> {{        at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)}}
> {{        at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)}}
> {{        at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)}}
> {{        at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)}}
> {{        at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)}}
> {{        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section for a blocked call to DistributedCacheOperation._distribute() waiting for a response from a remote peer:
> {{"ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on condition  [0x00007f8196f57000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031befad38> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)}}
> {{        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown Source)}}
> {{        at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)}}
> {{        at org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown Source)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
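
The lock interaction described in the report can be illustrated with a minimal, self-contained Java sketch. It is not Geode code: the class LifecycleLockSketch, its fields, and its methods are hypothetical names chosen for illustration, and the remote size request is modeled by a plain CountDownLatch that stands in for the peer reply. The sketch only shows the local pattern visible in the thread dumps: one thread holds the write side of a ReentrantReadWriteLock while waiting for a reply, and a second thread cannot take the read side in the meantime. The read lock attempt is timed here only so that the demo terminates; in the dump above the reader calls ReadLock.lock() and waits indefinitely.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LifecycleLockSketch {
    // Hypothetical stand-in for the gateway sender's lifecycle lock.
    private final ReentrantReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
    // Hypothetical stand-in for the reply a remote peer sends to the size request.
    private final CountDownLatch peerReply = new CountDownLatch(1);
    // Lets the demo know that stop() already holds the write lock.
    private final CountDownLatch stopHoldsLock = new CountDownLatch(1);

    // Models the stop path: take the write lock, then block on a remote reply.
    void stop() throws InterruptedException {
        lifecycleLock.writeLock().lock();
        try {
            stopHoldsLock.countDown();
            peerReply.await(); // models waiting for the queue size replies
        } finally {
            lifecycleLock.writeLock().unlock();
        }
    }

    // Models the distribute path: it needs the read lock.
    // Returns false if the lock could not be taken within timeoutMs.
    boolean distribute(long timeoutMs) throws InterruptedException {
        if (!lifecycleLock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            return true; // the event would be enqueued here
        } finally {
            lifecycleLock.readLock().unlock();
        }
    }

    // Returns {distribute result while stop waits, distribute result after stop ends}.
    static boolean[] runDemo() {
        try {
            LifecycleLockSketch sketch = new LifecycleLockSketch();
            Thread stopper = new Thread(() -> {
                try {
                    sketch.stop();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            stopper.start();
            sketch.stopHoldsLock.await();

            // The write lock is held and the reply has not arrived yet.
            boolean whileStopping = sketch.distribute(200);

            // Simulate the missing reply finally arriving.
            sketch.peerReply.countDown();
            stopper.join();
            boolean afterStop = sketch.distribute(200);
            return new boolean[] {whileStopping, afterStop};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("distribute while stop waits: " + result[0]);
        System.out.println("distribute after stop returns: " + result[1]);
    }
}
```

In this sketch the first distribute() attempt fails because the write lock is still held by the thread waiting for the reply, and the second succeeds once the reply has been simulated and stop() has released the lock. If the reply never arrived, as the report suspects, every reader would stay blocked behind the write lock.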



--
This message was sent by Atlassian Jira
(v8.20.10#820010)