Posted to issues@geode.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/08/18 07:43:00 UTC

[jira] [Commented] (GEODE-10403) Distributed deadlock when stopping gateway sender

    [ https://issues.apache.org/jira/browse/GEODE-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581208#comment-17581208 ] 

ASF subversion and git services commented on GEODE-10403:
---------------------------------------------------------

Commit 6c257d7c284ffc9328905ae9ca8fd786b9428ede in geode's branch refs/heads/develop from Alberto Gomez
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=6c257d7c28 ]

GEODE-10403: Fix distributed deadlock with stop gw sender (#7830)

A distributed deadlock can occur when stopping the gateway sender.
It arises from a race condition in which the stop gateway sender
command, which holds the lifecycle lock, blocks indefinitely while
trying to get the size of the queue from remote peers
(ParallelGatewaySenderQueue.size() call), while at the same time
a call to store an event in the queue tries to acquire
that same lifecycle lock.

Under heavy load these two calls can deadlock and
leave the system unresponsive to any traffic request (get, put, ...).

To avoid it, when an event is stored in the gateway
sender queue (AbstractGatewaySender.distribute() call),
the lifecycle lock is no longer acquired with a blocking
call that has no timeout. Instead, the acquire is attempted
with a timeout. If the attempt returns false, the code checks
whether the gateway sender is running. If it is not running,
the event is dropped and there is no need to get the lock.
Otherwise, the lifecycle lock acquire is retried until it succeeds or
the gateway sender is stopped.
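
The retry logic described above can be sketched roughly as follows. This is
a minimal illustration, not the code from the commit: the class
LifecycleLockSketch, its fields, the 100 ms timeout and the Runnable passed
to stop() are all hypothetical stand-ins. It also assumes that the stop path
marks the sender as not running before it makes its blocking call, which is
what lets a waiting distribute() give up.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a gateway sender; not Geode's actual class.
class LifecycleLockSketch {
  private final ReentrantReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
  private final Queue<String> queue = new ConcurrentLinkedQueue<>();
  private volatile boolean running = true;

  /** Returns true if the event was queued, false if it was dropped. */
  boolean distribute(String event) {
    try {
      // Timed attempts instead of an unbounded readLock().lock().
      while (!lifecycleLock.readLock().tryLock(100, TimeUnit.MILLISECONDS)) {
        if (!running) {
          return false; // sender stopped: drop the event, no lock needed
        }
        // Sender still running: loop and try again.
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    try {
      if (!running) {
        return false; // stopped between the last check and the acquire
      }
      queue.add(event);
      return true;
    } finally {
      lifecycleLock.readLock().unlock();
    }
  }

  /** Holds the write lock while running a possibly slow stop action. */
  void stop(Runnable blockingWork) {
    lifecycleLock.writeLock().lock();
    try {
      running = false;    // assumption: flag flips before the slow call
      blockingWork.run(); // stands in for the remote queue size() call
    } finally {
      lifecycleLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    LifecycleLockSketch sender = new LifecycleLockSketch();
    System.out.println(sender.distribute("e1")); // prints true
    sender.stop(() -> {});
    System.out.println(sender.distribute("e2")); // prints false
  }
}
```

With a plain readLock().lock(), a thread calling distribute() while stop()
is stuck in its blocking work would wait for as long as the write lock is
held. With the timed attempt it notices that the sender is no longer running
and returns, so the thread is free to keep processing other messages.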

> Distributed deadlock when stopping gateway sender
> -------------------------------------------------
>
>                 Key: GEODE-10403
>                 URL: https://issues.apache.org/jira/browse/GEODE-10403
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.12.9, 1.13.8, 1.14.4, 1.15.0
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>              Labels: needsTriage, pull-request-available
>
> A distributed deadlock has been found during tests of a Geode system with WAN replication, when stopping the gateway sender while a fair amount of operations was being sent to the servers.
> The distributed deadlock manifests as the gateway sender stop command hanging forever and all normal Geode operations from clients (gets, puts, ...) going unanswered.
> The situation is provoked by the gateway sender stop command, which first takes the lifecycle lock and then, at a given point, tries to retrieve the size of the gateway sender queue. This operation, which requires communication with the other peers, never finishes, probably because the response from one of the peers is never received.
> Another thread is blocked when trying to acquire the lifecycle lock in AbstractGatewaySender.distribute().
> Finally, many threads handling Geode operations (get, put, ...) get blocked in the DistributedCacheOperation._distribute() call, waiting for a response from another peer.
> Thread dump section from blocked gateway sender stop command in call to get size of queue:
> {{"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread1" #1316 daemon prio=10 os_prio=0 cpu=45.55ms elapsed=4152.76s tid=0x00007f92bc1c2000 nid=0x2154 waiting on condition  [0x00007f9179cd2000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031ca2be50> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6758)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6709)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6691)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6663)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8301)}}
> {{        at org.apache.geode.internal.cache.wan.parallel.ParallelGatewaySenderQueue.size(ParallelGatewaySenderQueue.java:1670)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.closeProcessor(AbstractGatewaySenderEventProcessor.java:1259)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor.stopProcessing(AbstractGatewaySenderEventProcessor.java:1247)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1399)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySenderEventProcessor$SenderStopperCallable.call(AbstractGatewaySenderEventProcessor.java:1387)}}
> {{        at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)}}
> {{        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section from the blocked AbstractGatewaySender.distribute() call trying to acquire the lifecycle lock:
> {{"P2P message reader for 192.168.78.164(eric-data-kvdb-ag-server-0:1)<v31>:41000 shared ordered uid=6 local port=60360 remote port=57246" #56 daemon prio=10 os_prio=0 cpu=462104.83ms elapsed=7095.02s tid=0x00007f93a8007800 nid=0x50 waiting on condition  [0x00007f93e59d0000]}}
> {{   java.lang.Thread.State: WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x00000000ed9cb9f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)}}
> {{        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1009)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11.0.11/AbstractQueuedSynchronizer.java:1324)}}
> {{        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:738)}}
> {{        at org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1104)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6144)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6108)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.notifyGatewaySender(BucketRegion.java:719)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5775)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:704)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$515/0x00000008006e0440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x0000000136123330> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006e0040.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006ca440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:194)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)}}
> {{        at org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:278)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1208)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1110)}}
> {{        at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)}}
> {{        at org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager.scheduleIncomingMessage(ClusterDistributionManager.java:2060)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager.handleIncomingDMsg(ClusterDistributionManager.java:1826)}}
> {{        at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$178/0x0000000800380440.messageReceived(Unknown Source)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.dispatchMessage(GMSMembership.java:936)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.handleOrDeferMessage(GMSMembership.java:867)}}
> {{        at org.apache.geode.distributed.internal.membership.gms.GMSMembership.processMessage(GMSMembership.java:1209)}}
> {{        at org.apache.geode.distributed.internal.DistributionImpl$MyDCReceiver.messageReceived(DistributionImpl.java:828)}}
> {{        at org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:614)}}
> {{        at org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:679)}}
> {{        at org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3261)}}
> {{        at org.apache.geode.internal.tcp.Connection.readMessage(Connection.java:2988)}}
> {{        at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2794)}}
> {{        at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1648)}}
> {{        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1479)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}
>  
> Thread dump section from blocked calls to DistributedCacheOperation._distribute() waiting for a response from a remote peer:
> {{"ServerConnection on port 40404 Thread 2" #88 daemon prio=5 os_prio=0 cpu=81268.62ms elapsed=7050.08s tid=0x00007f8160001800 nid=0x73 waiting on condition  [0x00007f8196f57000]}}
> {{   java.lang.Thread.State: TIMED_WAITING (parking)}}
> {{        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)}}
> {{        - parking to wait for  <0x000000031befad38> (a java.util.concurrent.CountDownLatch$Sync)}}
> {{        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1079)}}
> {{        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.11/AbstractQueuedSynchronizer.java:1369)}}
> {{        at java.util.concurrent.CountDownLatch.await(java.base@11.0.11/CountDownLatch.java:278)}}
> {{        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)}}
> {{        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)}}
> {{        at org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:694)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$514/0x00000008006c9c40.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)}}
> {{        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)}}
> {{        - locked <0x00000001eb353770> (a org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapObjectKey)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$513/0x00000008006c9840.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$512/0x00000008006ca440.run(Unknown Source)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)}}
> {{        at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)}}
> {{        at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)}}
> {{        at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2033)}}
> {{        at org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:530)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5578)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1213)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.putInBucket(PartitionedRegion.java:3005)}}
> {{        at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2215)}}
> {{        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5571)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5531)}}
> {{        at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5210)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.command.Put65.cmdExecute(Put65.java:411)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:183)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:848)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:72)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1181)}}
> {{        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)}}
> {{        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:691)}}
> {{        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl$$Lambda$495/0x00000008006be440.invoke(Unknown Source)}}
> {{        at org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:120)}}
> {{        at org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$166/0x000000080034c040.run(Unknown Source)}}
> {{        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)