Posted to issues@geode.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/08/31 21:07:00 UTC
[jira] [Commented] (GEODE-8473) Hang in ReplyProcessor21 when forced-disconnect does not establish a cancellation cause

    [ https://issues.apache.org/jira/browse/GEODE-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188002#comment-17188002 ] 

ASF GitHub Bot commented on GEODE-8473:
---------------------------------------

bschuchardt opened a new pull request #5491:
URL: https://github.com/apache/geode/pull/5491


   During a forced disconnect, ReplyProcessor21 will not stop waiting for responses to a message unless ClusterDistributionManager is informed of the disconnect.  Once informed, ClusterDistributionManager sets a rootCause in its CancelCriterion, which is polled by ReplyProcessor21's StoppableCountDownLatch.
   This commit ensures that ClusterDistributionManager is notified of the disconnect so that it can perform this action.
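
   The polling behaviour described above can be illustrated with a minimal sketch. It is not Geode's code: `PollingLatchSketch`, `CancelFlag`, `awaitOrCancel` and the 50 ms slice are hypothetical names and values chosen for illustration, and only `java.util.concurrent` classes are used. The sketch assumes the waiter blocks in short timed slices and checks for a recorded root cause between slices, so a waiter on a latch that is never counted down can still return once a root cause is set.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PollingLatchSketch {

  /** Hypothetical stand-in for a cancel criterion; not Geode's CancelCriterion API. */
  static final class CancelFlag {
    private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

    /** Records the first reason for shutting down; later calls are ignored. */
    void setRootCause(Throwable cause) {
      rootCause.compareAndSet(null, cause);
    }

    /** Throws if a root cause has been recorded, otherwise returns quietly. */
    void checkCancelInProgress() {
      Throwable cause = rootCause.get();
      if (cause != null) {
        throw new IllegalStateException("shutting down", cause);
      }
    }
  }

  /** Waits in short slices so that the flag is polled between slices. */
  static void awaitOrCancel(CountDownLatch latch, CancelFlag flag, long sliceMillis)
      throws InterruptedException {
    while (true) {
      flag.checkCancelInProgress();
      if (latch.await(sliceMillis, TimeUnit.MILLISECONDS)) {
        return; // the latch reached zero: all replies arrived
      }
    }
  }

  /** Starts a waiter on a latch that is never counted down, then sets the root cause. */
  static boolean waiterStopsAfterCancel() {
    CountDownLatch latch = new CountDownLatch(1);
    CancelFlag flag = new CancelFlag();
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Thread waiter = new Thread(() -> {
      try {
        awaitOrCancel(latch, flag, 50);
      } catch (Exception e) {
        seen.set(e); // remember why the wait ended
      }
    });
    waiter.setDaemon(true);
    waiter.start();
    flag.setRootCause(new RuntimeException("forced disconnect"));
    try {
      waiter.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !waiter.isAlive() && seen.get() instanceof IllegalStateException;
  }

  public static void main(String[] args) {
    System.out.println("waiter stopped after cancel: " + waiterStopsAfterCancel());
  }
}
```

   If `setRootCause` is never called in this sketch, the waiter loops indefinitely, which is the shape of the hang reported in the ticket.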
   
   This is a follow-up PR to GEODE-8467, which ensures that a DisconnectThread is launched to execute the GMSMembership.uncleanShutdown() method.
   
   @kamilla1201 
   
   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Once the PR is submitted, please check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Hang in ReplyProcessor21 when forced-disconnect does not establish a cancellation cause
> ---------------------------------------------------------------------------------------
>
>                 Key: GEODE-8473
>                 URL: https://issues.apache.org/jira/browse/GEODE-8473
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.13.0
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>
> I suspect this is due to the recent Membership refactoring.  In a test that exposed GEODE-8467, I saw an application thread from before the forced disconnect still hanging around, waiting for a response.
> {noformat}
>    java.lang.Thread.State: TIMED_WAITING (parking)
>      at sun.misc.Unsafe.park(Native Method)
>      - parking to wait for  <0x00000000ea5c43c0> (a java.util.concurrent.CountDownLatch$Sync)
>      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>      at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>      at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>      at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
>      at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
>      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
>      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
>      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
>      at org.apache.geode.internal.cache.partitioned.SizeMessage$SizeResponse.waitBucketSizes(SizeMessage.java:344)
>      at org.apache.geode.internal.cache.PartitionedRegion.getSizeRemotely(PartitionedRegion.java:6752)
>      at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6703)
>      at org.apache.geode.internal.cache.PartitionedRegion.entryCount(PartitionedRegion.java:6685)
>      at org.apache.geode.internal.cache.PartitionedRegion.getRegionSize(PartitionedRegion.java:6657)
>      at org.apache.geode.internal.cache.LocalRegionDataView.entryCount(LocalRegionDataView.java:99)
>      at org.apache.geode.internal.cache.LocalRegion.entryCount(LocalRegion.java:2078)
>      at org.apache.geode.internal.cache.LocalRegion.size(LocalRegion.java:8288)
>      at util.TestHelper.getRegionStr(TestHelper.java:1669)
>      at util.TestHelper.regionHierarchyToString(TestHelper.java:1654)
>      at util.TestHelper.logRegionHierarchy(TestHelper.java:1639)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)
>      at hydra.MethExecutor.execute(MethExecutor.java:173)
>      at hydra.MethExecutor.execute(MethExecutor.java:141)
>      at hydra.TestTask.execute(TestTask.java:197)
>      at hydra.RemoteTestModule$1.run(RemoteTestModule.java:213)
> {noformat}
> ReplyProcessor21 uses a StoppableCountDownLatch to wait for a response.  This latch loops while waiting for the countdown, but it also checks ClusterDistributionManager's CancelCriterion to see whether the system is shutting down.  If so, it stops waiting for a response.
> Due to GEODE-8467, the thread that sets the CancelCriterion's shutdown "rootCause" is never started.  Either Membership needs to ensure that this upward notification happens, or ClusterDistributionManager's CancelCriterion needs to check with the Services.Stopper in GMSMembership to see whether a "rootCause" has been established there.
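
The two alternatives in the last paragraph can be contrasted in a small sketch. Nothing here is Geode's API: `RootCauseSketch`, `LowerStopper`, `UpperCriterion` and their methods are hypothetical stand-ins for the membership-level stopper and the manager-level cancel criterion. In the "push" variant the lower layer's cause must be handed upward explicitly; in the "pull" variant the upper criterion falls back to asking the lower layer whenever it has no cause of its own.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class RootCauseSketch {

  /** Hypothetical lower layer: records why it stopped. */
  static final class LowerStopper {
    private final AtomicReference<Throwable> rootCause = new AtomicReference<>();

    void stop(Throwable cause) {
      rootCause.compareAndSet(null, cause);
    }

    Throwable rootCause() {
      return rootCause.get();
    }
  }

  /** Hypothetical upper-layer criterion with an optional fallback source. */
  static final class UpperCriterion {
    private final AtomicReference<Throwable> rootCause = new AtomicReference<>();
    private final Supplier<Throwable> fallback;

    UpperCriterion(Supplier<Throwable> fallback) {
      this.fallback = fallback;
    }

    /** Push variant: someone must call this to deliver the cause upward. */
    void setRootCause(Throwable cause) {
      rootCause.compareAndSet(null, cause);
    }

    /** Returns this layer's own cause if set, otherwise whatever the fallback reports. */
    Throwable cancelCause() {
      Throwable own = rootCause.get();
      return own != null ? own : fallback.get();
    }
  }

  /** Push: the upper layer sees nothing until the notification is delivered. */
  static boolean pushSeesCauseOnlyAfterNotification() {
    LowerStopper lower = new LowerStopper();
    UpperCriterion upper = new UpperCriterion(() -> null); // no fallback configured
    Throwable cause = new RuntimeException("forced disconnect");
    lower.stop(cause);
    boolean unawareBefore = upper.cancelCause() == null;
    upper.setRootCause(cause); // the upward notification
    return unawareBefore && upper.cancelCause() == cause;
  }

  /** Pull: the upper layer consults the lower layer, so no notification is needed. */
  static boolean pullSeesCauseWithoutNotification() {
    LowerStopper lower = new LowerStopper();
    UpperCriterion upper = new UpperCriterion(lower::rootCause);
    Throwable cause = new RuntimeException("forced disconnect");
    lower.stop(cause);
    return upper.cancelCause() == cause;
  }

  public static void main(String[] args) {
    System.out.println("push: " + pushSeesCauseOnlyAfterNotification());
    System.out.println("pull: " + pullSeesCauseWithoutNotification());
  }
}
```

The push variant keeps the layers decoupled but depends on the notification actually being delivered; the pull variant tolerates a lost notification at the cost of the upper layer knowing about the lower one.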



--
This message was sent by Atlassian Jira
(v8.3.4#803005)