Posted to issues@geode.apache.org by "Nabarun Nag (Jira)" <ji...@apache.org> on 2022/05/28 20:00:00 UTC
[jira] [Resolved] (GEODE-10330) Resource issues lead to "MemberDisconnectedException: Member isn't responding to heartbeat requests"

     [ https://issues.apache.org/jira/browse/GEODE-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nabarun Nag resolved GEODE-10330.
---------------------------------
    Resolution: Won't Fix

The old tests have been refactored into new tests that use the new test framework.

> Resource issues lead to "MemberDisconnectedException: Member isn't responding to heartbeat requests"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-10330
>                 URL: https://issues.apache.org/jira/browse/GEODE-10330
>             Project: Geode
>          Issue Type: Bug
>    Affects Versions: 1.16.0
>            Reporter: Donal Evans
>            Assignee: Nabarun Nag
>            Priority: Major
>              Labels: needsTriage
>
> A failure was observed in DistributedMulticastRegionWithUDPSecurityDUnitTest > testMulticastAfterReconnect due to suspect strings with fatal-level logging of "Membership service failure: Member isn't responding to heartbeat requests".
> Investigating the logs showed all members reporting long statistics sampling wakeup delays, indicating resource issues:
> {code:java}
> [vm3] [warn 2022/05/21 07:28:16.251 UTC LocatorWithMcast <StatSampler> tid=0xb8] Statistics sampling thread detected a wakeup delay of 4760 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [locator] [warn 2022/05/21 07:28:20.288 UTC  <StatSampler> tid=0x3b] Statistics sampling thread detected a wakeup delay of 12400 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [vm1] [warn 2022/05/21 07:28:20.969 UTC vm1 <StatSampler> tid=0xda] Statistics sampling thread detected a wakeup delay of 13738 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [vm0] [warn 2022/05/21 07:28:22.226 UTC vm0 <StatSampler> tid=0xa9] Statistics sampling thread detected a wakeup delay of 15110 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> {code}
>  
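The delays quoted above can be pulled out of a log mechanically. The following is a minimal sketch, assuming every warning follows the layout of the four lines shown (member name in the first bracket, then a `[warn <timestamp> UTC ...]` header, then "wakeup delay of N ms"). The function name `parse_wakeup_delays` and the regular expression are illustrative assumptions, not part of Geode or its dev-tools.

```python
import re
from datetime import datetime

# Pattern inferred only from the log lines quoted in this ticket (assumption);
# other Geode log layouts may need a different expression.
WAKEUP_RE = re.compile(
    r"\[(?P<member>[^\]]+)\] "
    r"\[warn (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC"
    r".*?wakeup delay of (?P<delay>\d+) ms"
)


def parse_wakeup_delays(lines):
    """Return (member, timestamp, delay_ms) for each StatSampler delay warning."""
    results = []
    for line in lines:
        match = WAKEUP_RE.search(line)
        if match is None:
            continue  # not a wakeup-delay warning; skip it
        timestamp = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S.%f")
        results.append((match.group("member"), timestamp, int(match.group("delay"))))
    return results
```

Sorting the returned tuples by timestamp gives the earliest warning, which is a reasonable starting time to hand to a tool that reports which tests were running.
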
> After downloading the test artifacts and using the progress tool from the dev-tools directory in the Geode repository, the following tests were found to be running during the resource issues, possibly indicating that one or more of them are particularly resource-intensive:
> {noformat}
> $> progress -r '2022-05-21 07:28:16.251 -0000' | grep org | sort
> {noformat}
> {code:java}
> org.apache.geode.cache.PRCacheListenerWithInterestPolicyAllDistributedTest.afterUpdateIsInvokedInEveryMember[0: redundancy=0]
> org.apache.geode.cache.lucene.LuceneQueriesReindexDUnitTest.recreateIndexWithDifferentFieldsShouldFail(PARTITION_OVERFLOW_TO_DISK) [2]
> org.apache.geode.cache.query.cq.dunit.CqDataUsingPoolOptimizedExecuteDUnitTest.testCQHAWithState
> org.apache.geode.cache.query.cq.dunit.PartitionedRegionCqQueryDUnitTest.testPartitionedCqOnAccessorBridgeServer
> org.apache.geode.cache30.CallbackArgDUnitTest.testForCA
> org.apache.geode.cache30.DistributedMulticastRegionWithUDPSecurityDUnitTest.testMulticastAfterReconnect
> org.apache.geode.cache30.DistributedNoAckRegionCCEOffHeapDUnitTest.testDistributedInvalidate
> org.apache.geode.cache30.GlobalRegionOffHeapDUnitTest.testOrderedUpdates
> org.apache.geode.cache30.ReconnectWithClusterConfigurationDUnitTest.testReconnectAfterMeltdown
> org.apache.geode.distributed.internal.P2PMessagingConcurrencyDUnitTest.testP2PMessaging(true, false, 32768, 65536) [6]
> org.apache.geode.disttx.PRDistTXDUnitTest.testSimulaneousChildRegionCreation
> org.apache.geode.internal.cache.ClientServerTransactionCCEDUnitTest.testClientCommitFunctionWithFailure
> org.apache.geode.internal.cache.eviction.OffHeapEvictionStatsDUnitTest.testHeapLruCounter
> org.apache.geode.internal.cache.wan.concurrent.ConcurrentParallelGatewaySenderOperation_1_DUnitTest.testParallelPropagationSenderStartAfterStopOnAccessorNode
> org.apache.geode.internal.cache.wan.offheap.ParallelGatewaySenderOperationsOffHeapDistributedTest.testParallelGatewaySenderStartOnAccessorNode
> org.apache.geode.internal.cache.wan.serial.SerialWANPropagation_PartitionedRegionDUnitTest.testPartitionedSerialPropagationHA
> org.apache.geode.internal.tcp.TCPConduitDUnitTest.basicAcceptConnection[0]
> org.apache.geode.management.internal.configuration.ClusterConfigImportDUnitTest.importFailWithExistingRegion
> org.apache.geode.rest.internal.web.controllers.RestAPIsOnGroupsFunctionExecutionDUnitTest.testBasicP2PFunctionSelectedGroup[1]
> org.apache.geode.session.tests.Jetty9CachingClientServerTest.failureShouldStillAllowOtherContainersDataAccess
> org.apache.geode.session.tests.Tomcat8ClientServerCustomCacheXmlTest.containersShouldExpireInSetTimeframe
> org.apache.geode.session.tests.Tomcat8Test.containersShouldReplicateCookies
> org.apache.geode.session.tests.Tomcat9ClientServerTest.invalidationShouldRemoveValueAccessForAllContainers
> {code}
> Future tickets for failures caused by this sort of resource issue should also list the concurrently running tests, so that repeat appearances by individual tests can be used to identify the culprits.
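
The cross-referencing suggested above could be done roughly as follows. This is a sketch under stated assumptions: each failure is represented as a list of test names that were running when the resource issue occurred (for example, the sorted output pasted into each ticket), and `repeat_offenders` is a hypothetical helper name, not an existing Geode tool.

```python
from collections import Counter


def repeat_offenders(failures, min_count=2):
    """Rank tests by how many separate failures they were running during.

    failures: mapping of a failure identifier (e.g. a ticket key) to the
              list of test names that were running at the time (assumed format).
    Returns (test_name, count) pairs with count >= min_count, most frequent
    first, ties broken alphabetically.
    """
    counts = Counter()
    for tests in failures.values():
        # Count each test at most once per failure, even if it is listed twice.
        counts.update(set(tests))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, n) for name, n in ranked if n >= min_count]
```

A test that shows up in most or all of the collected lists is a candidate for closer inspection, though appearing often may also simply reflect a long-running test, so the ranking is a hint rather than proof.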


