You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Shelley Lynn Hughes-Godfrey (JIRA)" <ji...@apache.org> on 2018/03/19 18:31:00 UTC

[jira] [Created] (GEODE-4892) CI failure: geode-wan distributedTest hung in org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember Ended

Shelley Lynn Hughes-Godfrey created GEODE-4892:
--------------------------------------------------

             Summary: CI failure: geode-wan distributedTest hung in org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember Ended
                 Key: GEODE-4892
                 URL: https://issues.apache.org/jira/browse/GEODE-4892
             Project: Geode
          Issue Type: Bug
          Components: distributed lock service, gfsh
            Reporter: Shelley Lynn Hughes-Godfrey


geode-wan distributed tests hung after 8 hours (usually takes 5-6 hours)

https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/207

The progress files show this DistributedTest run starting ~4 am ... but it also shows this test completing at 9:05 ... which is only 5 hours.  Are we somehow stuck in cache close for 3 hours before the hang is declared while trying to shut down the DUnit VM.

dunit-hangs.txt
```
Started @ 2018-03-17 03:57:50.567 +0000
Ended @ 2018-03-17 03:59:28.268 +0000

Started @ 2018-03-17 04:00:05.872 +0000
Ended @ 2018-03-17 04:52:41.879 +0000

Started @ 2018-03-17 04:53:00.358 +0000
Ended @ 2018-03-17 04:55:07.622 +0000

Started @ 2018-03-17 04:55:33.550 +0000
Ended @ 2018-03-17 07:34:11.639 +0000

Started @ 2018-03-17 07:34:51.022 +0000
Ended @ 2018-03-17 08:08:34.849 +0000

Started @ 2018-03-17 08:09:33.853 +0000
Ended @ 2018-03-17 08:31:43.646 +0000

Started @ 2018-03-17 08:31:54.175 +0000
Ended @ 2018-03-17 08:32:02.523 +0000

Started @ 2018-03-17 08:32:13.946 +0000
Ended @ 2018-03-17 08:32:34.626 +0000

Started @ 2018-03-17 08:33:20.100 +0000
2018-03-17 08:56:15.914 +0000  org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember
Ended @ 2018-03-17 09:05:29.197 +0000
```

Last distributedTest running:
```
Started @ 2018-03-17 08:33:20.100 +0000
2018-03-17 08:56:15.914 +0000  org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember
Ended @ 2018-03-17 09:05:29.197 +0000
```

Stack dumps are taken at 5-10 second intervals and they all look the same ... so it looks like  433 is stuck in waiting during cache close => distributedLockServiceDestroys waiting for replies from 493 for ElderInit.  493 is in ClusterDistributionManager.elderChangeWait().

********* Dumping stack for process 433:
2018-03-17 10:55:59
```
"RMI TCP Connection(1)-172.17.0.3" #30 daemon prio=5 os_prio=0 tid=0x00007f9760001800 nid=0x1ec waiting on condition [0x00007f975bffb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000e08dd870> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
        at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72)
        at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318)
        at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872)
        at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227)
        - locked <0x00000000e09a8768> (a org.apache.geode.distributed.internal.locks.DLockGrantor)
        at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646)
        at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358)
        at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276)
        - locked <0x00000000e040b400> (a java.lang.Object)
        at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214)
        at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84)
        at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderImpl.stop(SerialGatewaySenderImpl.java:136)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2169)
        - locked <0x00000000e03ddf60> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1961)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1951)
        at org.apache.geode.test.junit.rules.ServerStarterRule.stopMember(ServerStarterRule.java:102)
        at org.apache.geode.test.junit.rules.MemberStarterRule.after(MemberStarterRule.java:83)
        at org.apache.geode.test.dunit.rules.ClusterStartupRule.stopElementInsideVM(ClusterStartupRule.java:326)
        at org.apache.geode.test.junit.rules.VMProvider.lambda$stopVM$fe0d42dc$1(VMProvider.java:35)
        at org.apache.geode.test.junit.rules.VMProvider$$Lambda$79/501183203.run(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
```

********* Dumping stack for process 493:
2018-03-17 10:56:02

```
"Pooled Message Processor 1" #179 daemon prio=10 os_prio=0 tid=0x00007f3640002800 nid=0x763 in Object.wait() [0x00007f35edd65000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.elderChangeWait(ClusterDistributionManager.java:3487)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.waitForElder(ClusterDistributionManager.java:3431)
        - locked <0x00000000e069bde0> (a java.lang.Object)
        at org.apache.geode.distributed.internal.locks.ElderInitProcessor$ElderInitMessage.process(ElderInitProcessor.java:178)
        at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
        at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1118)
        at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
        at org.apache.geode.distributed.internal.ClusterDistributionManager$4$1.run(ClusterDistributionManager.java:789)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - <0x00000000e07073a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
```




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)