Posted to issues@geode.apache.org by "Dick Cavender (Jira)" <ji...@apache.org> on 2019/09/26 18:05:09 UTC

[jira] [Closed] (GEODE-6904) Reconnecting locator has many hung threads, causing members to startup without cluster configuration

     [ https://issues.apache.org/jira/browse/GEODE-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dick Cavender closed GEODE-6904.
--------------------------------

> Reconnecting locator has many hung threads, causing members to startup without cluster configuration
> ----------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-6904
>                 URL: https://issues.apache.org/jira/browse/GEODE-6904
>             Project: Geode
>          Issue Type: Bug
>          Components: configuration, membership
>            Reporter: Dan Smith
>            Priority: Major
>             Fix For: 1.10.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With the following steps, a locator can get into a state where it is stuck in the middle of reconnecting. It allows members to join the system, but they time out sending it startup messages and start up without cluster configuration, with the result that the cluster cannot be restarted.
> # Start 2 locators and some number of servers
> # Kill one locator and, at the same time, trigger a force disconnect in the remaining locator and the servers
> # Have one of the members take a little time before reconnecting, so that the locator starts recovering the _ConfigurationRegion before that remaining member joins.
>  When this happens, the remaining locator gets hung trying to reconnect the system, waiting in initialization of _ConfigurationRegion for persistent data from the missing locator.
> {noformat}
> "Location services restart thread" #98 daemon prio=5 os_prio=31 tid=0x00007fa382944800 nid=0x9a07 in Object.wait() [0x0000700008943000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.geode.internal.cache.persistence.MembershipChangeListener.waitForChange(MembershipChangeListener.java:62)
> 	- locked <0x00000007be285800> (a org.apache.geode.internal.cache.persistence.MembershipChangeListener)
> 	at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.waitForMembershipChangeForMissingDiskStores(PersistenceInitialImageAdvisor.java:218)
> 	at org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:118)
> 	at org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:835)
> 	at org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
> 	at org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1189)
> 	at org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1074)
> 	at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3002)
> 	at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.getConfigurationRegion(InternalConfigurationPersistenceService.java:840)
> 	at org.apache.geode.distributed.internal.InternalConfigurationPersistenceService.initSharedConfiguration(InternalConfigurationPersistenceService.java:487)
> 	at org.apache.geode.distributed.internal.InternalLocator.startConfigurationPersistenceService(InternalLocator.java:1465)
> 	at org.apache.geode.distributed.internal.InternalLocator.startClusterManagementService(InternalLocator.java:687)
> 	at org.apache.geode.distributed.internal.InternalLocator.restartWithDS(InternalLocator.java:1126)
> 	- locked <0x00000007a0c313b8> (a java.lang.Object)
> 	at org.apache.geode.distributed.internal.InternalLocator.attemptReconnect(InternalLocator.java:1065)
> 	at org.apache.geode.distributed.internal.InternalLocator.lambda$launchRestartThread$1(InternalLocator.java:986)
> 	at org.apache.geode.distributed.internal.InternalLocator$$Lambda$195/681333823.run(Unknown Source)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The above thread holds a static lock, which causes many of the messages that get sent to the locator to hang.
> One of these messages is a StartupMessage. If that message hangs, the member that sent it will time out after 15 seconds and then start up without cluster configuration.
> {noformat}
> [vm4] [warn 2019/06/24 16:26:58.742 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] Membership: startup timed out after waiting 15000 milliseconds for responses from [10.118.20.154(locator-1:66321:locator)<ec><v0>:41002]
> --- The message below is logged because, by not waiting for the reply to the startup message, the member does not discover that the locator has cluster configuration.
> [vm4] [info 2019/06/24 16:26:58.827 PDT <RMI TCP Connection(8)-10.118.20.154> tid=0x15] No locator(s) found with cluster configuration service
> {noformat}
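
The hang described above can be illustrated with a minimal, self-contained sketch. This is not Geode code: the class StaticLockHangSketch, the LOCK object, and the method startupReplyArrives are hypothetical names chosen for illustration, and Thread.sleep stands in for the restart thread waiting on persistent data. It only shows the general pattern, assuming a handler that needs the same static lock before it can reply: while one thread waits with the lock held, the handler is blocked and the sender gives up after its timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StaticLockHangSketch {
  // Hypothetical stand-in for the static lock held by the restart thread.
  private static final Object LOCK = new Object();

  /**
   * Returns true if the simulated reply arrives within timeoutMillis,
   * false if the sender times out because the handler is blocked on LOCK.
   */
  public static boolean startupReplyArrives(long holdMillis, long timeoutMillis)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    CountDownLatch lockHeld = new CountDownLatch(1);
    try {
      // "Restart thread": takes the lock, then waits while still holding it.
      pool.submit(() -> {
        synchronized (LOCK) {
          lockHeld.countDown();
          try {
            Thread.sleep(holdMillis); // stands in for waiting on a missing member
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      lockHeld.await(); // make sure the lock is held before the "message" arrives

      // "Message handler": must acquire the same lock before it can reply.
      Future<String> reply = pool.submit(() -> {
        synchronized (LOCK) {
          return "reply";
        }
      });

      try {
        reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // the sender gives up, as the member does after 15 seconds
      }
    } finally {
      pool.shutdownNow(); // interrupts the sleeping holder so both threads exit
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("reply arrived: " + startupReplyArrives(2000, 200));
  }
}
```

In this sketch the sender times out whenever the lock is held longer than the timeout, and gets its reply once the hold time is shorter than the timeout. The durations are arbitrary and much shorter than the 15 seconds in the log above.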


