Posted to issues@geode.apache.org by "Ernest Burghardt (Jira)" <ji...@apache.org> on 2021/09/09 18:49:00 UTC

[jira] [Assigned] (GEODE-9402) Automatic Reconnect Failure: Address already in use

     [ https://issues.apache.org/jira/browse/GEODE-9402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ernest Burghardt reassigned GEODE-9402:
---------------------------------------

    Assignee:     (was: Nabarun Nag)

> Automatic Reconnect Failure: Address already in use
> ---------------------------------------------------
>
>                 Key: GEODE-9402
>                 URL: https://issues.apache.org/jira/browse/GEODE-9402
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Juan Ramos
>            Priority: Major
>         Attachments: cluster_logs_gke_latest_54.zip, cluster_logs_pks_121.zip
>
>
> The test runs with 2 locators and 4 servers. Once they're all up and running, the test drops the network connectivity between all members to generate a full network partition, causing all members to shut down and go into reconnect mode. Upon reaching that state, the test automatically restores the network connectivity and expects all members to come back up and re-form the distributed system.
>  This works fine most of the time, and we see every member successfully reconnecting to the distributed system:
> {noformat}
> [info 2021/06/23 15:58:12.981 GMT gemfire-cluster-locator-0 <ReconnectThread> tid=0x87] Reconnect completed.
> [info 2021/06/23 15:58:14.726 GMT gemfire-cluster-locator-1 <ReconnectThread> tid=0x86] Reconnect completed.
> [info 2021/06/23 15:58:46.702 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x94] Reconnect completed.
> [info 2021/06/23 15:58:46.485 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x96] Reconnect completed.
> [info 2021/06/23 15:58:46.273 GMT gemfire-cluster-server-2 <ReconnectThread> tid=0x97] Reconnect completed.
> [info 2021/06/23 15:58:46.902 GMT gemfire-cluster-server-3 <ReconnectThread> tid=0x95] Reconnect completed.
> {noformat}
> On rare occasions, though, one of the servers fails during the reconnect phase with the following exception:
> {noformat}
> [error 2021/06/09 18:48:52.872 GMT gemfire-cluster-server-1 <ReconnectThread> tid=0x91] Cache initialization for GemFireCache[id = 575310555; isClosing = false; isShutDownAll = false; created = Wed Jun 09 18:46:49 GMT 2021; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because:
> org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
> 	at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:800)
> 	at org.apache.geode.internal.cache.xmlcache.CacheCreation.create(CacheCreation.java:599)
> 	at org.apache.geode.internal.cache.xmlcache.CacheXmlParser.create(CacheXmlParser.java:339)
> 	at org.apache.geode.internal.cache.GemFireCacheImpl.loadCacheXml(GemFireCacheImpl.java:4207)
> 	at org.apache.geode.internal.cache.ClusterConfigurationLoader.applyClusterXmlConfiguration(ClusterConfigurationLoader.java:197)
> 	at org.apache.geode.internal.cache.GemFireCacheImpl.applyJarAndXmlFromClusterConfig(GemFireCacheImpl.java:1497)
> 	at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1449)
> 	at org.apache.geode.internal.cache.InternalCacheBuilder.create(InternalCacheBuilder.java:191)
> 	at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2668)
> 	at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2426)
> 	at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1277)
> 	at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
> 	at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1183)
> 	at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1807)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.net.BindException: Address already in use (Bind failed)
> 	at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
> 	at java.base/java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:436)
> 	at java.base/java.net.ServerSocket.bind(ServerSocket.java:395)
> 	at org.apache.geode.internal.net.SCClusterSocketCreator.createServerSocket(SCClusterSocketCreator.java:70)
> 	at org.apache.geode.internal.net.SocketCreator.createServerSocket(SocketCreator.java:529)
> 	at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.<init>(AcceptorImpl.java:573)
> 	at org.apache.geode.internal.cache.tier.sockets.AcceptorBuilder.create(AcceptorBuilder.java:291)
> 	at org.apache.geode.internal.cache.CacheServerImpl.createAcceptor(CacheServerImpl.java:420)
> 	at org.apache.geode.internal.cache.CacheServerImpl.start(CacheServerImpl.java:377)
> 	at org.apache.geode.internal.cache.xmlcache.CacheCreation.startCacheServers(CacheCreation.java:796)
> 	... 14 more
> {noformat}
> It seems that the server is trying to bind the port before the old instance has finished shutting down and cleaning up resources, causing the reconnect process to halt and stop retrying, and leaving the cluster with one fewer member.
> We've been able to reproduce the problem only twice in the past few weeks; I've attached the two sets of artefacts to the ticket:
>  - _*cluster_logs_pks_121*_: the member that throws the {{BindException}} during reconnect is {{gemfire-cluster-server-1}}.
>  - _*cluster_logs_gke_latest_54*_: the member that throws the {{BindException}} during reconnect is {{gemfire-cluster-server-0}}.
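
The suspected race described above (a new listener binding the port before the old one has released it) can be illustrated with a small standalone sketch. This is not Geode code and not a proposed fix for the ticket: the class name BindRetrySketch, the method bindWithRetry, and the attempt count and delay values are all hypothetical, chosen only to show one possible way of tolerating a port that is released slightly late. It uses only java.net.ServerSocket from the JDK.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration only; none of these names exist in Geode.
public class BindRetrySketch {

    /**
     * Tries to bind a listening socket to the given port. If the port is
     * still in use, waits delayMillis and tries again, up to maxAttempts
     * times. Rethrows the last BindException if every attempt fails.
     */
    static ServerSocket bindWithRetry(int port, int maxAttempts, long delayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        BindException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                socket.bind(new InetSocketAddress(port));
                return socket; // the port was free, hand the bound socket back
            } catch (BindException e) {
                // Port still held (for example by an instance that is shutting down).
                socket.close();
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            } catch (IOException e) {
                // Any other I/O failure is not retried.
                socket.close();
                throw e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate the "old instance" that still owns the port.
        ServerSocket oldInstance = new ServerSocket(0);
        int port = oldInstance.getLocalPort();

        // Release the port a little later, as a slow shutdown would.
        Thread closer = new Thread(() -> {
            try {
                Thread.sleep(300);
                oldInstance.close();
            } catch (Exception ignored) {
                // nothing to do in this sketch
            }
        });
        closer.start();

        // The "new instance" keeps retrying until the port becomes free.
        try (ServerSocket rebound = bindWithRetry(port, 20, 100)) {
            System.out.println("Rebound on port " + rebound.getLocalPort());
        }
        closer.join();
    }
}
```

In this sketch a single failed bind is not fatal: the caller gives up only after the last attempt. Whether a retry, a wait for the old acceptor to close, or something else is appropriate for the reconnect path is a question for the ticket itself.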



--
This message was sent by Atlassian Jira
(v8.3.4#803005)