Posted to issues@geode.apache.org by "Dave Barnes (Jira)" <ji...@apache.org> on 2020/09/10 15:52:13 UTC

[jira] [Closed] (GEODE-8467) server fails to notify of a ForcedDisconnect and fails to tear down the cache

     [ https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Barnes closed GEODE-8467.
------------------------------

> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.14.0
>
>
> A test with auto-reconnect enabled failed and hung while restarting a server.  The restarting server was building its cache when it was kicked out of the cluster due to very high load on the test machine.  Membership initiated a forced disconnect:
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
>  {noformat}
>  
> and then logged that it was generating an XML description of the cache:
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
> {noformat}
>  
> but it never logged completion of this step and never forked a thread to tear down the cache.  Any exception thrown by XML generation would have been caught by JGroups code, which logs the problem at WARNING level.  We have JGroups logging set to FATAL level, so the problem would not have appeared in the logs.
> We need to add exception handling around XML generation and, if an exception is caught, disable reconnect attempts and have the server shut down.
> The bug isn't easy to hit.  I've run the failing test more than 5000 times without encountering it.
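
The proposed handling can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Geode change: the class ForcedDisconnectSketch, the Outcome type, and the methods prepareForTeardown and forkTeardown are all hypothetical names, and the XML generator is modeled as a plain Supplier<String>. The idea is that a failure in XML generation is caught where it happens, logged at a level that will be visible, and turned into a decision to disable reconnect, so that the teardown thread is still forked.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Geode code base.
public class ForcedDisconnectSketch {

    /** Hypothetical result type: what the disconnect handler decided. */
    public static final class Outcome {
        public final String cacheXml;          // null if XML generation failed
        public final boolean reconnectEnabled; // false if XML generation failed

        Outcome(String cacheXml, boolean reconnectEnabled) {
            this.cacheXml = cacheXml;
            this.reconnectEnabled = reconnectEnabled;
        }
    }

    /**
     * Runs the XML generator inside a try/catch so a failure cannot escape
     * into the calling messaging thread, where it might be swallowed.
     */
    public static Outcome prepareForTeardown(Supplier<String> xmlGenerator) {
        try {
            return new Outcome(xmlGenerator.get(), true);
        } catch (Throwable t) {
            // Assumption: catching Throwable is acceptable for this sketch.
            // Real code would likely treat VirtualMachineError separately.
            System.err.println("[fatal] XML generation failed; disabling reconnect: " + t);
            return new Outcome(null, false);
        }
    }

    /** Forks the teardown work onto its own thread and returns that thread. */
    public static Thread forkTeardown(Runnable teardown) {
        Thread thread = new Thread(teardown, "cache-teardown-sketch");
        thread.start();
        return thread;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate the failure described in the report.
        Outcome outcome = prepareForTeardown(() -> {
            throw new IllegalStateException("simulated XML generation failure");
        });
        // Teardown still runs, with reconnect disabled.
        Thread teardown = forkTeardown(() ->
            System.out.println("tearing down cache; reconnect enabled = "
                + outcome.reconnectEnabled));
        teardown.join();
    }
}
```

With this shape the teardown path no longer depends on XML generation succeeding: on failure the server shuts down without attempting to reconnect, and on success the generated XML is kept for the reconnect attempt.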


