Posted to issues@geode.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/08/31 20:19:00 UTC

[jira] [Commented] (GEODE-8467) server fails to notify of a ForcedDisconnect and fails to tear down the cache

    [ https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187991#comment-17187991 ] 

ASF GitHub Bot commented on GEODE-8467:
---------------------------------------

bschuchardt opened a new pull request #5490:
URL: https://github.com/apache/geode/pull/5490


   Catch exceptions that occur during XML generation and, when one is
   caught, disable auto-reconnect.
   
   Ensure that the DisconnectThread is launched by placing it in a
   "finally" block.
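
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the class `ForcedDisconnectSketch`, the `XmlGenerator` interface, and the method and field names are all hypothetical stand-ins, and the real Geode membership classes are not used. It only shows the control flow: a failure while generating the XML disables reconnect, and the teardown thread starts from a `finally` block regardless of the outcome.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ForcedDisconnectSketch {

  /** Hypothetical stand-in for whatever produces the cache XML description. */
  public interface XmlGenerator {
    String generate() throws Exception;
  }

  private final AtomicBoolean reconnectEnabled = new AtomicBoolean(true);
  private volatile String cacheXml;

  public boolean isReconnectEnabled() {
    return reconnectEnabled.get();
  }

  public String getCacheXml() {
    return cacheXml;
  }

  /**
   * Assumed shape of a forced-disconnect handler: try to capture the cache
   * XML for a later reconnect, and always launch the teardown thread.
   */
  public void forceDisconnect(XmlGenerator generator, Runnable tearDown) {
    try {
      if (reconnectEnabled.get()) {
        cacheXml = generator.generate();
      }
    } catch (Exception e) {
      // Without this catch the exception would propagate to the caller,
      // where it may be logged at a level nobody sees.
      System.err.println("XML generation failed; disabling auto-reconnect: " + e);
      reconnectEnabled.set(false);
    } finally {
      // Runs whether generation succeeded, failed, or threw an Error,
      // so the teardown is never skipped.
      Thread disconnectThread = new Thread(tearDown, "DisconnectThread");
      disconnectThread.start();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ForcedDisconnectSketch sketch = new ForcedDisconnectSketch();
    CountDownLatch tornDown = new CountDownLatch(1);

    // Simulate XML generation blowing up during a forced disconnect.
    sketch.forceDisconnect(
        () -> {
          throw new IllegalStateException("simulated XML failure");
        },
        tornDown::countDown);

    boolean ran = tornDown.await(5, TimeUnit.SECONDS);
    System.out.println("teardown ran: " + ran);
    System.out.println("reconnect enabled: " + sketch.isReconnectEnabled());
  }
}
```

Placing the thread launch in `finally` rather than after the `try` block means it still runs if something other than an `Exception` (for example an `Error`) escapes from XML generation.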
   
   @kamilla1201 @Bill
   
   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Once the PR is submitted, please check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>
> A test with auto-reconnect enabled failed and hung while restarting a server.  The restarting server was building its cache when it was kicked out of the cluster because of very high load on the test machine.  Membership initiated a forced disconnect
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
>  {noformat}
>  
> and then logged that it was generating a description of the cache
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] generating XML to rebuild the cache after reconnect completes
> {noformat}
>  
> but it never logged completion of this step and never forked a thread to tear down the cache.  Any exception thrown by XML generation would have been caught by JGroups code, which logs such problems at WARNING level.  Because we set JGroups logging to FATAL, the exception would not have appeared in the logs.
> We need to add exception handling around XML generation and, if an exception is caught, disable reconnect attempts and have the server shut down.
> The bug isn't easy to hit.  I've run the test that failed over 5000 times without encountering it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)