You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/10 16:05:00 UTC

[jira] [Commented] (GEODE-8238) message loss during shutdown in Shutdown Hook when JVM exits

    [ https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130831#comment-17130831 ] 

ASF GitHub Bot commented on GEODE-8238:
---------------------------------------

bschuchardt opened a new pull request #5232:
URL: https://github.com/apache/geode/pull/5232


   Remove invocation of removeEndpoint when a shared/unordered connection
   shuts down.  Endpoint cleanup is already initiated by DistributionImpl
   during membership view installation, so it isn't needed here.
   
   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
>                 Key: GEODE-8238
>                 URL: https://issues.apache.org/jira/browse/GEODE-8238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>
> In a test I was running a JVM was told to exit and Geode's Shutdown Hook initiated cache shutdown.  This thread hung once in a while either waiting for a reply to a release of a distributed lock or for a reply to a region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel and it's due to the changes for GEODE-7727 ("modify sender thread to detect relese of connection").  Those changes cause the P2P Handshake thread to stay active reading from a shared Connection socket.  Unfortunately, when this thread eventually exits it is invoking removeEndpoint, which closes all the other connections to the other node.  This is causing messages to be lost.
> Here's an example:
> one node (19919) fails to form a connection and invokes removeEndpoint for the other node (23898)
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated, causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 isReceiver=false
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written to the Connection's socket but one of these background threads then closes the socket.
> Here the logs show a reply being sent on a Connection with a UID  (unique ID) of 69.  The message is never received because, as you can see, the other node closes the socket for Connection with UID 69.
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled Waiting Message Processor 3> tid=0x60] BRUCE: sending DLockReleaseReplyMessage (serviceName=__PRLS; replyCode=OK; sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005; processorId=212756) to [rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing Connection, uid=69 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I don't think Connection termination should be invoking removeEndpoint at all.  Endpoints should only be removed in response to membership changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)