You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Bruce J Schuchardt (Jira)" <ji...@apache.org> on 2020/06/10 18:55:00 UTC

[jira] [Resolved] (GEODE-8238) message loss during shutdown in Shutdown Hook when JVM exits

     [ https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce J Schuchardt resolved GEODE-8238.
---------------------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed

> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
>                 Key: GEODE-8238
>                 URL: https://issues.apache.org/jira/browse/GEODE-8238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>            Reporter: Bruce J Schuchardt
>            Priority: Major
>             Fix For: 1.14.0
>
>
> In a test I was running a JVM was told to exit and Geode's Shutdown Hook initiated cache shutdown.  This thread hung once in a while either waiting for a reply to a release of a distributed lock or for a reply to a region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel and it's due to the changes for GEODE-7727 ("modify sender thread to detect relese of connection").  Those changes cause the P2P Handshake thread to stay active reading from a shared Connection socket.  Unfortunately, when this thread eventually exits it is invoking removeEndpoint, which closes all the other connections to the other node.  This is causing messages to be lost.
> Here's an example:
> one node (19919) fails to form a connection and invokes removeEndpoint for the other node (23898)
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated, causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 isReceiver=false
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written to the Connection's socket but one of these background threads then closes the socket.
> Here the logs show a reply being sent on a Connection with a UID  (unique ID) of 69.  The message is never received because, as you can see, the other node closes the socket for Connection with UID 69.
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled Waiting Message Processor 3> tid=0x60] BRUCE: sending DLockReleaseReplyMessage (serviceName=__PRLS; replyCode=OK; sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005; processorId=212756) to [rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing Connection, uid=69 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> 	at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> 	at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> 	at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> 	at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> 	at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> 	at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I don't think Connection termination should be invoking removeEndpoint at all.  Endpoints should only be removed in response to membership changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)