You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Bruce J Schuchardt (Jira)" <ji...@apache.org> on 2020/06/10 18:55:00 UTC
[jira] [Resolved] (GEODE-8238) message loss during shutdown in
Shutdown Hook when JVM exits
[ https://issues.apache.org/jira/browse/GEODE-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce J Schuchardt resolved GEODE-8238.
---------------------------------------
Fix Version/s: 1.14.0
Resolution: Fixed
> message loss during shutdown in Shutdown Hook when JVM exits
> ------------------------------------------------------------
>
> Key: GEODE-8238
> URL: https://issues.apache.org/jira/browse/GEODE-8238
> Project: Geode
> Issue Type: Bug
> Components: membership, messaging
> Reporter: Bruce J Schuchardt
> Priority: Major
> Fix For: 1.14.0
>
>
> In a test I was running a JVM was told to exit and Geode's Shutdown Hook initiated cache shutdown. This thread hung once in a while either waiting for a reply to a release of a distributed lock or for a reply to a region-destroy message.
> I traced this down by adding some logging to TCPConduit and DirectChannel and it's due to the changes for GEODE-7727 ("modify sender thread to detect relese of connection"). Those changes cause the P2P Handshake thread to stay active reading from a shared Connection socket. Unfortunately, when this thread eventually exits it is invoking removeEndpoint, which closes all the other connections to the other node. This is causing messages to be lost.
> Here's an example:
> one node (19919) fails to form a connection and invokes removeEndpoint for the other node (23898)
> {noformat}
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@31e492f-45> tid=0x14f] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1661)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> The other node's shared/unordered connection was unexpectedly terminated, causing it to also invoke removeEndpoint():
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: asyncClose closing Connection, uid=4 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 isReceiver=false
> java.lang.Exception: stack trace
> at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:08.862 PDT <P2P handshake reader@2e1ef958-4> tid=0x36] BRUCE: removeEndpoint invoked for rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002 reason=SocketChannel.read returned EOF notifyDisconnect=true
> java.lang.Exception: stack trace
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:758)
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> An unordered message like a Reply can be lost in this case if it is written to the Connection's socket but one of these background threads then closes the socket.
> Here the logs show a reply being sent on a Connection with a UID (unique ID) of 69. The message is never received because, as you can see, the other node closes the socket for Connection with UID 69.
> {noformat}
> bridgegemfire4_23898/system.log: [info 2020/06/09 11:05:09.906 PDT <Pooled Waiting Message Processor 3> tid=0x60] BRUCE: sending DLockReleaseReplyMessage (serviceName=__PRLS; replyCode=OK; sender=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005; processorId=212756) to [rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire1_host1_19919:19919)<ec><v1>:41002(uid=69)]
> bridgegemfire1_19919/system.log: [info 2020/06/09 11:05:09.907 PDT <P2P handshake reader@2c6c307c-65> tid=0x1b0] BRUCE: asyncClose closing Connection, uid=69 shared=true ordered=false remoteAddr=rs-Awesome-14-1023a0i3xlarge-hydra-client-8(bridgegemfire4_host1_23898:23898)<ec><v13>:41005 isReceiver=true
> java.lang.Exception: stack trace
> at org.apache.geode.internal.tcp.Connection.asyncClose(Connection.java:833)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1338)
> at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1276)
> at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612)
> at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604)
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:851)
> at org.apache.geode.internal.tcp.ConnectionTable.removeEndpoint(ConnectionTable.java:751)
> at org.apache.geode.internal.tcp.Connection.close(Connection.java:1400)
> at org.apache.geode.internal.tcp.Connection.requestClose(Connection.java:1268)
> at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1619)
> at org.apache.geode.internal.tcp.Connection.run(Connection.java:1460)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
> I don't think Connection termination should be invoking removeEndpoint at all. Endpoints should only be removed in response to membership changes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)