You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Bill Burcham (Jira)" <ji...@apache.org> on 2021/01/29 15:49:00 UTC

[jira] [Commented] (GEODE-8817) server hangs in cache close with ssl enabled due to active client connection; client side (CacheClientUpdater.close()) is hung in SSLSocketImpl$AppInputStream.deplete()

    [ https://issues.apache.org/jira/browse/GEODE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274531#comment-17274531 ] 

Bill Burcham commented on GEODE-8817:
-------------------------------------

This hang happens on 1.14.0 (currently on the {{develop}} branch). It does not occur on the latest {{support/1.13}} branch.

Bisection on the {{develop}} branch shows the hang happens on the third commit on that branch after the {{support/1.13}} branch was cut. That commit {{02839eae79ae078c30a422b1b0ff3c23866ccdbd}} was one of three on May 4, 2020—the day {{support/1.13}} was cut. I didn't bother to test either of the previous two commits.

The hang is still present as of the January 8, 2021 commit {{762db7e86d3969205e2729e4900ff997ca659b74}}

Oddly the hang is _not_ happening on recent commits on the {{support/1.13}} branch! How can this be? (it's surprising because, if the problem was introduced before {{support/1.13}} was cut, as it seems to have been, then {{support/1.13}} should exhibit the problem)

Bisection on {{support/1.13}} shows that commit {{32a8a55df4c0a5b8b1e5cfb746070804cbce5e0a}} "fixes" this hang. Paradoxically, that commit made to {{support/1.13}} on November 6, 2020:

{code}
32a8a55df4c0a5b8b1e5cfb746070804cbce5e0a GEODE-8652 NIOSslEngine.close() bypasses locks cherry-picked from af267c005a63317cbb8528cdb38eccf6a8747818
{code}

…is addressing a hang in P2P messaging, whereas this ticket pertains to a hang in client-server messaging. Also paradoxical is that that commit was cherry-picked from the {{develop}} (at that time 1.14.0) branch—but that commit on {{develop}}/1.14.0 does _not_ fix the hang there! To summarize: *the fix for GEODE-8652 fixes the hang on {{support/1.13}} but not on {{develop}}/1.14.0*.

How can this be? All I can think of is that there is some _interaction_ between the GEODE-8652 fix that changes the timing (of the server-side of the client-server code) in a way that, due to _other differences_ between the branches, causes the test to hang reliably on {{develop}} but not on {{support/1.13}}. This is not a comforting answer.

> server hangs in cache close with ssl enabled due to active client connection; client side (CacheClientUpdater.close()) is hung in SSLSocketImpl$AppInputStream.deplete()
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8817
>                 URL: https://issues.apache.org/jira/browse/GEODE-8817
>             Project: Geode
>          Issue Type: Bug
>          Components: client/server, security
>    Affects Versions: 1.14.0
>            Reporter: Bill Burcham
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: blocks-1.14.0​
>
> A proprietary TLS/SSL-enabled application encountered a network partition. A server hangs in cache close due to active client connection; client side ({{CacheClientUpdater.close()}}) is hung in {{SSLSocketImpl$AppInputStream.deplete()}}
> The configuration is:
> {noformat}
> ==========================================
> losingSide              |    survivingSide
> ==========================================
> 11110                   |    10627
> 11115                   |    10632
> ------------------------------------------
> 11139                   |    10655
>                         |    10662
> ------------------------------------------
> {noformat}
> The stuck threads were stuck in sun's SSL code. Geode's client/Server framework uses old I/O and that was also part of where they were stuck. If the clients had closed their connections to the server then it would not have been stuck here. But the server shutdown shouldn't hang because of client that refuses to disconnect.
> The Geode client-side of the connection is hung here:
> {code:java}
> \[warn 2020/11/06 14:56:56.577 PST <ThreadsMonitor> tid=0x18] Thread <50> (0x32) that was executed at <06 Nov 2020 14:55:43 PST> has been stuck for <72.81 seconds> and number of thread monitor iteration <1>
> Thread Name <poolTimer-brclient-7> state <BLOCKED>
> Waiting on <su...@3a9b72d3>
> Owned By <Cache Client Updater Thread  on 10.32.108.224(bridgep2_host2_10627:10627)<v3>:41003 port 27636> with ID <43>
> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
> Monitored metric <ResourceManagerStats.numThreadsStuck>
> Thread stack:
> sun.security.ssl.SSLSocketImpl$AppInputStream.deplete(SSLSocketImpl.java:1016)
> sun.security.ssl.SSLSocketImpl$AppInputStream.access$100(SSLSocketImpl.java:816)
> sun.security.ssl.SSLSocketImpl.bruteForceCloseInput(SSLSocketImpl.java:702)
> sun.security.ssl.SSLSocketImpl.duplexCloseOutput(SSLSocketImpl.java:553)
> sun.security.ssl.SSLSocketImpl.close(SSLSocketImpl.java:485)
> org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.close(CacheClientUpdater.java:546)
> org.apache.geode.cache.client.internal.QueueConnectionImpl.internalDestroy(QueueConnectionImpl.java:112)
> org.apache.geode.cache.client.internal.QueueManagerImpl.endpointCrashed(QueueManagerImpl.java:379)
> org.apache.geode.cache.client.internal.QueueManagerImpl.connectionCrashed(QueueManagerImpl.java:357)
> org.apache.geode.cache.client.internal.QueueConnectionImpl.destroy(QueueConnectionImpl.java:88)
> org.apache.geode.cache.client.internal.OpExecutorImpl.handleException(OpExecutorImpl.java:645)
> org.apache.geode.cache.client.internal.OpExecutorImpl.handleException(OpExecutorImpl.java:504)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:334)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:303)
> org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:839)
> org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:38)
> org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
> org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1329)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:279)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> Lock owner thread stack
> java.net.SocketInputStream.socketRead0(Native Method)
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> java.net.SocketInputStream.read(SocketInputStream.java:171)
> java.net.SocketInputStream.read(SocketInputStream.java:141)
> sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:475)
> sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:469)
> sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:69)
> sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1228)
> sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:75)
> sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:915)
> org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:830)
> org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:677)
> org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1140)
> org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.processMessages(CacheClientUpdater.java:1591)
> org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.run(CacheClientUpdater.java:483)
> {code}
> From the Geode log we can see the expected exceptions for messages to losing side (Operation not permitted) after the network is dropped:
> {code:java}
> [warn 2020/11/06 14:50:20.865 PST <Geode Failure Detection thread 3> tid=0x9b] Unable to send message to 10.32.110.94(bridgep1_host1_11115:11115)<v4>:41001
> java.io.IOException: Operation not permitted (sendto failed)
>         at java.net.PlainDatagramSocketImpl.send(Native Method)
>         at java.net.DatagramSocket.send(DatagramSocket.java:693)
>         at org.jgroups.protocols.UDP._send(UDP.java:224)
>         at org.jgroups.protocols.UDP.sendUnicast(UDP.java:215)
>         at org.jgroups.protocols.TP.sendToSingleMember(TP.java:1985)
>         at org.jgroups.protocols.TP.doSend(TP.java:1962)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.doSend(Transport.java:85)
>         at org.jgroups.protocols.TP.send(TP.java:1948)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.Transport._send(Transport.java:57)
>         at org.jgroups.protocols.TP.down(TP.java:1515)
>         at org.jgroups.stack.Protocol.down(Protocol.java:439)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.down(StatRecorder.java:87)
>         at org.jgroups.protocols.UNICAST3.down(UNICAST3.java:646)
>         at org.jgroups.protocols.FlowControl.down(FlowControl.java:347)
>         at org.jgroups.protocols.FRAG2.down(FRAG2.java:136)
>         at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1039)
>         at org.jgroups.JChannel.down(JChannel.java:790)
>         at org.jgroups.JChannel.send(JChannel.java:426)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.send(JGroupsMessenger.java:819)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.send(JGroupsMessenger.java:670)
>         at org.apache.geode.distributed.internal.membership.gms.fd.GMSHealthMonitor.doCheckMember(GMSHealthMonitor.java:504)
>         at org.apache.geode.distributed.internal.membership.gms.fd.GMSHealthMonitor.lambda$checkMember$1(GMSHealthMonitor.java:455)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)