You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by Loredana Radulescu Ivanoff <lr...@tibco.com> on 2018/11/27 20:03:08 UTC

Re: Cluster freeze with SSL enabled and JDK 11

Hello again,

I haven't been able to solve this issue on my own, so I'm hoping you'd be
able to take another look.

To recap: only with Java 11 and TLS enabled, the second node I bring in the
cluster never starts up, and remains stuck at "Still waiting for initial
partition map exchange". The first nodes  keeps logging "Unable to await
partitions release latch within timeout". To me, this looks like an Ignite
issue, and no matter what causes the situation (arguably in this case an
SSL error), there should be a more elegant exit out of it, i.e. the second
node should give up after a while, if there isn't a better way to retry and
achieve successful communication. The two nodes are able to communicate,
and increasing various network timeouts/failure detection timeout does not
help.

Previously it was mentioned that the Ignite unit test did not show a repro.
I suggest running a test that uses two different machines, because when I
run the nodes on the same machine, I do not get a repro either.

Attaching here logging from the two nodes including SSL messages.

Is Ignite support for Java 11 going to be available before Oracle ends free
commercial support for Java 1.8 in Jan 2019?

Thank you.

On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> I have tried to run the test with protocol "TLSv1.2", didn't see any
> difference.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> ср, 24 окт. 2018 г. в 20:23, Loredana Radulescu Ivanoff <
> lradules@tibco.com>:
>
>> Hello again,
>>
>> I am working on getting the full SSL logs over to you, but I have one
>> more question in between: TLS v1.3 is enabled by default in JDK 11, and my
>> app is using TLS v1.2 specifically. There's a known issue that's recently
>> addressed by the JDK related to TLS v1.3 half close policy, details here:
>> https://bugs.java.com/view_bug.do?bug_id=8207009
>>
>> Would you be able to confirm whether your SSL test runs successfully when
>> the connecting client/server use TLS v.12 specifically ?
>>
>> FYI, I have tried specifically disabling TLS v1.3 using both the
>> "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system
>> properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on
>> both sides as indicated here -
>> https://bugs.java.com/view_bug.do?bug_id=8208526
>>
>> Based on my explorations so far, I think this may be a JDK issue
>> (specifically in the JSSE provider) that has not been addressed yet. Either
>> way, do you think three is anything that could be done in Ignite to
>> explicitly close the connection on both sides in a scenario like this ?
>>
>> What I can safely share on the SSL logs so far is this (both nodes get
>> stuck, node 1 in failing to close the SSL connection, node 2 in waiting for
>> partition exchange)
>>
>> Node 1:
>>
>> "2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of
>> SSLEngine
>> javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23
>> 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing
>> inbound before receiving peer's close_notify (
>> "throwable" : {
>>   javax.net.ssl.SSLException: closing inbound before receiving peer's
>> close_notify
>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129)
>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
>>   at
>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
>>   at
>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
>>   at
>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
>>   at
>> java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724)
>>   at
>> org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185)
>>   at
>> org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002)
>>   at
>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
>>   at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>   at java.base/java.lang.Thread.run(Thread.java:834)}
>>
>> )
>> 2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated session:
>> Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
>> DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable
>> to correctly close inbound data stream (will ignore) "
>>
>>
>> Server 2
>>
>> "javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12
>> application_data, length = 1
>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194
>> PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION (
>>   0000: 01                                                 .
>> )
>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194
>> PDT|SSLSocketOutputRecord.java:324|Raw write (
>>   0000: 17 03 03 00 19 00 00 00   00 00 00 00 A1 45 E5 C8
>> .............E..
>>   0010: 27 3D EE E2 89 4A 0F DD   B6 9A 2A A8 5F 6B        '=...J....*._k
>> )
>> WARN 14:18:10,585-0700 [*Initialization*]
>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>> initial partition map exchange "
>>
>> On Mon, Oct 22, 2018 at 12:31 AM Ilya Kasnacheev <
>> ilya.kasnacheev@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I would suggest regular (or DEBUG) Ignite logs + SSL debug logs.
>>>
>>>
>>> https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> пт, 19 окт. 2018 г. в 20:58, Loredana Radulescu Ivanoff <
>>> lradules@tibco.com>:
>>>
>>>> Definitely - do you want debug logs, and if yes, should I enable them
>>>> on all Ignite packages, or just for certain ones? Any other diagnostic
>>>> flags that I should set?
>>>>
>>>> On Fri, Oct 19, 2018 at 7:48 AM Ilya Kasnacheev <
>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> Apache Ignite does not claim support JDK 11 at this moment. However, I
>>>>> was able to run SSL test (TcpDiscoverySslTrustedUntrustedTest) and it
>>>>> turned out mostly fine.
>>>>>
>>>>> More info is needed from your side, such as full instances logs.
>>>>>
>>>>> Regards,
>>>>> --
>>>>> Ilya Kasnacheev
>>>>>
>>>>>
>>>>> пт, 19 окт. 2018 г. в 0:56, Loredana Radulescu Ivanoff <
>>>>> lradules@tibco.com>:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I can consistently reproduce this issue with Ignite 2.6.0, JDK 11 and
>>>>>> SSL enabled:
>>>>>>
>>>>>>
>>>>>>    - the second node that I bring up joins, and then shortly after
>>>>>>    freezes and prints this message every minute:
>>>>>>
>>>>>> "WARN ...[*Initialization*]
>>>>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>>>>>> initial partition map exchange"
>>>>>>
>>>>>>
>>>>>>    - once the second node joins, the first node starts experiencing
>>>>>>    very frequent 100% CPU spikes; these are the messages I see:
>>>>>>
>>>>>> WARN 2018-10-18T13:50:52,728-0700 []
>>>>>> communication.tcp.TcpCommunicationSpi: Communication SPI session write
>>>>>> timed out (consider increasing 'socketWriteTimeout' configuration property)
>>>>>> [remoteAddr=/10.100.36.82:51620, writeTimeout=15000]
>>>>>> WARN 2018-10-18T13:50:52,737-0700 []
>>>>>> communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session
>>>>>> gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect SSL
>>>>>> engine status after closeOutbound call [status=OK,
>>>>>> handshakeStatus=NEED_WRAP,
>>>>>> WARN 2018-10-18T13:51:01,441-0700 []
>>>>>> dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await partitions
>>>>>> release latch within timeout: ServerLatch [permits=1,
>>>>>> pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], super=CompletableLatch
>>>>>> [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]]
>>>>>>
>>>>>> Other observations:
>>>>>>
>>>>>> I can reproduce this every time I start the nodes, and it doesn't
>>>>>> matter which node comes up first.
>>>>>>
>>>>>>
>>>>>> The issue goes away if I disable SSL.
>>>>>>
>>>>>>
>>>>>> Increasing the socketWriteTimeout, networkTimeout or the
>>>>>> failureDetectionTimeout does not help.
>>>>>>
>>>>>> It seems to be happening only with JDK 11, and not with JDK 8.
>>>>>>
>>>>>>
>>>>>> Do you have any suggestions/known issues about this?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Loredana
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>

Re: Cluster freeze with SSL enabled and JDK 11

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

I have tried running SSL tests again and they seem to pass (only one test
fails for some different reason)

Can you try running those 2 nodes in stand-alone processes, see if problem
persists? I can see that you have SSL-enabled Tomcat running in same VM,
which I imaging could interfere with Ignite's SSL.

Note that you will need to do some load (such as REST cache operations) to
see if communication indeed works (or doesn't).

Regards,
-- 
Ilya Kasnacheev


ср, 28 нояб. 2018 г. в 16:01, Loredana Radulescu Ivanoff <lradules@tibco.com
>:

> Hello again,
>
> I haven't been able to solve this issue on my own, so I'm hoping you'd be
> able to take another look.
>
> To recap: only with Java 11 and TLS enabled, the second node I bring in
> the cluster never starts up, and remains stuck at "Still waiting for
> initial partition map exchange". The first nodes  keeps logging "Unable to
> await partitions release latch within timeout". To me, this looks like an
> Ignite issue, and no matter what causes the situation (arguably in this
> case an SSL error), there should be a more elegant exit out of it, i.e. the
> second node should give up after a while, if there isn't a better way to
> retry and achieve successful communication. The two nodes are able to
> communicate, and increasing various network timeouts/failure detection
> timeout does not help.
>
> Previously it was mentioned that the Ignite unit test did not show a
> repro. I suggest running a test that uses two different machines, because
> when I run the nodes on the same machine, I do not get a repro either.
>
> Attaching here logging from the two nodes including SSL messages.
>
> Is Ignite support for Java 11 going to be available before Oracle ends
> free commercial support for Java 1.8 in Jan 2019?
>
> Thank you.
>
> On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <il...@gmail.com>
> wrote:
>
>> Hello!
>>
>> I have tried to run the test with protocol "TLSv1.2", didn't see any
>> difference.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> ср, 24 окт. 2018 г. в 20:23, Loredana Radulescu Ivanoff <
>> lradules@tibco.com>:
>>
>>> Hello again,
>>>
>>> I am working on getting the full SSL logs over to you, but I have one
>>> more question in between: TLS v1.3 is enabled by default in JDK 11, and my
>>> app is using TLS v1.2 specifically. There's a known issue that's recently
>>> addressed by the JDK related to TLS v1.3 half close policy, details here:
>>> https://bugs.java.com/view_bug.do?bug_id=8207009
>>>
>>> Would you be able to confirm whether your SSL test runs successfully
>>> when the connecting client/server use TLS v.12 specifically ?
>>>
>>> FYI, I have tried specifically disabling TLS v1.3 using both the
>>> "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system
>>> properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on
>>> both sides as indicated here -
>>> https://bugs.java.com/view_bug.do?bug_id=8208526
>>>
>>> Based on my explorations so far, I think this may be a JDK issue
>>> (specifically in the JSSE provider) that has not been addressed yet. Either
>>> way, do you think three is anything that could be done in Ignite to
>>> explicitly close the connection on both sides in a scenario like this ?
>>>
>>> What I can safely share on the SSL logs so far is this (both nodes get
>>> stuck, node 1 in failing to close the SSL connection, node 2 in waiting for
>>> partition exchange)
>>>
>>> Node 1:
>>>
>>> "2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of
>>> SSLEngine
>>> javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23
>>> 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing
>>> inbound before receiving peer's close_notify (
>>> "throwable" : {
>>>   javax.net.ssl.SSLException: closing inbound before receiving peer's
>>> close_notify
>>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129)
>>>   at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
>>>   at
>>> java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
>>>   at
>>> java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724)
>>>   at
>>> org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185)
>>>   at
>>> org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002)
>>>   at
>>> org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
>>>   at
>>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>>>   at java.base/java.lang.Thread.run(Thread.java:834)}
>>>
>>> )
>>> 2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated
>>> session:  Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
>>> DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable
>>> to correctly close inbound data stream (will ignore) "
>>>
>>>
>>> Server 2
>>>
>>> "javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12
>>> application_data, length = 1
>>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION (
>>>   0000: 01                                                 .
>>> )
>>> javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23
>>> 14:18:09.194 PDT|SSLSocketOutputRecord.java:324|Raw write (
>>>   0000: 17 03 03 00 19 00 00 00   00 00 00 00 A1 45 E5 C8
>>> .............E..
>>>   0010: 27 3D EE E2 89 4A 0F DD   B6 9A 2A A8 5F 6B        '=...J....*._k
>>> )
>>> WARN 14:18:10,585-0700 [*Initialization*]
>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>>> initial partition map exchange "
>>>
>>> On Mon, Oct 22, 2018 at 12:31 AM Ilya Kasnacheev <
>>> ilya.kasnacheev@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> I would suggest regular (or DEBUG) Ignite logs + SSL debug logs.
>>>>
>>>>
>>>> https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> пт, 19 окт. 2018 г. в 20:58, Loredana Radulescu Ivanoff <
>>>> lradules@tibco.com>:
>>>>
>>>>> Definitely - do you want debug logs, and if yes, should I enable them
>>>>> on all Ignite packages, or just for certain ones? Any other diagnostic
>>>>> flags that I should set?
>>>>>
>>>>> On Fri, Oct 19, 2018 at 7:48 AM Ilya Kasnacheev <
>>>>> ilya.kasnacheev@gmail.com> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Apache Ignite does not claim support JDK 11 at this moment. However,
>>>>>> I was able to run SSL test (TcpDiscoverySslTrustedUntrustedTest) and it
>>>>>> turned out mostly fine.
>>>>>>
>>>>>> More info is needed from your side, such as full instances logs.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Ilya Kasnacheev
>>>>>>
>>>>>>
>>>>>> пт, 19 окт. 2018 г. в 0:56, Loredana Radulescu Ivanoff <
>>>>>> lradules@tibco.com>:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I can consistently reproduce this issue with Ignite 2.6.0, JDK 11
>>>>>>> and SSL enabled:
>>>>>>>
>>>>>>>
>>>>>>>    - the second node that I bring up joins, and then shortly after
>>>>>>>    freezes and prints this message every minute:
>>>>>>>
>>>>>>> "WARN ...[*Initialization*]
>>>>>>> processors.cache.GridCachePartitionExchangeManager: Still waiting for
>>>>>>> initial partition map exchange"
>>>>>>>
>>>>>>>
>>>>>>>    - once the second node joins, the first node starts experiencing
>>>>>>>    very frequent 100% CPU spikes; these are the messages I see:
>>>>>>>
>>>>>>> WARN 2018-10-18T13:50:52,728-0700 []
>>>>>>> communication.tcp.TcpCommunicationSpi: Communication SPI session write
>>>>>>> timed out (consider increasing 'socketWriteTimeout' configuration property)
>>>>>>> [remoteAddr=/10.100.36.82:51620, writeTimeout=15000]
>>>>>>> WARN 2018-10-18T13:50:52,737-0700 []
>>>>>>> communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session
>>>>>>> gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect SSL
>>>>>>> engine status after closeOutbound call [status=OK,
>>>>>>> handshakeStatus=NEED_WRAP,
>>>>>>> WARN 2018-10-18T13:51:01,441-0700 []
>>>>>>> dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await partitions
>>>>>>> release latch within timeout: ServerLatch [permits=1,
>>>>>>> pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], super=CompletableLatch
>>>>>>> [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]]
>>>>>>>
>>>>>>> Other observations:
>>>>>>>
>>>>>>> I can reproduce this every time I start the nodes, and it doesn't
>>>>>>> matter which node comes up first.
>>>>>>>
>>>>>>>
>>>>>>> The issue goes away if I disable SSL.
>>>>>>>
>>>>>>>
>>>>>>> Increasing the socketWriteTimeout, networkTimeout or the
>>>>>>> failureDetectionTimeout does not help.
>>>>>>>
>>>>>>> It seems to be happening only with JDK 11, and not with JDK 8.
>>>>>>>
>>>>>>>
>>>>>>> Do you have any suggestions/known issues about this?
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Loredana
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>