Posted to issues@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2020/10/30 10:17:00 UTC

[jira] [Updated] (IGNITE-13643) Disable socket linger by default in TCPDiscoverySpi

     [ https://issues.apache.org/jira/browse/IGNITE-13643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Steshin updated IGNITE-13643:
--------------------------------------
    Description: 
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) can take about 5 seconds to close a socket. This delays node failure detection beyond the configured timeout: although failureDetectionTimeout is set to 1000 ms, node failure is detected in about 6.5 seconds on average.

This time gap was unearthed by a discovery integration test on ducktape [1]. The failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": "[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1034, "All detection delays (ms):": "[1034]", "Nodes failed": 1}
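
The results above can be compared against the configured timeout with a little arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and not part of Ignite. It subtracts the 1000 ms failure detection timeout from each measured detection time. The excess before the fix (about 5.1 s) is in line with the roughly 5 s socket close described above.

```java
// Hypothetical helper, not part of Ignite: shows how far the measured
// detection time exceeds the configured failure detection timeout.
class DetectionOverhead {
    /** Returns how many milliseconds the observed detection time exceeds the timeout. */
    static long overheadMs(long observedMs, long timeoutMs) {
        return observedMs - timeoutMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 1000; // failureDetectionTimeout used in the test

        // Values taken from the test results quoted in this issue.
        System.out.println("before fix: " + overheadMs(6140, timeoutMs) + " ms over the timeout"); // 5140
        System.out.println("after fix:  " + overheadMs(1034, timeoutMs) + " ms over the timeout"); // 34
    }
}
```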

There is a note that 'graceful' socket closing was introduced to work around a bug in OpenJDK 12 [2], but as far as I can see that bug has been fixed. There were also SSL issues such as [3] and [4].
Recent versions of various JDKs include fixes and support TLS 1.3 ([5] and [6]). OpenJDK 11 works well as far as I know.

I believe SSL is rarely used in discovery, while the linger slows down performance. Anyone affected by these issues can just enable soLinger or update the JDK. There is no reason to prolong failure detection by default.

[1] https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658
[3] https://issues.apache.org/jira/browse/IGNITE-12818
[4] https://issues.apache.org/jira/browse/IGNITE-11288
[5] https://bugs.openjdk.java.net/browse/JDK-8245468
[6] https://www.oracle.com/java/technologies/javase/8u261-relnotes.html


  was:
Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) takes about 5 seconds to close a socket. We should include socket linger in failureDetectionTimeout. This delays node failure detection beyond the configured timeout: although failureDetectionTimeout is set to 1000 ms, node failure is detected in about 6.5 seconds on average. Logging shows that the delay occurs on socket closing in IgniteUtils.closeQuiet(@Nullable Socket sock).


This time gap was unearthed by a discovery integration test on ducktape [1]. The failure detection timeout is set to 1000 ms.
Typical results before the fix for 1 node:
"Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": "[6140]", "Nodes failed": 1}

Typical results after the fix for 1 node:
"Detection of node(s) failure (ms)": 1004, "All detection delays (ms):": "[1004]", "Nodes failed": 1}


Suggestion: use forced closing, set soLinger=0, and do not wait for the rest of the socket I/O. We close the socket in TcpDiscoverySpi only after we have already waited for the target timeouts and consider the connection lost or invalid. We do not need to wait for any more traffic on the socket.
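
A minimal sketch of the forced close suggested above, using only the standard java.net API. This is not the actual Ignite change: the class and method names are hypothetical, and the loopback connection exists only to make the example self-contained. Socket.setSoLinger(true, 0) enables SO_LINGER with a zero timeout, so close() does not wait for unsent data, and a TCP connection is typically reset rather than shut down gracefully.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration, not Ignite code.
class ForcedSocketClose {
    /** Closes the socket without lingering; errors are ignored, as in a "quiet" close. */
    static void closeForcibly(Socket sock) {
        if (sock == null)
            return;

        try {
            // SO_LINGER on, timeout 0: close() returns without waiting for pending output.
            sock.setSoLinger(true, 0);
        }
        catch (IOException ignored) {
            // The socket may already be closed or broken; still attempt to close it.
        }

        try {
            sock.close();
        }
        catch (IOException ignored) {
            // Nothing more can be done with a socket that fails to close.
        }
    }

    /** Opens a loopback connection, force-closes the client side, returns the close time in ms. */
    static long demoCloseMillis() {
        try (ServerSocket srv = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(srv.getInetAddress(), srv.getLocalPort());

            try (Socket accepted = srv.accept()) {
                long start = System.nanoTime();

                closeForcibly(client);

                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                if (!client.isClosed())
                    throw new IllegalStateException("Socket should be closed");

                return elapsedMs;
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Forced close took " + demoCloseMillis() + " ms");
    }
}
```

The trade-off is that data still queued for sending is discarded, which is acceptable here only because the connection is already considered lost or invalid.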

There is a note that 'graceful' socket closing was introduced to work around a bug in OpenJDK 12 [2], but as far as I can see it has been fixed.
However, we should take into account known issues with SSL connections, where linger might be necessary.


[1] https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
[2] https://bugs.openjdk.java.net/browse/JDK-8219658

        Summary: Disable socket linger by default in TCPDiscoverySpi  (was: Fix long closing of the socket in ServerImpl (TcpDiscoverySpi))

> Disable socket linger by default in TCPDiscoverySpi
> ---------------------------------------------------
>
>                 Key: IGNITE-13643
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13643
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, IgniteUtils.closeQuiet(@Nullable Socket sock) can take about 5 seconds to close a socket. This delays node failure detection beyond the configured timeout: although failureDetectionTimeout is set to 1000 ms, node failure is detected in about 6.5 seconds on average.
> This time gap was unearthed by a discovery integration test on ducktape [1]. The failure detection timeout is set to 1000 ms.
> Typical results before the fix for 1 node:
> "Detection of node(s) failure (ms)": 6140, "All detection delays (ms):": "[6140]", "Nodes failed": 1}
> Typical results after the fix for 1 node:
> "Detection of node(s) failure (ms)": 1034, "All detection delays (ms):": "[1034]", "Nodes failed": 1}
> There is a note that 'graceful' socket closing was introduced to work around a bug in OpenJDK 12 [2], but as far as I can see that bug has been fixed. There were also SSL issues such as [3] and [4].
> Recent versions of various JDKs include fixes and support TLS 1.3 ([5] and [6]). OpenJDK 11 works well as far as I know.
> I believe SSL is rarely used in discovery, while the linger slows down performance. Anyone affected by these issues can just enable soLinger or update the JDK. There is no reason to prolong failure detection by default.
> [1] https://github.com/apache/ignite/blob/ignite-ducktape/modules/ducktests/tests/ignitetest/tests/discovery_test.py
> [2] https://bugs.openjdk.java.net/browse/JDK-8219658
> [3] https://issues.apache.org/jira/browse/IGNITE-12818
> [4] https://issues.apache.org/jira/browse/IGNITE-11288
> [5] https://bugs.openjdk.java.net/browse/JDK-8245468
> [6] https://www.oracle.com/java/technologies/javase/8u261-relnotes.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)