Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2021/10/18 13:45:00 UTC

[jira] [Updated] (FLINK-24156) BlobServer crashes due to SocketTimeoutException in Java 11

     [ https://issues.apache.org/jira/browse/FLINK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler updated FLINK-24156:
-------------------------------------
    Fix Version/s: 1.14.1

> BlobServer crashes due to SocketTimeoutException in Java 11
> -----------------------------------------------------------
>
>                 Key: FLINK-24156
>                 URL: https://issues.apache.org/jira/browse/FLINK-24156
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.12.4, 1.13.2
>         Environment: Java 11, CentOS 7.6
>            Reporter: Ryan Scudellari
>            Assignee: Ryan Scudellari
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.14.1
>
>
> h3. Overview
> We have seen the BlobServer crash due to a *SocketTimeoutException* while running on JRE 11. This is likely caused by a [JDK bug present in JDK 11|https://bugs.openjdk.java.net/browse/JDK-8237858] (fixed in version 16) that erroneously throws _SocketTimeoutException_ when _ServerSocket.accept()_ is interrupted by any UNIX signal. The BlobServer calls _accept()_ when establishing connections with clients and is expected to block indefinitely. [The BlobServer currently shuts down when it catches a Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267]. We do not see this behavior when running the same steps in JRE 8.
> h3. Reproducing the issue
> To reproduce, send a _SIGPIPE_ to the BlobServer listener thread (its native thread ID, shown as _nid_ in the _jstack_ output). You need a Flink cluster running on JRE 11, with the _jps_ and _jstack_ tools available to find the relevant IDs.
> One-liner:
> {code:bash}
> kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint | cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' | xargs printf "%d")
> {code}
>  
>  # Run
> {code:bash}
> jstack [PID] | grep BLOB
> {code}
> where *PID* is the process ID of the job manager.
>  # Find the *nid=[HEX]* value and convert the HEX to decimal.
>  # Run
> {code:bash}
> kill -SIGPIPE [DNID]
> {code}
> where *DNID* is the converted decimal value of *HEX nid* from the previous step.
>  # Observe the following error in the job manager logs:
> {noformat}
> 2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR org.apache.flink.runtime.blob.BlobServer  - BLOB server stopped working. Shutting down
>   at java.base/java.net.PlainSocketImpl.socketAccept
>   at java.base/java.net.AbstractPlainSocketImpl.accept
>   at java.base/java.net.ServerSocket.implAccept
>   at java.base/java.net.ServerSocket.accept
>   at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
> 2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO  org.apache.flink.runtime.blob.BlobServer  - Stopped BLOB server at 0.0.0.0:6124
> {noformat}
> h3. Proposed Fix
> To protect ourselves from this JDK bug, we propose the workaround of catching _SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call indefinitely.
>  
> Thanks to [~bsanders-wf] for helping track this down.
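
The workaround proposed above can be sketched roughly as follows. This is a minimal illustration, not the actual Flink patch: the class name AcceptRetrySketch and the helper acceptIgnoringTimeouts are hypothetical, and the demo in main sets a short SO_TIMEOUT only to force real timeouts so that the retry path is exercised without sending signals.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptRetrySketch {

    /**
     * Hypothetical helper: blocks until a client connects.
     * A SocketTimeoutException is treated as spurious and accept() is retried,
     * on the assumption that the caller wants to block indefinitely.
     */
    public static Socket acceptIgnoringTimeouts(ServerSocket serverSocket) throws IOException {
        while (true) {
            try {
                return serverSocket.accept();
            } catch (SocketTimeoutException e) {
                // Possibly caused by JDK-8237858 (signal interrupting accept()); retry.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            // Demo only: a short timeout makes accept() throw SocketTimeoutException
            // several times before the client connects.
            server.setSoTimeout(50);
            int port = server.getLocalPort();

            Thread client = new Thread(() -> {
                try {
                    Thread.sleep(200);
                    new Socket(loopback, port).close();
                } catch (Exception ignored) {
                    // Not relevant for this sketch.
                }
            });
            client.start();

            try (Socket accepted = acceptIgnoringTimeouts(server)) {
                System.out.println("Accepted a connection despite earlier timeouts");
            }
            client.join();
        }
    }
}
```

Retrying unconditionally is only reasonable when no accept timeout is intended, which matches the description above that the BlobServer is expected to block indefinitely. Other IOExceptions still propagate, so a genuinely broken server socket is not masked.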


