Posted to issues@flink.apache.org by "Dong Lin (Jira)" <ji...@apache.org> on 2023/04/12 08:46:00 UTC

[jira] [Commented] (FLINK-31681) Network connection timeout between operators should trigger either network re-connection or job failover

    [ https://issues.apache.org/jira/browse/FLINK-31681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711256#comment-17711256 ] 

Dong Lin commented on FLINK-31681:
----------------------------------

I was not able to reproduce this issue myself, and users no longer report it. I will close this issue for now.

> Network connection timeout between operators should trigger either network re-connection or job failover
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31681
>                 URL: https://issues.apache.org/jira/browse/FLINK-31681
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.15.1
>            Reporter: Dong Lin
>            Priority: Major
>
> If a network connection error occurs between two operators, the upstream operator may log the following error message in the method PartitionRequestQueue#handleException and subsequently close the connection:
> org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors#NativeIOException: writeAccess(...) failed: Connection timed out.
> When this happens, the Flink job may become stuck without completing or failing.
> To avoid this issue, we can either allow the upstream operator to reconnect with the downstream operator, or enable job failover so that users can take corrective action promptly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)