Posted to issues@flink.apache.org by "Dong Lin (Jira)" <ji...@apache.org> on 2023/03/31 08:19:00 UTC

[jira] [Created] (FLINK-31681) Network connection timeout between operators should trigger either network re-connection or job failover

Dong Lin created FLINK-31681:
--------------------------------

             Summary: Network connection timeout between operators should trigger either network re-connection or job failover
                 Key: FLINK-31681
                 URL: https://issues.apache.org/jira/browse/FLINK-31681
             Project: Flink
          Issue Type: Bug
            Reporter: Dong Lin


If a network connection error occurs between two operators, the upstream operator may log the error message shown below in PartitionRequestQueue#handleException and then close the connection. When this happens, the Flink job may hang indefinitely: it neither completes nor fails.

To avoid this, we can either let the upstream operator reconnect to the downstream operator, or trigger job failover so that the problem surfaces and users can take corrective action promptly.

org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors#NativeIOException: writeAccess(...) failed: Connection timed out.
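
The two options proposed above could also be combined: retry the connection a bounded number of times, then fail the job instead of leaving it stuck. The following is only a minimal sketch of that decision logic. ConnectionErrorPolicy, Outcome, and onConnectionError are hypothetical names invented for illustration; they are not part of Flink or Netty, and the real change would have to hook into PartitionRequestQueue#handleException and Flink's failover mechanism.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical policy object (not a Flink class): decides what to do after a
 * network connection error between two operators.
 */
final class ConnectionErrorPolicy {

    /** Possible results of handling a connection error. */
    enum Outcome { RECONNECTED, FAIL_JOB }

    private final int maxReconnectAttempts;

    ConnectionErrorPolicy(int maxReconnectAttempts) {
        if (maxReconnectAttempts < 0) {
            throw new IllegalArgumentException("maxReconnectAttempts must be >= 0");
        }
        this.maxReconnectAttempts = maxReconnectAttempts;
    }

    /**
     * Calls the supplied reconnect action up to maxReconnectAttempts times.
     * The action is assumed to return true once the connection is restored.
     * If every attempt fails, FAIL_JOB is returned so that the caller can
     * trigger job failover rather than silently closing the connection.
     */
    Outcome onConnectionError(BooleanSupplier reconnect) {
        for (int attempt = 0; attempt < maxReconnectAttempts; attempt++) {
            if (reconnect.getAsBoolean()) {
                return Outcome.RECONNECTED;
            }
        }
        return Outcome.FAIL_JOB;
    }
}
```

With maxReconnectAttempts set to 0, the sketch degenerates to the pure failover option: any connection error immediately yields FAIL_JOB.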



--
This message was sent by Atlassian Jira
(v8.20.10#820010)