Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/10/01 10:54:00 UTC

[jira] [Commented] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken

    [ https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205414#comment-17205414 ] 

Stephan Ewen commented on FLINK-19249:
--------------------------------------

Digging into this a bit more, I think you are right. This is a corner case caused by problematic network environments.

Let's label this as an improvement where Flink could try to do something on the application layer to detect network issues when the kernel takes too long.

But we are still missing a good proposal for how to do that.
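
As a starting point for discussion only, here is a minimal sketch of what application-layer detection could look like: track when data was last received on a channel and treat the channel as suspect once nothing has arrived for a configured timeout. The class name ChannelLivenessMonitor, its methods, and the 30 second timeout are hypothetical and are not part of Flink or Netty. The sketch also assumes that both sides exchange periodic heartbeats, so that a healthy but idle channel is never silent for longer than the timeout.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical application-layer liveness check for a single network channel.
 * Illustrative only: this is not an existing Flink or Netty class.
 */
final class ChannelLivenessMonitor {

    private final long timeoutMillis;
    private final LongSupplier clock;      // injected so the logic can be tested deterministically
    private volatile long lastReadMillis;  // time of the most recent data or heartbeat from the peer

    ChannelLivenessMonitor(long timeoutMillis, LongSupplier clock) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastReadMillis = clock.getAsLong();
    }

    /** Call whenever any data or heartbeat arrives from the peer. */
    void onRead() {
        lastReadMillis = clock.getAsLong();
    }

    /** Returns true if nothing has arrived from the peer for longer than the timeout. */
    boolean isSuspectedDead() {
        return clock.getAsLong() - lastReadMillis > timeoutMillis;
    }
}
```

A periodic task on each side could call isSuspectedDead() (for example with new ChannelLivenessMonitor(30_000L, System::currentTimeMillis)) and fail the channel proactively instead of waiting for the kernel to report the broken connection. Netty's IdleStateHandler offers similar idle detection and might be a more natural fit, but whether and how it could be wired into Flink's network stack is exactly the open question.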

> Job would wait sometime(~10 min) before failover if some connection broken
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19249
>                 URL: https://issues.apache.org/jira/browse/FLINK-19249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: Congxian Qiu(klion26)
>            Priority: Critical
>             Fix For: 1.12.0
>
>
> {quote}I encountered this error on 1.7; after going through the master code, I think the problem is still there.
> {quote}
> When the network environment is unreliable, the connection between the server and the client may be dropped unexpectedly. After the disconnection, the server receives an IOException such as the one below:
> {code:java}
> java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>  at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>  at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>  at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
>  at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The server then releases the view reader.
> But the job does not fail until the downstream side detects the disconnection via {{channelInactive}}, which happens much later (~10 min). In the meantime, the job can still process data, but the broken channel cannot transfer any data or events, so snapshots fail during this period. This causes the job to replay a large amount of data after failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)