Posted to user-zh@flink.apache.org by crazy <24...@qq.com.INVALID> on 2023/03/06 06:23:25 UTC

Flink job TM "Connection timed out" exception

Hi all, one of our production jobs keeps failing over. The exception log is as follows:


2023-03-05 11:41:07,847 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to 'xxx/10.70.89.25:43923')
	at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out

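Every failure is reported by a process on machine A: switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @A (dataPort=26882). The load on machine A looks normal, and the job runs fine when no process is scheduled onto it, so we currently suspect the machine itself. How should we go about troubleshooting this? Thanks a lot.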







crazy
2463829830@qq.com
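
One way to confirm that the failures really cluster on a single machine is to tally the "switched from RUNNING to FAILED" transitions per host in the JobManager log. The following is only a minimal sketch: the regular expression is an assumption based on the single log line quoted above, and the function name count_failures_by_host is made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of an ExecutionGraph transition line, modeled on the one in this thread:
#   ... switched from RUNNING to FAILED on <container> @ <host> (dataPort=<port>).
FAILED_RE = re.compile(
    r"switched from RUNNING to FAILED on (?P<container>\S+) @ ?(?P<host>\S+) "
    r"\(dataPort=(?P<port>\d+)\)"
)


def count_failures_by_host(log_lines):
    """Count RUNNING -> FAILED transitions per TaskManager host (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample input: the log line from this thread, with the host masked as in the original.
sample = [
    "2023-03-05 11:41:07,847 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - "
    "Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED on "
    "container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).",
]

for host, failures in count_failures_by_host(sample).most_common():
    print(host, failures)
```

In practice the input would be the lines of the real JobManager log file; a host that dominates the tally across many failovers supports the suspicion about that machine.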




Re: Flink job TM "Connection timed out" exception

Posted by crazy <24...@qq.com.INVALID>.
OK, thanks.




crazy
2463829830@qq.com




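Re: Flink job TM "Connection timed out" exception

Posted by Shammon FY <zj...@gmail.com>.
Hi

Many things can cause connection failures, including hardware faults, OS-level problems, and server load. If you suspect load, you could build a small cluster from a few servers plus the suspect one, submit some jobs while keeping the load on that server low, and observe how the jobs behave.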

Best,
Shammon
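
Before setting up a test cluster, a quick reachability check from other hosts to the suspect machine may help narrow things down. The sketch below is an illustration only: probe_tcp and count_failed_probes are hypothetical helper names, and the check only shows whether a TCP connection can be established. It does not reproduce the read timeout on an already established connection that appears in the stack trace.

```python
import socket
import time


def probe_tcp(host, port, timeout_sec=3.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def count_failed_probes(host, port, attempts=10, timeout_sec=3.0, pause_sec=1.0):
    """Probe repeatedly and return how many attempts failed."""
    failed = 0
    for _ in range(attempts):
        ok, _elapsed, _err = probe_tcp(host, port, timeout_sec)
        if not ok:
            failed += 1
        time.sleep(pause_sec)
    return failed


if __name__ == "__main__":
    # Demo against a throwaway local listener so the sketch runs anywhere.
    # For a real check, use the suspect TaskManager's address and its dataPort.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(16)
    demo_port = server.getsockname()[1]
    print("failed probes:", count_failed_probes("127.0.0.1", demo_port, attempts=3, pause_sec=0.1))
    server.close()
```

Running the same probe from several peers against both the suspect machine and a healthy one gives a baseline to compare against.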


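Re: Flink job TM "Connection timed out" exception

Posted by crazy <24...@qq.com.INVALID>.
The error log is the same as in the issue below. Is this the same problem?
https://issues.apache.org/jira/browse/FLINK-19925


That issue says "high cpu usage or high network pressure" on the server can cause this. How high do CPU usage and network pressure have to be to count as high?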




crazy
2463829830@qq.com




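Re: Flink job TM "Connection timed out" exception

Posted by Yuxin Tan <ta...@gmail.com>.
I wouldn't recommend that, because it would mask the underlying problem.

If you do have to tune retry counts or timeouts, quite a few options are involved, for example akka.tcp.timeout,
taskmanager.network.netty.client.connectTimeoutSec, and
taskmanager.network.retries. See [1] for details.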

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/

Best,
Yuxin
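
To see which of the options named above are already overridden in a deployment, the flat "key: value" lines of flink-conf.yaml can be scanned. This is a rough sketch rather than Flink's own parser: read_flink_conf and explicit_overrides are hypothetical names, and the values in sample_conf are illustrative only, not recommended settings.

```python
# Option names taken from the reply above.
KEYS_OF_INTEREST = (
    "akka.tcp.timeout",
    "taskmanager.network.netty.client.connectTimeoutSec",
    "taskmanager.network.retries",
)


def read_flink_conf(text):
    """Parse flat 'key: value' lines, ignoring blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def explicit_overrides(text, keys=KEYS_OF_INTEREST):
    """Return only the keys of interest that are explicitly set in the config text."""
    conf = read_flink_conf(text)
    return {key: conf[key] for key in keys if key in conf}


# Illustrative config text; these values are assumptions, not recommendations.
sample_conf = """
# network-related overrides
akka.tcp.timeout: 60 s
taskmanager.network.retries: 3
parallelism.default: 300
"""

print(explicit_overrides(sample_conf))
```

Any key missing from the result is running with its default, which the configuration page in [1] lists for the Flink version in use.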


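On Mon, Mar 6, 2023 at 14:41, crazy <24...@qq.com.invalid> wrote:

> Monitoring hasn't shown any problem with the machine so far. Could we mitigate this by increasing the retry count or the timeout? I'm not sure which parameters would need to be set.
>
> crazy
> 2463829830@qq.com
>
> ------------------ Original message ------------------
> From: "user-zh" <tanyuxinwork@gmail.com>;
> Sent: Monday, March 6, 2023, 2:33 PM
> To: "user-zh" <user-zh@flink.apache.org>;
>
> Subject: Re: Flink job TM "Connection timed out" exception
>
> "The job runs fine when no process is scheduled onto this machine": from that description, it does seem very likely that machine A is the problem.
>
> Check whether machine A's network, memory, and CPU metrics or monitoring look normal and whether anything differs from the other machines, for example the network parameter configuration, damaged memory, or abnormal processes or load.
>
> If it is a hardware problem, the system logs may contain errors. You can also use machine inspection tools such as dmesg and vmstat.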
>
> Best,
> Yuxin
>

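Re: Flink job TM "Connection timed out" exception issue

Posted by crazy <24...@qq.com.INVALID>.
Monitoring has not shown any problem with the machine so far. Could this be mitigated by increasing the retry count or the timeout duration? I am not sure which specific parameters would need to be set.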
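
Before tuning retries or timeouts, it may help to confirm from the JobManager log which hosts and which remote endpoints are actually involved in the failures. The following is a minimal sketch, not a Flink tool: the function name `tally_failures` and both regular expressions are assumptions, written only against the two log lines quoted in this thread, and they may need adjusting for other log layouts.

```python
import re
from collections import Counter

# Matches the state-change line quoted in this thread:
# "... switched from RUNNING to FAILED on <container> @ <host> (dataPort=<port>)"
FAILED_RE = re.compile(
    r"switched from RUNNING to FAILED on (?P<container>\S+) "
    r"@ ?(?P<host>\S+) \(dataPort=(?P<port>\d+)\)"
)
# Matches the remote endpoint named in the LocalTransportException message.
REMOTE_RE = re.compile(r"\(connection to '(?P<remote>[^']+)'\)")


def tally_failures(lines):
    """Count failures per failing host and per remote endpoint."""
    hosts, remotes = Counter(), Counter()
    for line in lines:
        failed = FAILED_RE.search(line)
        if failed:
            hosts[failed.group("host")] += 1
        remote = REMOTE_RE.search(line)
        if remote:
            remotes[remote.group("remote")] += 1
    return hosts, remotes


# The two log lines from the original post, used here as sample input.
sample = [
    "2023-03-05 11:41:07,847 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph"
    "       [] - Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING"
    " to FAILED on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).",
    "org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:"
    " readAddress(..) failed: Connection timed out (connection to 'xxx/10.70.89.25:43923')",
]

hosts, remotes = tally_failures(sample)
print("failing hosts:", hosts.most_common())
print("remote endpoints:", remotes.most_common())
```

If the failing host is always machine A while the remote endpoints vary, that points at A. If one remote endpoint dominates instead, the machine behind that address deserves the same scrutiny.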




crazy
2463829830@qq.com


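Re: Flink job TM "Connection timed out" exception issue

Posted by Yuxin Tan <ta...@gmail.com>.
"The job runs fine when no process is scheduled onto this machine": judging from this description, it is indeed very likely that machine A has a problem.

Check whether the network, memory, and CPU metrics or monitoring of machine A look normal and whether they differ from those of the other machines, for example the network parameter configuration, whether the machine's memory is damaged, or whether the machine has abnormal processes or abnormal load.

If it is a hardware problem, the system logs may contain error messages. Machine inspection tools such as dmesg and vmstat can also help.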
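
One way to compare machine A's network with the other machines is an active TCP probe run between pairs of hosts. The following is only a rough sketch under stated assumptions: the function name `probe` and its parameters are hypothetical, and the demo connects to a local listener as a stand-in for a real TaskManager host and data port. A successful connect does not rule out the problem, because the exception in this thread is raised while reading on an already established connection.

```python
import socket
import time


def probe(host, port, timeout=3.0, attempts=3):
    """Open a TCP connection several times.

    Returns one entry per attempt: the connect time in seconds,
    or None if the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:  # covers timeouts and refused connections
            results.append(None)
    return results


# Demo against a local listener (stand-in for a real host and port).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(5)
demo_port = listener.getsockname()[1]

demo_results = probe("127.0.0.1", demo_port)
print("connect times (s):", demo_results)
listener.close()
```

Running the same probe from several machines toward A, and from A toward the others, and comparing failure counts and connect times may show whether A stands out.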

Best,
Yuxin


crazy <24...@qq.com.invalid> wrote on Monday, March 6, 2023 at 14:23:

> Hi all, a production job is failing over frequently. The exception log is as follows:
>
> 2023-03-05 11:41:07,847 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Process (287/300) (b3ef27fec49fe3777f830802ef3501e9) switched from RUNNING to FAILED on container_e26_1646120234560_82135_01_000097 @ xx.xx.xx.xx (dataPort=26882).
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to 'xxx/10.70.89.25:43923')
> 	at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.13.5.jar:1.13.5]
> 	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
>
>
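> Every failure is reported by a process on machine A: switched from RUNNING to FAILED on
> container_e26_1646120234560_82135_01_000097 @A (dataPort=26882).
> The load on machine A looks normal, and the job runs fine when no process is scheduled onto this machine, so we currently suspect that this machine is causing the problem. How should we go about troubleshooting this? Many thanks.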
>
>
> ------------------------------
> crazy
> 2463829830@qq.com