Posted to user-zh@flink.apache.org by yidan zhao <hi...@gmail.com> on 2021/06/16 07:36:10 UTC

flink job exception analysis (netty related, readAddress failed. connection timed out)

Attached is the exception stack from Flink's web UI. Has anyone else
run into this problem?

Flink 1.12 to Flink 1.13.1. Standalone cluster of 30 containers,
each with 28 GB of memory.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

Hi, here is the exception stack as text:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045')
    at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
    at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
    at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
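
A quick way to rule out basic reachability problems toward the address in the trace is a plain TCP connect probe. The sketch below is only an illustration in Python and is not part of Flink; the function name `probe` and the 3-second default timeout are assumptions. The trace above appears to report a failed read on an already established channel, so a successful connect does not by itself rule out the problem.

```python
import socket
import time
from typing import Optional


def probe(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect latency in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        # create_connection completes the TCP handshake or raises OSError
        # (refused, unreachable, or timed out after `timeout` seconds).
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

For example, calling `probe("10.35.215.18", 2045)` from the TaskManager host that logged the exception returns the handshake latency or `None`. Running it periodically and logging the results may show whether failures cluster in time.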

Robert Metzger <rm...@apache.org> wrote on Wed, Jun 16, 2021 at 4:26 PM:
>
> Hi Yidan,
> it seems that the attachment did not make it through the mailing list. Can
> you copy-paste the text of the exception here or upload the log somewhere?
>
>
>
> On Wed, Jun 16, 2021 at 9:36 AM yidan zhao <hi...@gmail.com> wrote:
>
> > Attachment is the exception stack from flink's web-ui. Does anyone
> > have also met this problem?
> >
> > Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> > each 28G mem.
> >

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by Robert Metzger <rm...@apache.org>.

Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can
you copy-paste the text of the exception here or upload the log somewhere?

On Wed, Jun 16, 2021 at 9:36 AM yidan zhao <hi...@gmail.com> wrote:

> Attachment is the exception stack from flink's web-ui. Does anyone
> have also met this problem?
>
> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> each 28G mem.
>

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

I also searched online and found many results. There are some related
exceptions, such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException,
but in my case it is
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException.
The difference is 'LocalTransportException' versus
'RemoteTransportException'.

yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 7:10 PM:
>
> Hi, yingjie.
> If the network is not stable, which config parameter I should adjust.
>
> yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >
> > 2: I use G1, and no full gc occurred, young gc count: 422, time:
> > 142892, so it is not bad.
> > 3: stream job.
> > 4: I will try to config taskmanager.network.retries which is default
> > 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> > is 120s.
> > 5: I checked the net fd number of the taskmanager, it is about 1000+,
> > so I think it is a reasonable value.
> >
> > 1: can not be sure.
> >
> > Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> > >
> > > Hi yidan,
> > >
> > > 1. Is the network stable?
> > > 2. Is there any GC problem?
> > > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> > > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> > > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> > >
> > > Hope this helps.
> > >
> > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> > > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> > > [3] https://issues.apache.org/jira/browse/FLINK-22643
> > >
> > > Best,
> > > Yingjie
> > >
> > > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> > >>
> > >> Attachment is the exception stack from flink's web-ui. Does anyone
> > >> have also met this problem?
> > >>
> > >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> > >> each 28G mem.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

Ok, I will try.

Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 8:00 PM:
>
> Maybe you can try to increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs.
>
> yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 7:10 PM:
>>
>> Hi, yingjie.
>> If the network is not stable, which config parameter I should adjust.
>>
>> yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> >
>> > 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> > 142892, so it is not bad.
>> > 3: stream job.
>> > 4: I will try to config taskmanager.network.retries which is default
>> > 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
>> > is 120s.
>> > 5: I checked the net fd number of the taskmanager, it is about 1000+,
>> > so I think it is a reasonable value.
>> >
>> > 1: can not be sure.
>> >
>> > Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> > >
>> > > Hi yidan,
>> > >
>> > > 1. Is the network stable?
>> > > 2. Is there any GC problem?
>> > > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> > > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
>> > > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>> > >
>> > > Hope this helps.
>> > >
>> > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> > > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> > > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> > >
>> > > Best,
>> > > Yingjie
>> > >
>> > > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> > >>
>> > >> Attachment is the exception stack from flink's web-ui. Does anyone
>> > >> have also met this problem?
>> > >>
>> > >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> > >> each 28G mem.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by Yingjie Cao <ke...@gmail.com>.

Maybe you can try increasing taskmanager.network.retries,
taskmanager.network.netty.server.backlog, and
taskmanager.network.netty.sendReceiveBufferSize. These options have been
useful for our jobs.

yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 7:10 PM:

> Hi, yingjie.
> If the network is not stable, which config parameter I should adjust.
>
> yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >
> > 2: I use G1, and no full gc occurred, young gc count: 422, time:
> > 142892, so it is not bad.
> > 3: stream job.
> > 4: I will try to config taskmanager.network.retries which is default
> > 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> > is 120s.
> > 5: I checked the net fd number of the taskmanager, it is about 1000+,
> > so I think it is a reasonable value.
> >
> > 1: can not be sure.
> >
> > Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> > >
> > > Hi yidan,
> > >
> > > 1. Is the network stable?
> > > 2. Is there any GC problem?
> > > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> > > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> > > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> > >
> > > Hope this helps.
> > >
> > > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> > > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> > > [3] https://issues.apache.org/jira/browse/FLINK-22643
> > >
> > > Best,
> > > Yingjie
> > >
> > > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> > >>
> > >> Attachment is the exception stack from flink's web-ui. Does anyone
> > >> have also met this problem?
> > >>
> > >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> > >> each 28G mem.
>

Re: Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

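I thought about it carefully. My cluster consists of containers on intranet servers, so traffic between the containers should not count as going through NAT.

That said, the network monitoring does show that many machines have quite a few connections in the TIME_WAIT state, around 50,000+, but I don't feel that should be enough to cause this problem.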
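
The TIME_WAIT figure mentioned above can be checked per machine or per container by counting sockets by state. The following is a minimal sketch in Python, assuming a Linux host where `/proc/net/tcp` is readable; the helper names `count_tcp_states` and `count_from_proc` are made up for illustration, and the state codes are the ones the Linux kernel uses in that file.

```python
from collections import Counter
from typing import Iterable

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines: Iterable[str]) -> Counter:
    """Count sockets per TCP state from the lines of /proc/net/tcp."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and blank lines; data rows start with "<n>:".
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        code = fields[3].upper()
        # Unknown codes are kept under their raw hex value.
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def count_from_proc(path: str = "/proc/net/tcp") -> Counter:
    """Read the given proc file and count its sockets by state."""
    with open(path) as f:
        return count_tcp_states(f)
```

`count_from_proc()["TIME_WAIT"]` gives the count for IPv4 sockets in the current network namespace. JVM processes often open dual-stack sockets, which are listed in `/proc/net/tcp6` instead, so it may be worth running `count_from_proc("/proc/net/tcp6")` as well.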

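东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 2:48 PM:
>
> With both of these enabled, the timestamps in connection requests from the same source IP must be increasing; otherwise (non-increasing) the connection request is treated as invalid and the packet is dropped, which the client experiences as occasional connection timeouts.
>
> Generally a single machine does not have this problem, because there is only one clock; it tends to show up behind NAT (because the clocks of multiple hosts are usually not exactly in sync). But I don't know your exact architecture, so all I can say is give it a try.
>
> Finally, you could discuss it with your ops team: unless you are sure that no connections arrive through NAT, it is best not to enable both.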
>
>
> PS: kernel 4.12 removed tcp_tw_recycle altogether, because too many people fell into this trap.
>
>
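> At 2021-06-17 14:07:50, "yidan zhao" <hi...@gmail.com> wrote:
> >What is the reasoning behind this? I can't make this change directly; I have to request it.
> >
> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 1:36 PM:
> >>
> >> Set one of them to 0.
> >>
> >> At 2021-06-17 13:11:01, "yidan zhao" <hi...@gmail.com> wrote:
> >> >Yes, it is the host machine's IP.
> >> >
> >> >net.ipv4.tcp_tw_reuse = 1
> >> >net.ipv4.tcp_timestamps = 1
> >> >
> >> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
> >> >>
> >> >> Is 10.35.215.18 the host machine's IP?
> >> >>
> >> >> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> >> >> If all else fails, use tcpdump.
> >> >>
> >> >> At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
> >> >> >@东东 A standalone cluster. It happens at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it because it is not very pronounced.
> >> >> >I investigated several of the exceptions; their timing matched network spikes, but not all of them do, so it is hard to say whether that is the cause.
> >> >> >
> >> >> >Also, there is one point I am not clear on: this error is rarely reported online. Similar reports are all
> >> >> >RemoteTransportException, with a message saying the taskmanager may have been lost and so on. Mine is
> >> >> >LocalTransportException, and I don't know whether these two errors mean different things in netty. So far I can't find much material online about these two exceptions.
> >> >> >
> >> >> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
> >> >> >>
> >> >> >> Single-machine standalone, or Docker/K8s?
> >> >> >>
> >> >> >> When does this exception occur: is it periodic, or is it correlated with changes in CPU, memory, or even network traffic?
> >> >> >>
> >> >> >> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: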
> >> >> >> >Hi, yingjie.
> >> >> >> >If the network is not stable, which config parameter I should adjust.
> >> >> >> >
> >> >> >> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >> >> >> >>
> >> >> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> >> >> >> >> 142892, so it is not bad.
> >> >> >> >> 3: stream job.
> >> >> >> >> 4: I will try to config taskmanager.network.retries which is default
> >> >> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> >> >> >> >> is 120s.
> >> >> >> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
> >> >> >> >> so I think it is a reasonable value.
> >> >> >> >>
> >> >> >> >> 1: can not be sure.
> >> >> >> >>
> >> >> >> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >> >> >> >
> >> >> >> >> > Hi yidan,
> >> >> >> >> >
> >> >> >> >> > 1. Is the network stable?
> >> >> >> >> > 2. Is there any GC problem?
> >> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> >> >> >> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> >> >> >> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> >> >> >> >> >
> >> >> >> >> > Hope this helps.
> >> >> >> >> >
> >> >> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> >> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >> >> >> >
> >> >> >> >> > Best,
> >> >> >> >> > Yingjie
> >> >> >> >> >
> >> >> >> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >> >> >> >>
> >> >> >> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
> >> >> >> >> >> have also met this problem?
> >> >> >> >> >>
> >> >> >> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> >> >> >> >> >> each 28G mem.

Re:Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by 东东 <do...@163.com>.

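With both of these enabled, the timestamps in connection requests from the same source IP must be increasing; otherwise (non-increasing) the connection request is treated as invalid and the packet is dropped, which the client experiences as occasional connection timeouts.

Generally a single machine does not have this problem, because there is only one clock; it tends to show up behind NAT (because the clocks of multiple hosts are usually not exactly in sync). But I don't know your exact architecture, so all I can say is give it a try.

Finally, you could discuss it with your ops team: unless you are sure that no connections arrive through NAT, it is best not to enable both.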


PS: kernel 4.12 removed tcp_tw_recycle altogether, because too many people fell into this trap.
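
To see which of these settings are actually in effect on a given host or container, the values can be read from `/proc/sys`. The following is a minimal sketch in Python, assuming Linux; `read_sysctl`, `report`, and `NAMES` are hypothetical names chosen for illustration. A sysctl that does not exist on the running kernel (for example one that has been removed) simply comes back as `None`.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional

# The settings discussed in this thread.
NAMES = [
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
]


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the value of a sysctl such as 'net.ipv4.tcp_timestamps', or None."""
    # 'net.ipv4.tcp_timestamps' maps to <root>/net/ipv4/tcp_timestamps.
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # The file is missing on this kernel or cannot be read.
        return None


def report(names: Iterable[str], root: str = "/proc/sys") -> Dict[str, Optional[str]]:
    """Read several sysctls at once and return them as a name-to-value dict."""
    return {name: read_sysctl(name, root) for name in names}
```

Calling `report(NAMES)` both inside a container and on its host may be worthwhile, since the two can report different values on kernels where these settings are kept per network namespace.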


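At 2021-06-17 14:07:50, "yidan zhao" <hi...@gmail.com> wrote:
>What is the reasoning behind this? I can't make this change directly; I have to request it.
>
>东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 1:36 PM:
>>
>> Set one of them to 0.
>>
>> At 2021-06-17 13:11:01, "yidan zhao" <hi...@gmail.com> wrote:
>> >Yes, it is the host machine's IP.
>> >
>> >net.ipv4.tcp_tw_reuse = 1
>> >net.ipv4.tcp_timestamps = 1
>> >
>> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
>> >>
>> >> Is 10.35.215.18 the host machine's IP?
>> >>
>> >> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
>> >> If all else fails, use tcpdump.
>> >>
>> >> At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
>> >> >@东东 A standalone cluster. It happens at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it because it is not very pronounced.
>> >> >I investigated several of the exceptions; their timing matched network spikes, but not all of them do, so it is hard to say whether that is the cause.
>> >> >
>> >> >Also, there is one point I am not clear on: this error is rarely reported online. Similar reports are all
>> >> >RemoteTransportException, with a message saying the taskmanager may have been lost and so on. Mine is
>> >> >LocalTransportException, and I don't know whether these two errors mean different things in netty. So far I can't find much material online about these two exceptions.
>> >> >
>> >> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
>> >> >>
>> >> >> Single-machine standalone, or Docker/K8s?
>> >> >>
>> >> >> When does this exception occur: is it periodic, or is it correlated with changes in CPU, memory, or even network traffic?
>> >> >>
>> >> >> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: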
>> >> >> >Hi, yingjie.
>> >> >> >If the network is not stable, which config parameter I should adjust.
>> >> >> >
>> >> >> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> >> >> >>
>> >> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> >> >> >> 142892, so it is not bad.
>> >> >> >> 3: stream job.
>> >> >> >> 4: I will try to config taskmanager.network.retries which is default
>> >> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
>> >> >> >> is 120s.
>> >> >> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
>> >> >> >> so I think it is a reasonable value.
>> >> >> >>
>> >> >> >> 1: can not be sure.
>> >> >> >>
>> >> >> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> >> >> >> >
>> >> >> >> > Hi yidan,
>> >> >> >> >
>> >> >> >> > 1. Is the network stable?
>> >> >> >> > 2. Is there any GC problem?
>> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> >> >> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
>> >> >> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>> >> >> >> >
>> >> >> >> > Hope this helps.
>> >> >> >> >
>> >> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> >> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> >> >> >> >
>> >> >> >> > Best,
>> >> >> >> > Yingjie
>> >> >> >> >
>> >> >> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> >> >> >> >>
>> >> >> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
>> >> >> >> >> have also met this problem?
>> >> >> >> >>
>> >> >> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> >> >> >> >> each 28G mem.

Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

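What is the reasoning behind this? I can't make this change directly; I have to request it.

东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 1:36 PM:
>
> Set one of them to 0.
>
> At 2021-06-17 13:11:01, "yidan zhao" <hi...@gmail.com> wrote:
> >Yes, it is the host machine's IP.
> >
> >net.ipv4.tcp_tw_reuse = 1
> >net.ipv4.tcp_timestamps = 1
> >
> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
> >>
> >> Is 10.35.215.18 the host machine's IP?
> >>
> >> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> >> If all else fails, use tcpdump.
> >>
> >> At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
> >> >@东东 A standalone cluster. It happens at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it because it is not very pronounced.
> >> >I investigated several of the exceptions; their timing matched network spikes, but not all of them do, so it is hard to say whether that is the cause.
> >> >
> >> >Also, there is one point I am not clear on: this error is rarely reported online. Similar reports are all
> >> >RemoteTransportException, with a message saying the taskmanager may have been lost and so on. Mine is
> >> >LocalTransportException, and I don't know whether these two errors mean different things in netty. So far I can't find much material online about these two exceptions.
> >> >
> >> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
> >> >>
> >> >> Single-machine standalone, or Docker/K8s?
> >> >>
> >> >> When does this exception occur: is it periodic, or is it correlated with changes in CPU, memory, or even network traffic?
> >> >>
> >> >> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: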
> >> >> >Hi, yingjie.
> >> >> >If the network is not stable, which config parameter I should adjust.
> >> >> >
> >> >> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >> >> >>
> >> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> >> >> >> 142892, so it is not bad.
> >> >> >> 3: stream job.
> >> >> >> 4: I will try to config taskmanager.network.retries which is default
> >> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> >> >> >> is 120s.
> >> >> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
> >> >> >> so I think it is a reasonable value.
> >> >> >>
> >> >> >> 1: can not be sure.
> >> >> >>
> >> >> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >> >> >
> >> >> >> > Hi yidan,
> >> >> >> >
> >> >> >> > 1. Is the network stable?
> >> >> >> > 2. Is there any GC problem?
> >> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> >> >> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> >> >> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> >> >> >> >
> >> >> >> > Hope this helps.
> >> >> >> >
> >> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >> >> >
> >> >> >> > Best,
> >> >> >> > Yingjie
> >> >> >> >
> >> >> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >> >> >>
> >> >> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
> >> >> >> >> have also met this problem?
> >> >> >> >>
> >> >> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> >> >> >> >> each 28G mem.

Re:Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by 东东 <do...@163.com>.


Set one of them to 0.


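At 2021-06-17 13:11:01, "yidan zhao" <hi...@gmail.com> wrote:
>Yes, it is the host machine's IP.
>
>net.ipv4.tcp_tw_reuse = 1
>net.ipv4.tcp_timestamps = 1
>
>东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
>>
>> Is 10.35.215.18 the host machine's IP?
>>
>> Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps.
>> If all else fails, use tcpdump.
>>
>>
>>
>> At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
>> >@东东 It is a standalone cluster. The exception occurs at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very obvious.
>> >I checked several of the exceptions: some coincided with network spikes, but not all of them, so it is hard to say whether that is the cause.
>> >
>> >Also, one thing I'm not clear about: this error is rarely reported online. Similar reports are all
>> >RemoteTransportException, with a message saying the TaskManager may have been lost. Mine is
>> >LocalTransportException, and I don't know whether the two errors mean different things in netty. So far I can't find much information online about either exception.
>> >
>> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
>> >>
>> >> Is it single-machine standalone, or Docker/K8s?
>> >>
>> >>
>> >>
>> >> Does the exception occur periodically, or does its timing correlate with changes in CPU, memory, or even network traffic?
>> >>
>> >>
>> >>
>> >> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: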
>> >> >Hi, yingjie.
>> >> >If the network is not stable, which config parameter I should adjust.
>> >> >
>> >> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> >> >>
>> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> >> >> 142892, so it is not bad.
>> >> >> 3: stream job.
>> >> >> 4: I will try to config taskmanager.network.retries which is default
>> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
>> >> >> is 120s.
>> >> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
>> >> >> so I think it is a reasonable value.
>> >> >>
>> >> >> 1: can not be sure.
>> >> >>
>> >> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> >> >> >
>> >> >> > Hi yidan,
>> >> >> >
>> >> >> > 1. Is the network stable?
>> >> >> > 2. Is there any GC problem?
>> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> >> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
>> >> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>> >> >> >
>> >> >> > Hope this helps.
>> >> >> >
>> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> >> >> >
>> >> >> > Best,
>> >> >> > Yingjie
>> >> >> >
>> >> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> >> >> >>
>> >> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
>> >> >> >> have also met this problem?
>> >> >> >>
>> >> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> >> >> >> each 28G mem.

Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

Yes, it is the host machine's IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1

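东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 12:52 PM:
>
> Is 10.35.215.18 the host machine's IP?
>
> Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps.
> If all else fails, use tcpdump.
>
>
>
> At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
> >@东东 It is a standalone cluster. The exception occurs at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very obvious.
> >I checked several of the exceptions: some coincided with network spikes, but not all of them, so it is hard to say whether that is the cause.
> >
> >Also, one thing I'm not clear about: this error is rarely reported online. Similar reports are all
> >RemoteTransportException, with a message saying the TaskManager may have been lost. Mine is
> >LocalTransportException, and I don't know whether the two errors mean different things in netty. So far I can't find much information online about either exception.
> >
> >东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
> >>
> >> Is it single-machine standalone, or Docker/K8s?
> >>
> >>
> >>
> >> Does the exception occur periodically, or does its timing correlate with changes in CPU, memory, or even network traffic?
> >>
> >>
> >>
> >> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: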
> >> >Hi, yingjie.
> >> >If the network is not stable, which config parameter I should adjust.
> >> >
> >> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >> >>
> >> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> >> >> 142892, so it is not bad.
> >> >> 3: stream job.
> >> >> 4: I will try to config taskmanager.network.retries which is default
> >> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> >> >> is 120s.
> >> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
> >> >> so I think it is a reasonable value.
> >> >>
> >> >> 1: can not be sure.
> >> >>
> >> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >> >
> >> >> > Hi yidan,
> >> >> >
> >> >> > 1. Is the network stable?
> >> >> > 2. Is there any GC problem?
> >> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> >> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> >> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> >> >> >
> >> >> > Hope this helps.
> >> >> >
> >> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >> >
> >> >> > Best,
> >> >> > Yingjie
> >> >> >
> >> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >> >>
> >> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
> >> >> >> have also met this problem?
> >> >> >>
> >> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> >> >> >> each 28G mem.

Re:Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by 东东 <do...@163.com>.

Is 10.35.215.18 the host machine's IP?

Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps.
If all else fails, use tcpdump.
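
The check above can be sketched as a small script. This is only an illustration under stated assumptions: it assumes a Linux host that exposes these settings under /proc/sys/net/ipv4, and the function names read_sysctls and recycle_with_timestamps are made up for this example. Some background: tcp_tw_recycle was removed in Linux 4.12, so on newer kernels the file is simply absent, and having tcp_tw_recycle enabled together with tcp_timestamps is a commonly reported cause of dropped connection attempts when clients sit behind NAT.

```python
from pathlib import Path

# Assumed location of the IPv4 TCP settings on a Linux host.
SYSCTL_DIR = Path("/proc/sys/net/ipv4")
NAMES = ("tcp_tw_recycle", "tcp_tw_reuse", "tcp_timestamps")


def read_sysctls(base=SYSCTL_DIR, names=NAMES):
    """Return {name: int value, or None if the setting cannot be read}."""
    values = {}
    for name in names:
        path = Path(base) / name
        try:
            values[name] = int(path.read_text().split()[0])
        except (OSError, ValueError, IndexError):
            # Missing file (e.g. tcp_tw_recycle on kernels >= 4.12) or odd content.
            values[name] = None
    return values


def recycle_with_timestamps(values):
    """True when tcp_tw_recycle and tcp_timestamps are both set to 1."""
    return values.get("tcp_tw_recycle") == 1 and values.get("tcp_timestamps") == 1


if __name__ == "__main__":
    current = read_sysctls()
    print(current)
    if recycle_with_timestamps(current):
        print("tcp_tw_recycle and tcp_timestamps are both 1")
```

Run it on each TaskManager host; a value of None only means the setting could not be read on that kernel.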



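At 2021-06-17 12:41:58, "yidan zhao" <hi...@gmail.com> wrote:
>@东东 It is a standalone cluster. The exception occurs at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very obvious.
>I checked several of the exceptions: some coincided with network spikes, but not all of them, so it is hard to say whether that is the cause.
>
>Also, one thing I'm not clear about: this error is rarely reported online. Similar reports are all
>RemoteTransportException, with a message saying the TaskManager may have been lost. Mine is
>LocalTransportException, and I don't know whether the two errors mean different things in netty. So far I can't find much information online about either exception.
>
>东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
>>
>> Is it single-machine standalone, or Docker/K8s?
>>
>>
>>
>> Does the exception occur periodically, or does its timing correlate with changes in CPU, memory, or even network traffic?
>>
>>
>>
>> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: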
>> >Hi, yingjie.
>> >If the network is not stable, which config parameter I should adjust.
>> >
>> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> >>
>> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> >> 142892, so it is not bad.
>> >> 3: stream job.
>> >> 4: I will try to config taskmanager.network.retries which is default
>> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
>> >> is 120s.
>> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
>> >> so I think it is a reasonable value.
>> >>
>> >> 1: can not be sure.
>> >>
>> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> >> >
>> >> > Hi yidan,
>> >> >
>> >> > 1. Is the network stable?
>> >> > 2. Is there any GC problem?
>> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
>> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>> >> >
>> >> > Hope this helps.
>> >> >
>> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> >> >
>> >> > Best,
>> >> > Yingjie
>> >> >
>> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> >> >>
>> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
>> >> >> have also met this problem?
>> >> >>
>> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> >> >> each 28G mem.

Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

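@东东 It is a standalone cluster. The exception occurs at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very obvious.
I checked several of the exceptions: some coincided with network spikes, but not all of them, so it is hard to say whether that is the cause.

Also, one thing I'm not clear about: this error is rarely reported online. Similar reports are all
RemoteTransportException, with a message saying the TaskManager may have been lost. Mine is
LocalTransportException, and I don't know whether the two errors mean different things in netty. So far I can't find much information online about either exception.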

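东东 <do...@163.com> wrote on Thu, Jun 17, 2021 at 11:19 AM:
>
> Is it single-machine standalone, or Docker/K8s?
>
>
>
> Does the exception occur periodically, or does its timing correlate with changes in CPU, memory, or even network traffic?
>
>
>
> At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote: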
> >Hi, yingjie.
> >If the network is not stable, which config parameter I should adjust.
> >
> >yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
> >>
> >> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> >> 142892, so it is not bad.
> >> 3: stream job.
> >> 4: I will try to config taskmanager.network.retries which is default
> >> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> >> is 120s.
> >> 5: I checked the net fd number of the taskmanager, it is about 1000+,
> >> so I think it is a reasonable value.
> >>
> >> 1: can not be sure.
> >>
> >> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >> >
> >> > Hi yidan,
> >> >
> >> > 1. Is the network stable?
> >> > 2. Is there any GC problem?
> >> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> >> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> >> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> >> >
> >> > Hope this helps.
> >> >
> >> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> >> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> >> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >> >
> >> > Best,
> >> > Yingjie
> >> >
> >> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >> >>
> >> >> Attachment is the exception stack from flink's web-ui. Does anyone
> >> >> have also met this problem?
> >> >>
> >> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> >> >> each 28G mem.

Re:Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by 东东 <do...@163.com>.

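Is it single-machine standalone, or Docker/K8s?



Does the exception occur periodically, or does its timing correlate with changes in CPU, memory, or even network traffic?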



At 2021-06-16 19:10:24, "yidan zhao" <hi...@gmail.com> wrote:
>Hi, yingjie.
>If the network is not stable, which config parameter I should adjust.
>
>yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>>
>> 2: I use G1, and no full gc occurred, young gc count: 422, time:
>> 142892, so it is not bad.
>> 3: stream job.
>> 4: I will try to config taskmanager.network.retries which is default
>> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
>> is 120s.
>> 5: I checked the net fd number of the taskmanager, it is about 1000+,
>> so I think it is a reasonable value.
>>
>> 1: can not be sure.
>>
>> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>> >
>> > Hi yidan,
>> >
>> > 1. Is the network stable?
>> > 2. Is there any GC problem?
>> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
>> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
>> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>> >
>> > Hope this helps.
>> >
>> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
>> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
>> > [3] https://issues.apache.org/jira/browse/FLINK-22643
>> >
>> > Best,
>> > Yingjie
>> >
>> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>> >>
>> >> Attachment is the exception stack from flink's web-ui. Does anyone
>> >> have also met this problem?
>> >>
>> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> >> each 28G mem.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

Hi, Yingjie.
If the network is not stable, which config parameter should I adjust?

yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 6:56 PM:
>
> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> 142892, so it is not bad.
> 3: stream job.
> 4: I will try to config taskmanager.network.retries which is default
> 0, and taskmanager.network.netty.client.connectTimeoutSec 's default
> is 120s.
> 5: I checked the net fd number of the taskmanager, it is about 1000+,
> so I think it is a reasonable value.
>
> 1: can not be sure.
>
> Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
> >
> > Hi yidan,
> >
> > 1. Is the network stable?
> > 2. Is there any GC problem?
> > 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> > 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> > 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
> >
> > Hope this helps.
> >
> > [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> > [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> > [3] https://issues.apache.org/jira/browse/FLINK-22643
> >
> > Best,
> > Yingjie
> >
> > yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
> >>
> >> Attachment is the exception stack from flink's web-ui. Does anyone
> >> have also met this problem?
> >>
> >> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> >> each 28G mem.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by yidan zhao <hi...@gmail.com>.

2: I use G1, and no full GC occurred. Young GC count: 422, time:
142892, so it is not bad.
3: It is a streaming job.
4: I will try configuring taskmanager.network.retries, whose default is
0, and taskmanager.network.netty.client.connectTimeoutSec, whose default
is 120s.
5: I checked the number of network fds of the TaskManager; it is about 1000+,
so I think it is a reasonable value.

1: I cannot be sure.

Yingjie Cao <ke...@gmail.com> wrote on Wed, Jun 16, 2021 at 4:34 PM:
>
> Hi yidan,
>
> 1. Is the network stable?
> 2. Is there any GC problem?
> 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
> 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options can be found in 'Data Transport Network Stack' section of [2].
> 5. If it is not the above cases, it is may related to [3], you may need to check the number of tcp connection per TM and node.
>
> Hope this helps.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
> [3] https://issues.apache.org/jira/browse/FLINK-22643
>
> Best,
> Yingjie
>
> yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:
>>
>> Attachment is the exception stack from flink's web-ui. Does anyone
>> have also met this problem?
>>
>> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
>> each 28G mem.

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

Posted by Yingjie Cao <ke...@gmail.com>.

Hi yidan,

1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle, see [1] for more
information.
4. You may try configuring these two options: taskmanager.network.retries and
taskmanager.network.netty.client.connectTimeoutSec. More relevant options
can be found in the 'Data Transport Network Stack' section of [2].
5. If it is none of the above, it may be related to [3]; you may need to
check the number of TCP connections per TM and per node.
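
For item 5, one way to count TCP connections per state on a Linux node is sketched below. It is a minimal sketch, assuming the standard /proc/net/tcp text layout (the state is the fourth column, as a hex code); the function name count_tcp_states and the SAMPLE text are illustrative and not part of Flink.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from text in /proc/net/tcp format."""
    counts = Counter()
    lines = proc_net_tcp_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed rows
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Hand-written sample rows for illustration only (not captured from a real host).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:07FD 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 12D7230A:07FD 13D7230A:C350 01 00000000:00000000 00:00000000 00000000  1000        0 12346
   2: 12D7230A:9C40 13D7230A:07FD 06 00000000:00000000 00:00000000 00000000     0        0 0
"""

if __name__ == "__main__":
    print(dict(count_tcp_states(SAMPLE)))
```

On a real node, pass the contents of /proc/net/tcp (and /proc/net/tcp6 for IPv6 sockets) instead of SAMPLE. That file covers the whole network namespace, so the result is a per-node count rather than a per-TaskManager one.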

Hope this helps.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/batch/blocking_shuffle/
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/
[3] https://issues.apache.org/jira/browse/FLINK-22643

Best,
Yingjie

yidan zhao <hi...@gmail.com> wrote on Wed, Jun 16, 2021 at 3:36 PM:

> Attachment is the exception stack from flink's web-ui. Does anyone
> have also met this problem?
>
> Flink1.12 - Flink1.13.1.  Standalone Cluster, include 30 containers,
> each 28G mem.
>
