You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user-zh@flink.apache.org by yidan zhao <hi...@gmail.com> on 2022/09/27 08:21:42 UTC

PartitionNotFoundException

打开了TM的debug日志后发现很多这种日志:
Responding with error: class
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException

目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
ChannelPromise)(Unknown Source)

不清楚和debug的那个日志是否有关呢?

然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。

Re: PartitionNotFoundException

Posted by Shammon FY <zj...@gmail.com>.
一般如果是发生failover或者重启时短时间出现这个信息是没关系的,Flink会自己恢复;如果一直出现并且无法恢复,可以结合WebUI查看一下具体是哪些task没有部署成功

On Thu, Sep 29, 2022 at 10:23 AM yidan zhao <hi...@gmail.com> wrote:

> 嗯,谢谢建议,等再出现问题我试试,现在重启后还好,目前感觉是长时间运行后的集群才会出现。
>
> Lijie Wang <wa...@gmail.com> 于2022年9月29日周四 10:17写道:
> >
> > Hi,
> >
> > 可以尝试增大一下 taskmanager.network.request-backoff.max 的值。默认值是 10000,也就是 10 s。
> > 上下游可能是并发部署的,所以是有可能下游请求 partition 时,上游还没部署完成,增大
> taskmanager.network.request-backoff.max 可以增加下游的等待时间和重试次数,减小出现
> PartitionNotFoundException 的概率。
> >
> > Best,
> > Lijie
> >
> > yidan zhao <hi...@gmail.com> 于2022年9月28日周三 17:35写道:
> >>
> >> 按照flink的设计,存在上游还没部署成功,下游就开始请求 partition 的情况吗? 此外,上游没有部署成功一般会有相关日志不?
> >>
> >> 我目前重启了集群后OK了,在等段时间,看看还会不会出现。
> >>
> >> Shammon FY <zj...@gmail.com> 于2022年9月28日周三 15:45写道:
> >> >
> >> > Hi
> >> >
> >> > 计算任务输出PartitionNotFoundException,原因是它向上游TaskManager发送partition
> request请求,上游TaskManager的netty server接收到partition request后发现它请求的上游计算任务没有部署成功。
> >> >
> 所以从这个异常错误来看netty连接是通的,你可能需要根据输出PartitionNotFoundException信息的计算任务,查一下它的上游计算任务为什么没有部署成功
> >> >
> >> > On Tue, Sep 27, 2022 at 10:20 PM yidan zhao <hi...@gmail.com>
> wrote:
> >> >>
> >> >> 补充:flink1.15.2版本,standalone集群,基于zk的ha。
> >> >> 环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。
> >> >>
> >> >> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
> >> >> >
> >> >> > 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
> >> >> > 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
> >> >> > 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
> >> >> >
> >> >> > yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
> >> >> > >
> >> >> > > 打开了TM的debug日志后发现很多这种日志:
> >> >> > > Responding with error: class
> >> >> > > org.apache.flink.runtime.io
> .network.partition.PartitionNotFoundException
> >> >> > >
> >> >> > > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
> >> >> > > org.apache.flink.runtime.io
> .network.netty.exception.LocalTransportException:
> >> >> > > Sending the partition request to '/10.216.187.171:8709 (#0)'
> failed.
> >> >> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> >> >> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> >> >> > > at java.lang.Thread.run(Thread.java:748)
> >> >> > > Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> >> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> >> >> > > ChannelPromise)(Unknown Source)
> >> >> > >
> >> >> > > 不清楚和debug的那个日志是否有关呢?
> >> >> > >
> >> >> > > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。
>

Re: PartitionNotFoundException

Posted by yidan zhao <hi...@gmail.com>.
嗯,谢谢建议,等再出现问题我试试,现在重启后还好,目前感觉是长时间运行后的集群才会出现。

Lijie Wang <wa...@gmail.com> 于2022年9月29日周四 10:17写道:
>
> Hi,
>
> 可以尝试增大一下 taskmanager.network.request-backoff.max 的值。默认值是 10000,也就是 10 s。
> 上下游可能是并发部署的,所以是有可能下游请求 partition 时,上游还没部署完成,增大 taskmanager.network.request-backoff.max 可以增加下游的等待时间和重试次数,减小出现 PartitionNotFoundException 的概率。
>
> Best,
> Lijie
>
> yidan zhao <hi...@gmail.com> 于2022年9月28日周三 17:35写道:
>>
>> 按照flink的设计,存在上游还没部署成功,下游就开始请求 partition 的情况吗? 此外,上游没有部署成功一般会有相关日志不?
>>
>> 我目前重启了集群后OK了,在等段时间,看看还会不会出现。
>>
>> Shammon FY <zj...@gmail.com> 于2022年9月28日周三 15:45写道:
>> >
>> > Hi
>> >
>> > 计算任务输出PartitionNotFoundException,原因是它向上游TaskManager发送partition request请求,上游TaskManager的netty server接收到partition request后发现它请求的上游计算任务没有部署成功。
>> > 所以从这个异常错误来看netty连接是通的,你可能需要根据输出PartitionNotFoundException信息的计算任务,查一下它的上游计算任务为什么没有部署成功
>> >
>> > On Tue, Sep 27, 2022 at 10:20 PM yidan zhao <hi...@gmail.com> wrote:
>> >>
>> >> 补充:flink1.15.2版本,standalone集群,基于zk的ha。
>> >> 环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。
>> >>
>> >> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
>> >> >
>> >> > 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
>> >> > 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
>> >> > 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
>> >> >
>> >> > yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
>> >> > >
>> >> > > 打开了TM的debug日志后发现很多这种日志:
>> >> > > Responding with error: class
>> >> > > org.apache.flink.runtime.io.network.partition.PartitionNotFoundException
>> >> > >
>> >> > > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
>> >> > > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
>> >> > > Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
>> >> > > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
>> >> > > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>> >> > > at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> >> > > at java.lang.Thread.run(Thread.java:748)
>> >> > > Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
>> >> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
>> >> > > ChannelPromise)(Unknown Source)
>> >> > >
>> >> > > 不清楚和debug的那个日志是否有关呢?
>> >> > >
>> >> > > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。

Re: PartitionNotFoundException

Posted by Lijie Wang <wa...@gmail.com>.
Hi,

可以尝试增大一下 taskmanager.network.request-backoff.max 的值。默认值是 10000,也就是 10 s。
上下游可能是并发部署的,所以是有可能下游请求 partition 时,上游还没部署完成,增大
taskmanager.network.request-backoff.max 可以增加下游的等待时间和重试次数,减小出现
PartitionNotFoundException 的概率。

Best,
Lijie

yidan zhao <hi...@gmail.com> 于2022年9月28日周三 17:35写道:

> 按照flink的设计,存在上游还没部署成功,下游就开始请求 partition 的情况吗? 此外,上游没有部署成功一般会有相关日志不?
>
> 我目前重启了集群后OK了,在等段时间,看看还会不会出现。
>
> Shammon FY <zj...@gmail.com> 于2022年9月28日周三 15:45写道:
> >
> > Hi
> >
> > 计算任务输出PartitionNotFoundException,原因是它向上游TaskManager发送partition
> request请求,上游TaskManager的netty server接收到partition request后发现它请求的上游计算任务没有部署成功。
> >
> 所以从这个异常错误来看netty连接是通的,你可能需要根据输出PartitionNotFoundException信息的计算任务,查一下它的上游计算任务为什么没有部署成功
> >
> > On Tue, Sep 27, 2022 at 10:20 PM yidan zhao <hi...@gmail.com> wrote:
> >>
> >> 补充:flink1.15.2版本,standalone集群,基于zk的ha。
> >> 环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。
> >>
> >> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
> >> >
> >> > 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
> >> > 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
> >> > 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
> >> >
> >> > yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
> >> > >
> >> > > 打开了TM的debug日志后发现很多这种日志:
> >> > > Responding with error: class
> >> > > org.apache.flink.runtime.io
> .network.partition.PartitionNotFoundException
> >> > >
> >> > > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
> >> > > org.apache.flink.runtime.io
> .network.netty.exception.LocalTransportException:
> >> > > Sending the partition request to '/10.216.187.171:8709 (#0)'
> failed.
> >> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> >> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> >> > > at java.lang.Thread.run(Thread.java:748)
> >> > > Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> >> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> >> > > ChannelPromise)(Unknown Source)
> >> > >
> >> > > 不清楚和debug的那个日志是否有关呢?
> >> > >
> >> > > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。
>

Re: PartitionNotFoundException

Posted by yidan zhao <hi...@gmail.com>.
按照flink的设计,存在上游还没部署成功,下游就开始请求 partition 的情况吗? 此外,上游没有部署成功一般会有相关日志不?

我目前重启了集群后OK了,在等段时间,看看还会不会出现。

Shammon FY <zj...@gmail.com> 于2022年9月28日周三 15:45写道:
>
> Hi
>
> 计算任务输出PartitionNotFoundException,原因是它向上游TaskManager发送partition request请求,上游TaskManager的netty server接收到partition request后发现它请求的上游计算任务没有部署成功。
> 所以从这个异常错误来看netty连接是通的,你可能需要根据输出PartitionNotFoundException信息的计算任务,查一下它的上游计算任务为什么没有部署成功
>
> On Tue, Sep 27, 2022 at 10:20 PM yidan zhao <hi...@gmail.com> wrote:
>>
>> 补充:flink1.15.2版本,standalone集群,基于zk的ha。
>> 环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。
>>
>> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
>> >
>> > 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
>> > 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
>> > 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
>> >
>> > yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
>> > >
>> > > 打开了TM的debug日志后发现很多这种日志:
>> > > Responding with error: class
>> > > org.apache.flink.runtime.io.network.partition.PartitionNotFoundException
>> > >
>> > > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
>> > > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
>> > > Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
>> > > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
>> > > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>> > > at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> > > at java.lang.Thread.run(Thread.java:748)
>> > > Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
>> > > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
>> > > ChannelPromise)(Unknown Source)
>> > >
>> > > 不清楚和debug的那个日志是否有关呢?
>> > >
>> > > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。

Re: PartitionNotFoundException

Posted by Shammon FY <zj...@gmail.com>.
Hi

计算任务输出PartitionNotFoundException,原因是它向上游TaskManager发送partition
request请求,上游TaskManager的netty server接收到partition request后发现它请求的上游计算任务没有部署成功。
所以从这个异常错误来看netty连接是通的,你可能需要根据输出PartitionNotFoundException信息的计算任务,查一下它的上游计算任务为什么没有部署成功

On Tue, Sep 27, 2022 at 10:20 PM yidan zhao <hi...@gmail.com> wrote:

> 补充:flink1.15.2版本,standalone集群,基于zk的ha。
> 环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。
>
> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
> >
> > 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
> > 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
> > 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
> >
> > yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
> > >
> > > 打开了TM的debug日志后发现很多这种日志:
> > > Responding with error: class
> > > org.apache.flink.runtime.io
> .network.partition.PartitionNotFoundException
> > >
> > > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
> > > org.apache.flink.runtime.io
> .network.netty.exception.LocalTransportException:
> > > Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> > > at org.apache.flink.runtime.io
> .network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> > > at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> > > at java.lang.Thread.run(Thread.java:748)
> > > Caused by:
> org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> > > at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> > > ChannelPromise)(Unknown Source)
> > >
> > > 不清楚和debug的那个日志是否有关呢?
> > >
> > > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。
>

Re: PartitionNotFoundException

Posted by yidan zhao <hi...@gmail.com>.
补充:flink1.15.2版本,standalone集群,基于zk的ha。
环境是公司自研容器环境。3个容器启JM+HistoryServer。剩下几百个容器都是TM。每个TM提供1个slot。

yidan zhao <hi...@gmail.com> 于2022年9月27日周二 22:07写道:
>
> 此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
> 我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
> 会不会剩下400的TM的连接,时间厂了就会出现某种问题?
>
> yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
> >
> > 打开了TM的debug日志后发现很多这种日志:
> > Responding with error: class
> > org.apache.flink.runtime.io.network.partition.PartitionNotFoundException
> >
> > 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
> > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> > Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
> > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> > at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
> > at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
> > at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> > at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> > at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> > at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> > ChannelPromise)(Unknown Source)
> >
> > 不清楚和debug的那个日志是否有关呢?
> >
> > 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。

Re: PartitionNotFoundException

Posted by yidan zhao <hi...@gmail.com>.
此外,今天还做了个尝试,貌似和长时间没重启TM有关?重启后频率低很多会。
我预留的TM很多,比如500个TM,每个TM就提供1个slot,任务可能只用100个TM。
会不会剩下400的TM的连接,时间厂了就会出现某种问题?

yidan zhao <hi...@gmail.com> 于2022年9月27日周二 16:21写道:
>
> 打开了TM的debug日志后发现很多这种日志:
> Responding with error: class
> org.apache.flink.runtime.io.network.partition.PartitionNotFoundException
>
> 目前问题的直观表现是:提交任务后,一直报 LocalTransportException:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
> Sending the partition request to '/10.216.187.171:8709 (#0)' failed.
> at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
> at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:133)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1017)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:878)
> at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
> at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object,
> ChannelPromise)(Unknown Source)
>
> 不清楚和debug的那个日志是否有关呢?
>
> 然后都是什么原因呢这个问题,之前一直怀疑是网络原因,一直也不知道啥原因。今天开了debug才发现有这么个debug报错。