Posted to user@flink.apache.org by "Zhijiang(wangzhijiang999)" <wa...@aliyun.com> on 2017/04/04 16:03:31 UTC

Re: PartitionNotFoundException on deploying streaming job

Hi Kamil, 
     When the producer receives a PartitionRequest from a downstream task, it first checks whether the requested partition is already registered. If it is not, it responds with a PartitionNotFoundException. Only once the upstream task has been submitted and begins to run does it register all of its partitions with the ResultPartitionManager. So in your case the partition request arrives before the partition registration. Maybe the upstream task is submitted with a delay by the JobManager, or some logic delays the task registration in the NetworkEnvironment. You can debug the specific status on the upstream side when it responds with PartitionNotFound to track down the reason. Looking forward to your further findings!
Cheers,
Zhijiang
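
As a minimal sketch of the ordering problem described above (the class and method names here are my own simplification, not Flink's actual ResultPartitionManager API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration only: a registry that the producer populates when it
// starts and that consumer requests are checked against.
class PartitionRegistry {

    private final Map<String, Object> registeredPartitions = new ConcurrentHashMap<>();

    // Called once the upstream task is running and sets up its result partitions.
    void registerPartition(String partitionId, Object partition) {
        registeredPartitions.put(partitionId, partition);
    }

    // Called when a PartitionRequest arrives from a downstream task.
    Object requestPartition(String partitionId) throws Exception {
        Object partition = registeredPartitions.get(partitionId);
        if (partition == null) {
            // The request arrived before the producer registered the partition,
            // which is the situation that surfaces as PartitionNotFoundException.
            throw new Exception("PartitionNotFoundException: " + partitionId);
        }
        return partition;
    }
}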
------------------------------------------------------------------
From: Kamil Dziublinski <ka...@gmail.com>
Sent: Tuesday, April 4, 2017 17:20
To: user <us...@flink.apache.org>
Subject: PartitionNotFoundException on deploying streaming job
Hi guys,
When I run my streaming job, I almost always get a PartitionNotFoundException initially. The job fails, then restarts and runs fine after that. I wonder what is causing this and whether I can adjust some parameters to avoid the initial failure.
I have a Flink session on YARN with 55 task managers, 4 cores and 4 GB per TM. This setup uses 77% of my YARN cluster.
Any ideas?
Thanks,
Kamil.

Re: PartitionNotFoundException on deploying streaming job

Posted by "Zhijiang(wangzhijiang999)" <wa...@aliyun.com>.
Yes, it would be an improvement to add some timeout/wait mechanism to avoid such an exception, as you mentioned. Currently the LAZY_FROM_SOURCES schedule mode can avoid this issue, but that schedule mode is not recommended for streaming jobs. Maybe there are other ways to work around it once you find the reason for your case. Please share what you find!
Cheers,
Zhijiang
------------------------------------------------------------------
From: Kamil Dziublinski <ka...@gmail.com>
Sent: Wednesday, April 5, 2017 16:07
To: Zhijiang(wangzhijiang999) <wa...@aliyun.com>
Cc: user <us...@flink.apache.org>
Subject: Re: PartitionNotFoundException on deploying streaming job
Ok, thanks, I will try to debug it. But my initial thought was that it should be possible to increase some timeout/wait value to avoid it, since it only occurs during the initial start and everything works fine after the restart. Any idea of such a property in Flink?
On Tue, Apr 4, 2017 at 6:03 PM, Zhijiang(wangzhijiang999) <wa...@aliyun.com> wrote:
Hi Kamil, 
     When the producer receives a PartitionRequest from a downstream task, it first checks whether the requested partition is already registered. If it is not, it responds with a PartitionNotFoundException. Only once the upstream task has been submitted and begins to run does it register all of its partitions with the ResultPartitionManager. So in your case the partition request arrives before the partition registration. Maybe the upstream task is submitted with a delay by the JobManager, or some logic delays the task registration in the NetworkEnvironment. You can debug the specific status on the upstream side when it responds with PartitionNotFound to track down the reason. Looking forward to your further findings!
Cheers,
Zhijiang
------------------------------------------------------------------
From: Kamil Dziublinski <ka...@gmail.com>
Sent: Tuesday, April 4, 2017 17:20
To: user <us...@flink.apache.org>
Subject: PartitionNotFoundException on deploying streaming job
Hi guys,
When I run my streaming job, I almost always get a PartitionNotFoundException initially. The job fails, then restarts and runs fine after that. I wonder what is causing this and whether I can adjust some parameters to avoid the initial failure.
I have a Flink session on YARN with 55 task managers, 4 cores and 4 GB per TM. This setup uses 77% of my YARN cluster.
Any ideas?
Thanks,
Kamil.
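
As a hedged pointer for the debugging suggested above: raising the log level for Flink's network runtime on the task managers can show when partitions are registered versus when they are requested. The logger names below are an assumption about where this is logged and may need adjusting for your Flink version; in log4j.properties, for example:

# Log partition registration and request handling in the network stack (assumed package names).
log4j.logger.org.apache.flink.runtime.io.network=DEBUG
log4j.logger.org.apache.flink.runtime.taskmanager.Task=DEBUG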



Re: PartitionNotFoundException on deploying streaming job

Posted by Kamil Dziublinski <ka...@gmail.com>.
Ok, thanks, I will try to debug it.
But my initial thought was that it should be possible to increase some
timeout/wait value to avoid it, since it only occurs during the initial
start and everything works fine after the restart.
Any idea of such a property in Flink?
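
As a sketch of the kind of property in question: newer Flink versions expose
network request backoff settings that control how long a downstream task keeps
retrying a partition request before failing with PartitionNotFoundException.
Whether these keys exist under the same names in your release is an assumption
on my part, so please verify against your version's configuration docs; in
flink-conf.yaml:

# Initial and maximum backoff (ms) for repeated partition requests (key names assumed; check your Flink version).
taskmanager.network.request-backoff.initial: 100
taskmanager.network.request-backoff.max: 20000

A larger maximum backoff gives the upstream task more time to register its
result partitions before the consumer's request gives up.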

On Tue, Apr 4, 2017 at 6:03 PM, Zhijiang(wangzhijiang999) <
wangzhijiang999@aliyun.com> wrote:

> Hi Kamil,
>
>      When the producer receives a PartitionRequest from a downstream task,
> it first checks whether the requested partition is already registered.
> If it is not, it responds with a PartitionNotFoundException.
> Only once the upstream task has been submitted and begins to run does it
> register all of its partitions with the ResultPartitionManager. So in your
> case the partition request arrives before the partition registration.
> Maybe the upstream task is submitted with a delay by the JobManager, or
> some logic delays the task registration in the NetworkEnvironment. You can
> debug the specific status on the upstream side when it responds with
> PartitionNotFound to track down the reason. Looking forward to your
> further findings!
>
> Cheers,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Kamil Dziublinski <ka...@gmail.com>
> Sent: Tuesday, April 4, 2017 17:20
> To: user <us...@flink.apache.org>
> Subject: PartitionNotFoundException on deploying streaming job
>
> Hi guys,
>
> When I run my streaming job, I almost always get a
> PartitionNotFoundException initially. The job fails, then restarts and runs fine after that.
> I wonder what is causing this and whether I can adjust some parameters to
> avoid the initial failure.
>
> I have a Flink session on YARN with 55 task managers, 4 cores and 4 GB per TM.
> This setup uses 77% of my YARN cluster.
>
> Any ideas?
>
> Thanks,
> Kamil.
>
>
>