Posted to user@flink.apache.org by Benchao Li <li...@gmail.com> on 2020/02/12 13:16:56 UTC

Re: Encountered error while consuming partitions

Hi 建刚,

We ran into a similar issue internally. I have looped in 光辉, who solved
it for us.


刘建刚 <li...@gmail.com> wrote on Wed, Feb 12, 2020 at 8:53 PM:

>       I am using Flink 1.6.2 and have a job consisting of a map and a
> window. Everything ran normally for a long time. After a network
> adjustment, all jobs restarted except this one. It kept running, but
> some tasks of its window operator could not receive any data. I checked
> the log and found the following exception:
>
>       The related code is below:
>       I suspect that some TCP connections were broken and the upstream
> task released its resources, so the downstream task could not receive any
> data. Is my understanding right, and why did the job not fail? Can someone
> help me? Thank you very much.
>
>
>
>

-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenchao@gmail.com; libenchao@pku.edu.cn

Re: Encountered error while consuming partitions

Posted by Piotr Nowojski <pi...@ververica.com>.

Hi 刘建刚,

Could you explain how you fixed the problem in your case? Did you modify the Flink code to use `IdleStateHandler`?

Piotrek

> On 13 Feb 2020, at 11:10, 刘建刚 <li...@gmail.com> wrote:
> 
> Thanks for all the help. Following the advice, I have fixed the problem.

Re: Encountered error while consuming partitions

Posted by 刘建刚 <li...@gmail.com>.

Thanks for all the help. Following the advice, I have fixed the problem.

Re: Encountered error while consuming partitions

Posted by Zhijiang <wa...@aliyun.com>.

Thanks for reporting this issue; I agree with the analysis below. We actually hit the same issue several years ago and also solved it via the Netty idle handler.

Let's track it in ticket [1] as the next step.

[1] https://issues.apache.org/jira/browse/FLINK-16030

Best,
Zhijiang


Re: Encountered error while consuming partitions

Posted by 张光辉 <be...@gmail.com>.

Networks can fail in many ways, some of them quite subtle (e.g. a high
packet loss ratio).

The problem is that the long-lived TCP connection between the Netty client
and server was lost. The server then failed to send a message to the
client and shut down the channel. The Netty client does not know that the
connection has been closed, so it keeps waiting.

There are two ways to detect whether a long-lived TCP connection between
the Netty client and server is still alive: TCP keepalive and heartbeats.
TCP keepalive kicks in after 2 hours by default. When the error occurs, if
you wait for 2 hours, the Netty client will raise an exception and the job
will enter failover recovery.
If you want to detect a dead connection quickly, Netty provides
IdleStateHandler, which can be used to build a ping-pong mechanism: if the
Netty client sends n consecutive ping messages without receiving a single
pong message, it triggers an exception.
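
To make the "n pings without a pong" rule concrete, here is a minimal
sketch of the counting logic in plain Java. It is not Netty or Flink code:
HeartbeatMonitor, onIdle and onPongReceived are hypothetical names chosen
for illustration. In a real Netty client, onIdle would presumably be
called from the handler that receives the idle events fired by
IdleStateHandler, and the ping itself would be written to the channel at
that point.

```java
/**
 * Hypothetical helper (not part of Netty or Flink): counts pings that
 * have not been answered by a pong and reports a dead connection once
 * the configured limit is reached.
 */
final class HeartbeatMonitor {
    private final int maxUnansweredPings;
    private int unansweredPings;

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Call whenever the connection has been idle for one heartbeat interval.
     * Throws if the limit of unanswered pings was already reached; otherwise
     * records that the caller is about to send one more ping.
     */
    synchronized void onIdle() {
        if (unansweredPings >= maxUnansweredPings) {
            throw new IllegalStateException(
                    "no pong received after " + unansweredPings + " pings");
        }
        unansweredPings++;
    }

    /** Call whenever a pong arrives; the peer is alive, so reset the counter. */
    synchronized void onPongReceived() {
        unansweredPings = 0;
    }

    synchronized int unansweredPings() {
        return unansweredPings;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        try {
            while (true) {
                // A real client would write a ping to the channel here.
                monitor.onIdle();
                System.out.println("ping sent, unanswered = " + monitor.unansweredPings());
            }
        } catch (IllegalStateException e) {
            // This is where the task would fail and failover recovery would start.
            System.out.println("connection considered dead: " + e.getMessage());
        }
    }
}
```

With a heartbeat interval of t seconds and a limit of n, a silent peer is
detected after roughly (n + 1) * t seconds, instead of waiting for the
TCP keepalive timer.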
