You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2019/12/31 15:32:00 UTC
[jira] [Comment Edited] (FLINK-15416) Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel

    [ https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006134#comment-17006134 ] 

Piotr Nowojski edited comment on FLINK-15416 at 12/31/19 3:31 PM:
------------------------------------------------------------------

I'm not sure. I think the motivating example that you provided is not very compelling. After all the proper fix is/was to just restart/fix the network and setting the number of retries is pretty arbitrary (why 2? why not 3?) and still can fail.

On the other hand, If we wanted to handle some more generic networking failures, we would need to handle a lot more things, like connection lost, timeouts, etc, not only connection failed, which also could impact restarting/recovery times.


was (Author: pnowojski):
I'm not sure. I think the motivating example that you provided is not very compelling. After all the proper fix is/was to just restart/fix the network and setting the number of retries is pretty arbitrarily and still can fail.

On the other hand, If we wanted to handle some more generic networking failures, we would need to handle a lot more things, like connection lost, timeouts, etc, not only connection failed, which also could impact restarting/recovery times.

> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
>                 Key: FLINK-15416
>                 URL: https://issues.apache.org/jira/browse/FLINK-15416
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / Network
>    Affects Versions: 1.10.0
>            Reporter: Zhenqiu Huang
>            Priority: Major
>
> We run a flink with 256 TMs in production. The job internally has keyby logic. Thus, it builds a 256 * 256 communication channels. An outage happened when there is a chip internal link of one of the network switchs broken that connecting these machines. During the outage, the flink can't restart successfully as there is always an exception like  "Connecting the channel failed: Connecting to remote task manager + '****/10.14.139.6:41300' has failed. This might indicate that the remote task manager has been lost. 
> After deep investigation with the network infrastructure team, we found there are 6 switchs connecting with these machines. Each switch has 32 physcal links. Every socket is round-robin assigned to each of links for load balances. Thus, there is always average 256 * 256 / 6 * 32  * 2 = 170 channels will be assigned to the broken link. The issue lasted for 4 hours until we found the broken link and restart the problematic switch. 
> Given this, we found that the retry of creating channel will help to resolve this issue. For our networking topology, we can set retry to 2. As 170 / (132 * 132) < 1, which means after retry twice no channel in 170 channels will be assigned to the broken link in the average case.
> I think it is valuable fix for this kind of partial network partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)