Posted to issues@flink.apache.org by "Gary Yao (JIRA)" <ji...@apache.org> on 2018/05/17 12:57:00 UTC

[jira] [Updated] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

     [ https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-6160:
----------------------------
    Affects Version/s: 1.5.0

>  Retry JobManager/ResourceManager connection in case of timeout
> ---------------------------------------------------------------
>
>                 Key: FLINK-6160
>                 URL: https://issues.apache.org/jira/browse/FLINK-6160
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to the remote component. It also assumes that the component has actually failed and therefore only tries to reconnect once it is notified of a new leader address and leader session ID. This is brittle, because a heartbeat can also time out without the component having crashed. We should therefore add an automatic retry against the last known leader address information in case of a timeout.
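
The retry proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not Flink's implementation: the class LeaderRetrySketch, its methods, the connector callback and the maxAttempts bound are all hypothetical names introduced here. The sketch remembers the last leader address and session id it was notified about and, on a heartbeat timeout, retries the connection against that information instead of waiting for a new leader notification.

```java
import java.util.UUID;
import java.util.function.BiPredicate;

/** Hypothetical sketch; none of these names are taken from Flink. */
final class LeaderRetrySketch {

    // Assumed callback: tries to connect and returns true on success.
    private final BiPredicate<String, UUID> connector;
    // Assumed bound on retries so the loop always terminates.
    private final int maxAttempts;

    // Last leader information we were notified about (null until then).
    private String lastLeaderAddress;
    private UUID lastLeaderSessionId;
    private boolean connected;

    LeaderRetrySketch(BiPredicate<String, UUID> connector, int maxAttempts) {
        this.connector = connector;
        this.maxAttempts = maxAttempts;
    }

    /** Called when a new leader address and session id are announced. */
    void notifyNewLeader(String address, UUID sessionId) {
        lastLeaderAddress = address;
        lastLeaderSessionId = sessionId;
        connected = connector.test(address, sessionId);
    }

    /**
     * Called on a heartbeat timeout. Drops the connection, then retries
     * against the last known leader rather than assuming it has failed.
     * Returns the number of connection attempts that were made.
     */
    int onHeartbeatTimeout() {
        connected = false;
        if (lastLeaderAddress == null) {
            return 0; // nothing known yet, wait for a leader notification
        }
        int attempts = 0;
        while (!connected && attempts < maxAttempts) {
            attempts++;
            connected = connector.test(lastLeaderAddress, lastLeaderSessionId);
        }
        return attempts;
    }

    boolean isConnected() {
        return connected;
    }
}
```

A real implementation would presumably retry asynchronously with a delay or backoff between attempts; the synchronous loop here only keeps the sketch short.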



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)