Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2016/07/22 19:49:20 UTC

[jira] [Closed] (FLINK-3411) Failed recovery can lead to removal of HA state

     [ https://issues.apache.org/jira/browse/FLINK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ufuk Celebi closed FLINK-3411.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.1.0

Fixed in FLINK-2733 and FLINK-4201.

> Failed recovery can lead to removal of HA state
> -----------------------------------------------
>
>                 Key: FLINK-3411
>                 URL: https://issues.apache.org/jira/browse/FLINK-3411
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>            Reporter: Ufuk Celebi
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> When a standby job manager recovers a job and recovery of the checkpoint state or of the job itself fails, the job manager might eventually remove the job once all retries are exhausted. This removes the job/checkpoint state from ZooKeeper and the state backend, making it impossible to ever recover the job again.
> We should never exhaust job retries in the HA case.
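
To illustrate the failure mode described above, here is a minimal sketch in Java. It is not Flink code and does not reflect the actual fix in FLINK-2733 or FLINK-4201. Every name in it (HaRecoverySketch, HaStateStore, RecoveryAttempt, recover, the haEnabled flag) is hypothetical, and the in-memory map only stands in for the state kept in ZooKeeper and the state backend. The sketch assumes one possible reading of the proposal: when HA is enabled, a failed recovery must leave the stored state untouched so that a later attempt can still succeed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the bug and one possible guard; not a Flink API. */
public class HaRecoverySketch {

    /** In-memory stand-in for job/checkpoint state in ZooKeeper and the state backend. */
    public static final class HaStateStore {
        private final Map<String, String> state = new HashMap<>();

        public void put(String jobId, String stateHandle) {
            state.put(jobId, stateHandle);
        }

        public void remove(String jobId) {
            state.remove(jobId);
        }

        public boolean contains(String jobId) {
            return state.containsKey(jobId);
        }
    }

    /** A single recovery attempt; returns true if the job was recovered. */
    public interface RecoveryAttempt {
        boolean tryRecover(String jobId);
    }

    /**
     * Tries to recover the job up to maxAttempts times.
     *
     * Without HA, exhausting the attempts removes the job's stored state,
     * which is the behaviour the issue describes. With HA enabled (an
     * assumption of this sketch), the state is kept even after all attempts
     * fail, so the job stays recoverable.
     */
    public static boolean recover(String jobId,
                                  HaStateStore store,
                                  RecoveryAttempt attempt,
                                  int maxAttempts,
                                  boolean haEnabled) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.tryRecover(jobId)) {
                return true;
            }
        }
        if (!haEnabled) {
            // Retries exhausted: the job is dropped together with its state.
            store.remove(jobId);
        }
        return false;
    }
}
```

With haEnabled set to false, a recovery that fails on every attempt deletes the entry for the job. With haEnabled set to true, the entry survives and a later call to recover can still find it.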


