Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2022/01/05 11:39:00 UTC

[jira] [Comment Edited] (FLINK-25486) Perjob can not recover from checkpoint when zookeeper leader changes

    [ https://issues.apache.org/jira/browse/FLINK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469226#comment-17469226 ] 

Till Rohrmann edited comment on FLINK-25486 at 1/5/22, 11:38 AM:
-----------------------------------------------------------------

Hi [~Jiangang], thanks for reporting this issue. I think this is indeed a bug and should be fixed. The problem seems, as you described, to be that the {{MiniDispatcher}} completes the {{shutDownFuture}} even for states that are not globally terminal.
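To illustrate the distinction, here is a minimal sketch (not Flink's actual code; the class and method names below are hypothetical, and only the SUSPENDED-versus-globally-terminal semantics mirror Flink's {{JobStatus}}): HA data should only be cleaned up when the job can never be recovered by another leader.

```java
// Hypothetical sketch of the cleanup decision; not Flink's real classes.
public class DispatcherSketch {

    // Subset of Flink-like job statuses. SUSPENDED is terminal only
    // locally (a new leader may resume the job), so it must not count
    // as globally terminal.
    enum JobStatus {
        FINISHED(true), FAILED(true), CANCELED(true), SUSPENDED(false);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    // It is only safe to clean up the HA (zk) data when the job reached
    // a globally terminal state and will never be recovered again.
    static boolean shouldCleanUpHaData(JobStatus terminalState) {
        return terminalState.isGloballyTerminalState();
    }

    public static void main(String[] args) {
        // A SUSPENDED job (e.g. after a zk leader change) keeps its HA
        // data so the new appMaster can recover from the checkpoint.
        System.out.println(shouldCleanUpHaData(JobStatus.SUSPENDED));
        System.out.println(shouldCleanUpHaData(JobStatus.FINISHED));
    }
}
```

Under this sketch, a suspension would never trigger HA cleanup, while a genuinely finished, failed, or canceled job still would.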

Do you want to work on it?

cc [~dmvk].


was (Author: till.rohrmann):
Hi [~Jiangang], thanks for reporting this issue. I think this is indeed a bug and should be fixed. Do you want to work on it? How will you fix it?

cc [~dmvk].

> Perjob can not recover from checkpoint when zookeeper leader changes
> --------------------------------------------------------------------
>
>                 Key: FLINK-25486
>                 URL: https://issues.apache.org/jira/browse/FLINK-25486
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.13.5, 1.14.2
>            Reporter: Liu
>            Priority: Critical
>             Fix For: 1.15.0, 1.13.6, 1.14.3
>
>
> When the config high-availability.zookeeper.client.tolerate-suspended-connections is left at its default of false, the appMaster fails over whenever the zk leader changes. In this case, the old appMaster cleans up all the zk info, so the new appMaster will not recover from the latest checkpoint.
> The process is as following:
>  # Start a perJob application.
>  # Kill zk's leader node, which causes the perJob to suspend.
>  # In MiniDispatcher's jobReachedTerminalState method, shutDownFuture is completed with UNKNOWN.
>  # The future is propagated to ClusterEntrypoint, whose shutdown method is then called with cleanupHaData set to true.
>  # The zk data is cleaned up and the process exits.
>  # The new appMaster finds no checkpoints to start from, so the state is lost.
> Since the job can recover automatically when the zk leader changes, it is reasonable to keep the zk info for the coming recovery.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)