Posted to dev@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/09/26 09:58:00 UTC

[jira] [Created] (FLINK-29415) InitializationFailure when recovering from a checkpoint in Application Mode leads to the cleanup of all HA data

Matthias Pohl created FLINK-29415:
-------------------------------------

             Summary: InitializationFailure when recovering from a checkpoint in Application Mode leads to the cleanup of all HA data
                 Key: FLINK-29415
                 URL: https://issues.apache.org/jira/browse/FLINK-29415
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.15.2, 1.16.0, 1.17.0, 1.14.6
            Reporter: Matthias Pohl


This issue was raised in the user ML thread [JobManager restarts on job failure|https://lists.apache.org/thread/qkmcty3h4gkkx5g09m19gwqrf8z8d383]. Recovering from an external checkpoint is handled differently than recovering from internal HA state (see [Dispatcher#handleJobManagerRunner|https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L651]). For the latter case, we explicitly do a local cleanup (i.e. no HA data is cleaned up). For the case described in the ML thread, a global cleanup is performed instead. That is not a problem in session mode, where each submission gets a new job ID. In Application Mode, however, we use the default job ID `0`, which is reused across restarts. Since all HA data is "namespaced" by that default job ID, a global cleanup after such a failure removes all of the related HA data.
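
To make the failure mode more concrete, here is a minimal, self-contained Java sketch of the described behavior. All type, field, and method names below (CleanupSketch, CleanupJobState, haStore, onJobManagerRunnerFailure, the zero job ID constant, the checkpoint path) are hypothetical stand-ins for illustration only and are not the actual Dispatcher or HA-services API; the logic just mirrors, in simplified form, the local-vs-global cleanup distinction described above.

    import java.util.HashMap;
    import java.util.Map;

    public class CleanupSketch {

        enum CleanupJobState { LOCAL, GLOBAL }

        // Hypothetical stand-in for the HA store; entries are "namespaced" by job ID.
        static final Map<String, String> haStore = new HashMap<>();

        // Application Mode reuses one fixed default job ID across restarts.
        static final String DEFAULT_APPLICATION_JOB_ID = "00000000000000000000000000000000";

        // Simplified version of the cleanup decision: an initialization failure while
        // recovering from internal HA state only triggers a local cleanup, whereas the
        // ML-thread case (failure while recovering from an external checkpoint) ends in
        // a global cleanup.
        static CleanupJobState onJobManagerRunnerFailure(boolean recoveredFromHaState) {
            return recoveredFromHaState ? CleanupJobState.LOCAL : CleanupJobState.GLOBAL;
        }

        static void cleanup(String jobId, CleanupJobState state) {
            if (state == CleanupJobState.GLOBAL) {
                // Global cleanup drops the HA entries for this job ID. In Application
                // Mode that ID is the shared default ID, so the data is gone for good.
                haStore.remove(jobId);
            }
            // Local cleanup leaves the HA data in place so a later attempt can recover.
        }

        public static void main(String[] args) {
            haStore.put(DEFAULT_APPLICATION_JOB_ID, "s3://bucket/checkpoints/chk-42");

            // Initialization fails while recovering from an external checkpoint:
            CleanupJobState state = onJobManagerRunnerFailure(false);
            cleanup(DEFAULT_APPLICATION_JOB_ID, state);

            // Prints "null": the next start of the same application (same default
            // job ID) no longer finds any HA data to recover from.
            System.out.println("HA entry after failure: " + haStore.get(DEFAULT_APPLICATION_JOB_ID));
        }
    }

In session mode the same global cleanup is harmless because a resubmission gets a fresh job ID and therefore a fresh HA namespace; only the reuse of the fixed default job ID in Application Mode turns it into data loss.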



--
This message was sent by Atlassian Jira
(v8.20.10#820010)