
Posted to issues@flink.apache.org by "Yanfei Lei (Jira)" <ji...@apache.org> on 2023/03/01 12:26:00 UTC

[jira] [Commented] (FLINK-30863) Do not delete the local changelog file of aborted checkpoint

    [ https://issues.apache.org/jira/browse/FLINK-30863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695042#comment-17695042 ] 

Yanfei Lei commented on FLINK-30863:
------------------------------------

Looking at the fileNotFound problem in local recovery again, I found that my previous analysis was incorrect:

For a given checkpoint, notifyAbort() cannot arrive after notifyComplete() on the TM.

If materialization happens on the TM before it receives confirm(), the previously uploaded queue in `FsStateChangelogWriter` is cleared, so the local files of the completed checkpoint are not registered again. The JM-owned files, by contrast, are registered before confirm() and do not depend on the uploaded queue. As a result, the local files are deleted while the DFS files are still there.

I added `testLocalFileAfterMaterialize` to simulate this scenario, and I think local files should be registered before confirm() to avoid this problem.
[~roman] [~Feifan Wang] could you please take another look?
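
To make the ordering issue described above concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: `ToyWriter`, its methods, and the file name are hypothetical and are not Flink APIs, and the cleanup rule (any local file that was never registered gets deleted) is an assumed simplification of the real behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Toy model only: none of these names are Flink classes or methods. */
class LocalChangelogRegistrationSketch {

    /** Hypothetical stand-in for a changelog writer that tracks uploaded local files. */
    static final class ToyWriter {
        private final Queue<String> uploaded = new ArrayDeque<>();
        private final Set<String> registeredLocalFiles = new LinkedHashSet<>();
        private final Set<String> localFilesOnDisk = new LinkedHashSet<>();
        private final boolean registerBeforeConfirm;

        ToyWriter(boolean registerBeforeConfirm) {
            this.registerBeforeConfirm = registerBeforeConfirm;
        }

        /** A local file is written and remembered in the uploaded queue. */
        void upload(String localFile) {
            localFilesOnDisk.add(localFile);
            uploaded.add(localFile);
            if (registerBeforeConfirm) {
                // Proposed ordering: register right away, independent of the queue.
                registeredLocalFiles.add(localFile);
            }
        }

        /** Materialization clears the uploaded queue. */
        void materialize() {
            uploaded.clear();
        }

        /** confirm() registers whatever is still in the uploaded queue. */
        void confirm() {
            registeredLocalFiles.addAll(uploaded);
        }

        /** Assumed cleanup rule: unregistered local files are deleted. */
        List<String> cleanUpAndListSurvivors() {
            localFilesOnDisk.retainAll(registeredLocalFiles);
            return new ArrayList<>(localFilesOnDisk);
        }
    }

    /** Runs upload, then materialize, then confirm, and returns the surviving local files. */
    static List<String> survivingLocalFiles(boolean registerBeforeConfirm) {
        ToyWriter writer = new ToyWriter(registerBeforeConfirm);
        writer.upload("changelog-1.local");
        writer.materialize(); // happens before confirm(), as in the scenario above
        writer.confirm();
        return writer.cleanUpAndListSurvivors();
    }

    public static void main(String[] args) {
        System.out.println("register on confirm:     " + survivingLocalFiles(false));
        System.out.println("register before confirm: " + survivingLocalFiles(true));
    }
}
```

In this model, registering only on confirm() leaves no surviving local file once materialization has cleared the queue, whereas registering at upload time keeps `changelog-1.local` available.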

 

> Do not delete the local changelog file of aborted checkpoint
> ------------------------------------------------------------
>
>                 Key: FLINK-30863
>                 URL: https://issues.apache.org/jira/browse/FLINK-30863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: tm-log_fail_cl_local_recovery.txt
>
>
> Do not delete the local changelog files of an aborted checkpoint, because that checkpoint may contain files from the previous checkpoint that local recovery would use. The local files of the aborted checkpoint are deleted when the next checkpoint completes, or when the entire allocation folder is deleted as the TM process exits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)