Posted to issues@flink.apache.org by "fanrui (Jira)" <ji...@apache.org> on 2022/10/07 08:40:00 UTC

[jira] [Commented] (FLINK-28474) ChannelStateWriteResult may not fail after checkpoint abort

    [ https://issues.apache.org/jira/browse/FLINK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613959#comment-17613959 ] 

fanrui commented on FLINK-28474:
--------------------------------

[~pnowojski]  thanks for your help and review~

> ChannelStateWriteResult may not fail after checkpoint abort
> -----------------------------------------------------------
>
>                 Key: FLINK-28474
>                 URL: https://issues.apache.org/jira/browse/FLINK-28474
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.5, 1.15.1
>            Reporter: fanrui
>            Assignee: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>         Attachments: image-2022-07-09-22-21-24-417.png
>
>
> After a checkpoint is aborted, its ChannelStateWriteResult should fail.
> But if _channelStateWriter.start(id, checkpointOptions);_ is executed after the checkpoint abort, the ChannelStateWriteResult will not fail.
>  
> h2. Cause Analysis:
> When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); may not have been executed yet. The IDs of such checkpoints are stored in the abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it checks whether the checkpointId should be aborted.
> _ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here._
> The unit test can reproduce this bug.
> !image-2022-07-09-22-21-24-417.png|width=803,height=307!
>  
> Note: channelStateWriter.abort is only called in notifyCheckpointAborted; it doesn't account for channelStateWriter.start being called after notifyCheckpointAborted.
> JIRA: FLINK-17869
> commit: https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e
>  
> The bug will affect the new feature FLINK-26803, because a channel state file can be closed only after the checkpoints of all tasks sharing the file have completed or aborted. So when the checkpoint of some tasks fails and abort is not called, the file cannot be closed, the tasks sharing the file cannot execute inputChannelStateHandles.completeExceptionally(e); and resultSubpartitionStateHandles.completeExceptionally(e);, and AsyncCheckpointRunnable will wait forever.
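
For readers following the cause analysis quoted above, here is a minimal Java sketch of the ordering problem. It is a toy model, not Flink's code: ToyWriter, ToyCoordinator, the abortWriterInCheckpointState flag and resultFailsAfterLateStart are hypothetical names, and the real ChannelStateWriter and SubtaskCheckpointCoordinatorImpl are considerably more involved. It only illustrates why an abort that arrives before start leaves the result pending unless abort is also called from checkpointState.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Toy model only: every class here is a hypothetical stand-in, not Flink's implementation.
public class AbortBeforeStartSketch {

    /** Stand-in for a channel state writer: one pending result per started checkpoint. */
    static class ToyWriter {
        final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

        void start(long checkpointId) {
            results.put(checkpointId, new CompletableFuture<>());
        }

        // Fails the result if the checkpoint was started; otherwise there is nothing to fail.
        void abort(long checkpointId, Throwable cause) {
            CompletableFuture<Void> result = results.get(checkpointId);
            if (result != null) {
                result.completeExceptionally(cause);
            }
        }
    }

    /** Stand-in for the subtask checkpoint coordinator. */
    static class ToyCoordinator {
        final ToyWriter writer = new ToyWriter();
        final Set<Long> abortedCheckpointIds = new HashSet<>();
        final boolean abortWriterInCheckpointState; // hypothetical switch for the proposed fix

        ToyCoordinator(boolean abortWriterInCheckpointState) {
            this.abortWriterInCheckpointState = abortWriterInCheckpointState;
        }

        void notifyCheckpointAborted(long checkpointId) {
            // Remember the id, then abort the writer (a no-op if start has not run yet).
            abortedCheckpointIds.add(checkpointId);
            writer.abort(checkpointId, new RuntimeException("checkpoint aborted"));
        }

        void checkpointState(long checkpointId) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                if (abortWriterInCheckpointState) {
                    // The extra abort suggested in the issue description.
                    writer.abort(checkpointId, new RuntimeException("aborted before start"));
                }
                return;
            }
            // The normal snapshot path is omitted from this sketch.
        }
    }

    /** Replays abort -> start -> checkpointState and reports whether the result failed. */
    static boolean resultFailsAfterLateStart(boolean applyFix) {
        ToyCoordinator coordinator = new ToyCoordinator(applyFix);
        long checkpointId = 1L;
        coordinator.notifyCheckpointAborted(checkpointId); // abort arrives first
        coordinator.writer.start(checkpointId);            // start runs afterwards
        coordinator.checkpointState(checkpointId);
        return coordinator.writer.results.get(checkpointId).isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("without fix, result failed: " + resultFailsAfterLateStart(false));
        System.out.println("with fix, result failed: " + resultFailsAfterLateStart(true));
    }
}
```

In this model the result stays pending without the extra abort and fails with it, which mirrors the behaviour the issue describes. How the actual patch handles the case may differ.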



--
This message was sent by Atlassian Jira
(v8.20.10#820010)