Posted to issues@flink.apache.org by "fanrui (Jira)" <ji...@apache.org> on 2022/07/09 14:26:00 UTC
[jira] [Updated] (FLINK-28474) ChannelStateWriteResult may not fail after checkpoint abort
[ https://issues.apache.org/jira/browse/FLINK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
fanrui updated FLINK-28474:
---------------------------
Description:
After a checkpoint is aborted, its ChannelStateWriteResult should fail.
But if _channelStateWriter.start(id, checkpointOptions);_ is executed after the checkpoint abort, the ChannelStateWriteResult will not fail.
h2. Cause Analysis:
When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); may not have been executed yet. In that case the checkpointId is stored in the abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called later, it checks whether the checkpointId should be aborted.
_ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here._
This bug can be reproduced with a unit test.
!image-2022-07-09-22-21-24-417.png|width=803,height=307!
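The ordering problem described above can be sketched with a toy model. This is not Flink's code: ToyChannelStateWriter, ToyCoordinator, startChannelState and the abortWriterOnLateStart flag are hypothetical stand-ins, and a CompletableFuture stands in for ChannelStateWriteResult. It only illustrates why an abort that arrives before start leaves the result pending unless the writer is aborted again when checkpointState finds the id in abortedCheckpointIds.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;

public class AbortBeforeStartSketch {

    /** Toy stand-in for the channel state writer (hypothetical, not Flink's class). */
    static class ToyChannelStateWriter {
        private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

        /** Registers a pending write result for the checkpoint. */
        CompletableFuture<Void> start(long checkpointId) {
            return results.computeIfAbsent(checkpointId, id -> new CompletableFuture<>());
        }

        /** Fails the write result if one exists; a no-op if start() has not run yet. */
        void abort(long checkpointId, Throwable cause) {
            CompletableFuture<Void> result = results.get(checkpointId);
            if (result != null) {
                result.completeExceptionally(cause);
            }
        }
    }

    /** Toy stand-in for the subtask checkpoint coordinator (hypothetical). */
    static class ToyCoordinator {
        private final ToyChannelStateWriter writer = new ToyChannelStateWriter();
        private final Set<Long> abortedCheckpointIds = new HashSet<>();
        private final boolean abortWriterOnLateStart;

        ToyCoordinator(boolean abortWriterOnLateStart) {
            this.abortWriterOnLateStart = abortWriterOnLateStart;
        }

        void abortCheckpoint(long checkpointId, Throwable cause) {
            abortedCheckpointIds.add(checkpointId); // remembered for a later checkpointState call
            writer.abort(checkpointId, cause);      // does nothing if start() has not run yet
        }

        CompletableFuture<Void> startChannelState(long checkpointId) {
            return writer.start(checkpointId);
        }

        void checkpointState(long checkpointId) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                if (abortWriterOnLateStart) {
                    // The extra abort proposed in the description.
                    writer.abort(checkpointId,
                            new CancellationException("checkpoint " + checkpointId + " was aborted"));
                }
                return; // skip the snapshot for an aborted checkpoint
            }
            // Normal snapshot path omitted in this sketch.
        }
    }

    /** Runs abort -> start -> checkpointState and returns the write result. */
    static CompletableFuture<Void> runScenario(boolean abortWriterOnLateStart) {
        ToyCoordinator coordinator = new ToyCoordinator(abortWriterOnLateStart);
        coordinator.abortCheckpoint(1L, new CancellationException("aborted early"));
        CompletableFuture<Void> result = coordinator.startChannelState(1L);
        coordinator.checkpointState(1L);
        return result;
    }

    public static void main(String[] args) {
        System.out.println("without extra abort, failed: "
                + runScenario(false).isCompletedExceptionally());
        System.out.println("with extra abort, failed: "
                + runScenario(true).isCompletedExceptionally());
    }
}
```

In this model the first scenario leaves the result incomplete forever, while the second fails it as soon as checkpointState sees the aborted id.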
was:
After a checkpoint is aborted, its ChannelStateWriteResult should fail.
But if _channelStateWriter.start(id, checkpointOptions);_ is executed after the checkpoint abort, the ChannelStateWriteResult will not fail.
h2. Cause Analysis:
When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); may not have been executed yet. In that case the checkpointId is stored in the abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called later, it checks whether the checkpointId should be aborted.
_ChannelStateWriter.abort(checkpointId, exception, true) should also be executed here._
!image-2022-07-09-22-21-24-417.png|width=803,height=307!
> ChannelStateWriteResult may not fail after checkpoint abort
> -----------------------------------------------------------
>
> Key: FLINK-28474
> URL: https://issues.apache.org/jira/browse/FLINK-28474
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.5, 1.15.1
> Reporter: fanrui
> Priority: Major
> Fix For: 1.16.0, 1.15.2, 1.14.6
>
> Attachments: image-2022-07-09-22-21-24-417.png
--
This message was sent by Atlassian Jira
(v8.20.10#820010)