Posted to issues@flink.apache.org by "vinoyang (JIRA)" <ji...@apache.org> on 2019/07/31 09:48:00 UTC

[jira] [Comment Edited] (FLINK-13497) Checkpoints can complete after CheckpointFailureManager fails job

    [ https://issues.apache.org/jira/browse/FLINK-13497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896988#comment-16896988 ] 

vinoyang edited comment on FLINK-13497 at 7/31/19 9:47 AM:
-----------------------------------------------------------

On further reflection, the issue described by [~till.rohrmann] exists even without FLINK-12364. While a job is failing, the pending checkpoints and the checkpoint coordinator can still keep working normally. What's more, some of the failed instances in FLINK-9900 happened before FLINK-12364 was merged.

However, the number of failures did increase significantly after FLINK-12364 was merged.


was (Author: yanghua):
On further reflection, the issue described by [~till.rohrmann] exists even without FLINK-12364. When users call {{setFailOnCheckpointingErrors(true)}} and a checkpoint fails on the TM, the decline message is sent to the JM and triggers the failure of the job. The pending checkpoints and the checkpoint coordinator still keep working normally. What's more, some of the failed instances in FLINK-9900 happened before FLINK-12364 was merged.

However, the number of failures did increase significantly after FLINK-12364 was merged.

> Checkpoints can complete after CheckpointFailureManager fails job
> -----------------------------------------------------------------
>
>                 Key: FLINK-13497
>                 URL: https://issues.apache.org/jira/browse/FLINK-13497
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.9.0
>
>
> I think that with FLINK-12364 we introduced an inconsistency with respect to job termination and checkpointing. In FLINK-9900 it was discovered that checkpoints can complete even after the {{CheckpointFailureManager}} has decided to fail a job. I think the expected behaviour should be that we fail all pending checkpoints once the {{CheckpointFailureManager}} decides to fail the job.
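
The expected behaviour described in the issue can be illustrated with a toy model. This is only a sketch under assumptions: {{PendingCheckpointSketch}} and all of its methods are hypothetical names invented for illustration, and it does not reflect the actual {{CheckpointCoordinator}} or {{CheckpointFailureManager}} code. It shows one way to make "fail the job" and "abort all pending checkpoints" a single atomic step, so that a late acknowledgement can no longer complete a checkpoint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model; NOT Flink's real checkpoint coordinator. */
final class PendingCheckpointSketch {
    private final Set<Long> pending = new HashSet<>();     // ids of in-flight checkpoints
    private final List<Long> completed = new ArrayList<>(); // ids of completed checkpoints
    private boolean jobFailed = false;

    /** Registers a new pending checkpoint; refused once the job has been failed. */
    synchronized boolean trigger(long checkpointId) {
        if (jobFailed) {
            return false;
        }
        return pending.add(checkpointId);
    }

    /** Handles an acknowledgement; returns true only if the checkpoint completed. */
    synchronized boolean acknowledge(long checkpointId) {
        // A checkpoint aborted by failJob() is no longer pending, so a late
        // acknowledgement is dropped instead of completing it.
        if (jobFailed || !pending.remove(checkpointId)) {
            return false;
        }
        completed.add(checkpointId);
        return true;
    }

    /** Models the failure decision: mark the job failed and abort everything pending. */
    synchronized void failJob() {
        jobFailed = true;
        pending.clear();
    }

    synchronized List<Long> completedCheckpoints() {
        return new ArrayList<>(completed);
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

In this model, a checkpoint acknowledged before {{failJob()}} still completes, while one acknowledged afterwards is rejected. Taking the same lock for the failure decision and for acknowledgements is the design choice that closes the window in which a checkpoint could complete after the job was failed.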


