Posted to issues@flink.apache.org by "Konstantin Knauf (Jira)" <ji...@apache.org> on 2020/01/24 15:48:00 UTC

[jira] [Resolved] (FLINK-15731) Stop while Checkpoint is In-Progress Triggers Job Failover

     [ https://issues.apache.org/jira/browse/FLINK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Knauf resolved FLINK-15731.
--------------------------------------
    Release Note: 
I tried to reproduce the issue today but could not. I am not sure why I was seeing it so consistently on Wednesday. The only explanation I can think of right now is that the job had some unrelated issue that only manifested itself in this situation.

The original theory about the cause was wrong anyway: as it turns out, checkpoints aborted due to suspension are not counted toward the checkpoint failure count.
      Resolution: Cannot Reproduce

> Stop while Checkpoint is In-Progress Triggers Job Failover
> ----------------------------------------------------------
>
>                 Key: FLINK-15731
>                 URL: https://issues.apache.org/jira/browse/FLINK-15731
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1, 1.10.0
>            Reporter: Konstantin Knauf
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Currently, when a job is {{stopped}}, in-progress checkpoints are aborted and a synchronous savepoint is started afterwards.
> Since the number of tolerable checkpoint failures is 0 by default (see {{org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber}}), this triggers a restart of the job if there are any ongoing checkpoints. 
> As a consequence, if there is an ongoing checkpoint (or savepoint), the stop call only triggers a failover of the job instead of stopping it. 
> Possible options I see are: 
> a) change default of tolerable checkpoint failures to at least the max number of concurrent checkpoints
> b) do not count checkpoint failures due to the stop action when checking against tolerable checkpoint failures
> c) do not abort pending checkpoints when stopping a job, but queue the synchronous savepoint after all current in-progress checkpoints
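
For illustration, option b) can be sketched as a small failure counter that skips aborts caused by the stop action. This is a minimal, self-contained model and not Flink's actual implementation: the class name {{TolerableFailureCounter}}, the {{Reason}} enum, and its values are all hypothetical and exist only for this sketch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of a tolerable-failure counter; not Flink's actual code. */
public class TolerableFailureCounter {

    /** Illustrative failure reasons; the names are assumptions made for this sketch. */
    public enum Reason { DECLINED, EXPIRED, ABORTED_BY_STOP }

    // Reasons that never count against the tolerable limit (option b).
    private static final Set<Reason> IGNORED = EnumSet.of(Reason.ABORTED_BY_STOP);

    private final int tolerableFailures;
    private int continuousFailures;

    public TolerableFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failed checkpoint; returns true if the job should fail over. */
    public boolean onFailure(Reason reason) {
        if (IGNORED.contains(reason)) {
            return false; // aborts caused by the stop action are not counted
        }
        continuousFailures++;
        return continuousFailures > tolerableFailures;
    }

    /** A completed checkpoint resets the streak of consecutive failures. */
    public void onSuccess() {
        continuousFailures = 0;
    }

    public static void main(String[] args) {
        // Limit of 0 mirrors the default described in the issue.
        TolerableFailureCounter counter = new TolerableFailureCounter(0);
        System.out.println(counter.onFailure(Reason.ABORTED_BY_STOP)); // false
        System.out.println(counter.onFailure(Reason.EXPIRED));         // true
    }
}
```

With a limit of 0, an abort caused by the stop action leaves the counter untouched, while any other failure still exceeds the limit immediately. According to the resolution comment, Flink already ignores checkpoints aborted due to suspension, which matches the behaviour modelled here.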



--
This message was sent by Atlassian Jira
(v8.3.4#803005)