
Posted to issues@flink.apache.org by "vinoyang (JIRA)" <ji...@apache.org> on 2019/04/09 12:19:00 UTC

[jira] [Commented] (FLINK-12144) Option to prefer checkpoints on recovery

    [ https://issues.apache.org/jira/browse/FLINK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813322#comment-16813322 ] 

vinoyang commented on FLINK-12144:
----------------------------------

[~gyfora] The proposal sounds like the discussion under FLINK-11159, right?

> Option to prefer checkpoints on recovery
> ----------------------------------------
>
>                 Key: FLINK-12144
>                 URL: https://issues.apache.org/jira/browse/FLINK-12144
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>            Reporter: Gyula Fora
>            Priority: Trivial
>
> When a streaming job fails, the getLatestCheckpoint() method of the CheckpointStore is used to determine which checkpoint or savepoint will be used for recovery.
> This behaviour is perfectly fine for jobs with relatively small state or without strong SLAs, but in some cases it can be problematic.
> For jobs with a very large state size, the difference between recovery times from savepoints and checkpoints can be substantial, to the point where it might break a use case. So we would like to avoid ever recovering from a savepoint if a reasonably recent checkpoint is also readily available.
> Right now this cannot be avoided if a job fails after we have taken a savepoint, for example for backup purposes (such savepoints might be scheduled multiple times a day).
> I suggest we add a configuration option that allows the job to fall back to an earlier checkpoint (perhaps within a certain age limit), even if a newer savepoint is available, to avoid lengthy downtimes.
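
For illustration only, the selection rule proposed in the description could be sketched roughly as below. This is not Flink code: the Snapshot class, the selectForRecovery method, and the preferCheckpoint and maxCheckpointAgeMillis parameters are hypothetical names made up for this sketch, and the list is assumed to be ordered from oldest to newest.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of "prefer checkpoints on recovery"; none of these
// types or names are taken from the Flink code base.
final class RecoverySelection {

    /** Simplified stand-in for a completed checkpoint or savepoint. */
    static final class Snapshot {
        final long id;
        final long completedAtMillis;
        final boolean savepoint;

        Snapshot(long id, long completedAtMillis, boolean savepoint) {
            this.id = id;
            this.completedAtMillis = completedAtMillis;
            this.savepoint = savepoint;
        }
    }

    /**
     * Picks the snapshot to restore from. The list is assumed to be ordered
     * from oldest to newest. Without the option, or when the newest entry is
     * already a checkpoint, the newest entry is returned unchanged.
     */
    static Optional<Snapshot> selectForRecovery(
            List<Snapshot> completed,
            boolean preferCheckpoint,
            long maxCheckpointAgeMillis,
            long nowMillis) {
        if (completed.isEmpty()) {
            return Optional.empty();
        }
        Snapshot latest = completed.get(completed.size() - 1);
        if (!preferCheckpoint || !latest.savepoint) {
            return Optional.of(latest);
        }
        // The newest entry is a savepoint: look for the newest checkpoint before it.
        for (int i = completed.size() - 2; i >= 0; i--) {
            Snapshot candidate = completed.get(i);
            if (!candidate.savepoint) {
                long ageMillis = nowMillis - candidate.completedAtMillis;
                // Older checkpoints can only be older still, so decide here.
                return Optional.of(ageMillis <= maxCheckpointAgeMillis ? candidate : latest);
            }
        }
        // No checkpoint available at all: fall back to the savepoint.
        return Optional.of(latest);
    }
}
```

With a checkpoint completed at t=1000 and a savepoint at t=2000, a failure at t=3000 with an age limit of 5000 ms would restore from the checkpoint; with an age limit of 500 ms, or with the option disabled, it would restore from the savepoint as it does today.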



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)