Posted to dev@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2018/11/01 14:32:00 UTC
[jira] [Created] (FLINK-10751) Checkpoints should be retained when job reaches suspended state
Ufuk Celebi created FLINK-10751:
-----------------------------------
Summary: Checkpoints should be retained when job reaches suspended state
Key: FLINK-10751
URL: https://issues.apache.org/jira/browse/FLINK-10751
Project: Flink
Issue Type: Bug
Components: Distributed Coordination
Affects Versions: 1.6.2
Reporter: Ufuk Celebi
Assignee: Ufuk Celebi
{{CheckpointProperties}} define in which terminal job statuses a checkpoint should be disposed of.
I've noticed that the properties for {{CHECKPOINT_NEVER_RETAINED}} and {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the (locally) terminal job status {{SUSPENDED}}.
Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses leadership, this would result in the checkpoint being cleaned up and not being available for recovery by the new leader. Therefore, we should instead retain checkpoints when reaching job status {{SUSPENDED}}.
*BUT:* Because we special-case this terminal state in the only highly available {{CompletedCheckpointStore}} implementation (see [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315]) and don't use regular checkpoint disposal, this issue has not surfaced yet.
I think we should proactively fix the properties so that they retain checkpoints in the {{SUSPENDED}} state. We might even remove this case completely, since with this change all properties will retain checkpoints on suspension.
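
To make the proposed change concrete, here is a minimal, self-contained sketch. It does not use Flink's actual {{CheckpointProperties}} class: {{RetentionSketch}}, {{Props}}, {{discardsOn}} and {{retainOnSuspension}} are hypothetical names chosen for illustration, and the model covers only the {{FAILED}} and {{SUSPENDED}} statuses. The discard-on-{{FAILED}} flags are inferred from the property names and are an assumption.

```java
import java.util.EnumSet;

public class RetentionSketch {

    // Simplified stand-in for the job statuses discussed in this issue.
    enum JobStatus { FAILED, SUSPENDED }

    // Hypothetical model of a checkpoint property: the set of job statuses
    // in which the checkpoint is discarded. Not Flink's real class.
    static final class Props {
        final String name;
        final EnumSet<JobStatus> discardStatuses;

        Props(String name, EnumSet<JobStatus> discardStatuses) {
            this.name = name;
            this.discardStatuses = discardStatuses;
        }

        boolean discardsOn(JobStatus status) {
            return discardStatuses.contains(status);
        }

        // Proposed fix: the same property, but never discarding on SUSPENDED.
        Props retainOnSuspension() {
            EnumSet<JobStatus> fixed = EnumSet.copyOf(discardStatuses);
            fixed.remove(JobStatus.SUSPENDED);
            return new Props(name, fixed);
        }
    }

    public static void main(String[] args) {
        // Behavior as described in the issue: both properties discard on SUSPENDED.
        // The FAILED flags are an assumption derived from the property names.
        Props neverRetained = new Props(
                "CHECKPOINT_NEVER_RETAINED",
                EnumSet.of(JobStatus.FAILED, JobStatus.SUSPENDED));
        Props retainedOnFailure = new Props(
                "CHECKPOINT_RETAINED_ON_FAILURE",
                EnumSet.of(JobStatus.SUSPENDED));

        for (Props p : new Props[] {neverRetained, retainedOnFailure}) {
            Props fixed = p.retainOnSuspension();
            System.out.println(p.name
                    + ": discard on SUSPENDED before fix = " + p.discardsOn(JobStatus.SUSPENDED)
                    + ", after fix = " + fixed.discardsOn(JobStatus.SUSPENDED));
        }
    }
}
```

In this model, once every property retains on suspension, the {{SUSPENDED}} entry carries no information any more, which is why the case could be dropped entirely.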
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)