Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/11/02 11:00:00 UTC
[jira] [Commented] (FLINK-10751) Checkpoints should be retained when job reaches suspended state

    [ https://issues.apache.org/jira/browse/FLINK-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672931#comment-16672931 ] 

ASF GitHub Bot commented on FLINK-10751:
----------------------------------------

uce opened a new pull request #7006: [FLINK-10751] [runtime] Retain checkpoints on suspension
URL: https://github.com/apache/flink/pull/7006
 
 
   ## What is the purpose of the change
   
   Retain checkpoints when the job reaches the terminal status `SUSPENDED`. Note that this does **not actually affect** the current retention behavior, because we special-case this terminal state in `ZooKeeperCompletedCheckpointStore` and don't suspend jobs when running with `StandaloneCompletedCheckpointStore`.
   
   The proposed change is more of a proactive guard to avoid confusion in the future (e.g. if we stop special casing or accidentally use `StandaloneCompletedCheckpointStore` in HA mode).
   
   I'm also OK with closing this PR without merging since it is not clear how the `SUSPENDED` state will evolve in the future. Currently `SUSPENDED` is an "internal" terminal state to which we transition on lost leadership. If we plan to change this in the future (e.g. let users trigger this transition), it might be worthwhile to keep the current behavior.
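
To picture why the property change has no visible effect today, here is a minimal sketch of the special casing described above. It is a simplified, hypothetical model and not the code of `ZooKeeperCompletedCheckpointStore`: the class `HaStyleStore`, the `globallyTerminal` flag, and the `Predicate`-based discard policy are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of an HA-style checkpoint store; names are illustrative.
final class SuspensionSpecialCaseSketch {

    /** Terminal job statuses; SUSPENDED is only locally terminal. */
    enum JobStatus {
        FINISHED(true), CANCELED(true), FAILED(true), SUSPENDED(false);

        final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }
    }

    /** Store that consults the discard policy only for globally terminal statuses. */
    static final class HaStyleStore {
        private final List<Long> checkpointIds = new ArrayList<>();
        private final Predicate<JobStatus> discardOn;

        HaStyleStore(Predicate<JobStatus> discardOn) {
            this.discardOn = discardOn;
        }

        void add(long checkpointId) {
            checkpointIds.add(checkpointId);
        }

        void shutdown(JobStatus status) {
            if (status.globallyTerminal) {
                // Regular disposal path: the policy decides.
                if (discardOn.test(status)) {
                    checkpointIds.clear();
                }
            }
            // SUSPENDED: the policy is never asked, so checkpoints stay
            // available for the next leader regardless of the flag.
        }

        int size() {
            return checkpointIds.size();
        }
    }

    public static void main(String[] args) {
        // A policy that would discard on every status, including SUSPENDED.
        HaStyleStore store = new HaStyleStore(status -> true);
        store.add(1L);

        store.shutdown(JobStatus.SUSPENDED);
        System.out.println(store.size()); // 1

        store.shutdown(JobStatus.FINISHED);
        System.out.println(store.size()); // 0
    }
}
```

In this model the discard policy for `SUSPENDED` is dead configuration: it only starts to matter if the special case is removed or a store without it is used.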
   
   ## Brief change log
   
   - Update `CheckpointProperties` to retain on suspension
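
As a rough illustration of what "retain on suspension" means at the property level, the sketch below models per-status discard flags. It is a simplified stand-in and not Flink's `CheckpointProperties` class: `Policy`, `shouldDiscard`, and the factory methods are hypothetical names, and the flags for `FINISHED`, `CANCELED`, and `FAILED` are assumptions inferred from the property names rather than taken from this PR.

```java
// Simplified, hypothetical model of the retention flags discussed above.
// It is NOT Flink's CheckpointProperties class; all names are illustrative.
final class RetentionSketch {

    enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    /** One discard flag per terminal job status. */
    static final class Policy {
        private final boolean discardFinished;
        private final boolean discardCanceled;
        private final boolean discardFailed;
        private final boolean discardSuspended;

        Policy(boolean discardFinished, boolean discardCanceled,
               boolean discardFailed, boolean discardSuspended) {
            this.discardFinished = discardFinished;
            this.discardCanceled = discardCanceled;
            this.discardFailed = discardFailed;
            this.discardSuspended = discardSuspended;
        }

        boolean shouldDiscard(JobStatus status) {
            switch (status) {
                case FINISHED: return discardFinished;
                case CANCELED: return discardCanceled;
                case FAILED: return discardFailed;
                case SUSPENDED: return discardSuspended;
                default: throw new IllegalArgumentException("Unknown status: " + status);
            }
        }
    }

    // Assumed flags for a "never retained" policy; only the last one changes.
    static Policy neverRetained(boolean discardSuspended) {
        return new Policy(true, true, true, discardSuspended);
    }

    // Assumed flags for a "retained on failure" policy.
    static Policy retainedOnFailure(boolean discardSuspended) {
        return new Policy(true, true, false, discardSuspended);
    }

    public static void main(String[] args) {
        Policy before = neverRetained(true);  // behavior described in the issue
        Policy after = neverRetained(false);  // behavior proposed by this change

        System.out.println(before.shouldDiscard(JobStatus.SUSPENDED)); // true
        System.out.println(after.shouldDiscard(JobStatus.SUSPENDED));  // false
        System.out.println(after.shouldDiscard(JobStatus.FINISHED));   // true
    }
}
```

Only the `SUSPENDED` flag differs between the two variants; disposal on the other terminal statuses is unchanged in this model.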
   
   ## Verifying this change
   
   - This change is already covered by existing HA tests
   - The modified `StandaloneCompletedCheckpointStoreTest` was essentially testing the behavior of an illegal state
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no (see comments above)
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


> Checkpoints should be retained when job reaches suspended state
> ---------------------------------------------------------------
>
>                 Key: FLINK-10751
>                 URL: https://issues.apache.org/jira/browse/FLINK-10751
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>
> {{CheckpointProperties}} define in which terminal job status a checkpoint should be disposed.
> I've noticed that the properties for {{CHECKPOINT_NEVER_RETAINED}} and {{CHECKPOINT_RETAINED_ON_FAILURE}} prescribe checkpoint disposal in the (locally) terminal job status {{SUSPENDED}}.
> Since a job reaches the {{SUSPENDED}} state when its {{JobMaster}} loses leadership, this would result in the checkpoint being cleaned up and not being available for recovery by the new leader. Therefore, we should rather retain checkpoints when reaching job status {{SUSPENDED}}.
> *BUT:* Because we special case this terminal state in the only highly available {{CompletedCheckpointStore}} implementation (see [ZooKeeperCompletedCheckpointStore|https://github.com/apache/flink/blob/e7ac3ba/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L315]) and don't use regular checkpoint disposal, this issue has not surfaced yet.
> I think we should proactively fix the properties so that they indicate retaining checkpoints in the {{SUSPENDED}} state. We might actually remove this case completely, since with this change all properties will indicate retention on suspension.


