Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/07 08:27:41 UTC

[GitHub] [flink] curcur commented on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

curcur commented on PR #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1091302016

I am mostly fine with the changes. Besides the comments left above, my main concern is the introduction of `RestoreMode DEFAULT = NO_CLAIM`. Let me explain:

1. I am fine with the changes along the path "CheckpointCoordinator#restoreSavepoint", including all the interface changes, etc.
2. I am not fine with `RestoreMode DEFAULT = NO_CLAIM` being passed into `EmbeddedCompletedCheckpointStore`, `StandaloneCompletedCheckpointStore`, `CheckpointResourcesCleanupRunner`, and `EmbeddedHaServicesWithLeadershipControl`. Those code paths should follow the normal failover behavior, which assumes that all checkpoints belong to the same job and that unreferenced ones are subsumed as normal. I do not think NO_CLAIM behaves this way.
So I would suggest we introduce a fourth, internal value to indicate the normal JM failover behavior.
3. @Myasuka made a good point that item 2 by itself is not enough: the JM may crash again before the first checkpoint after recovery can be completed, so item 2 may lose the information about whether the previous checkpoint came from a previous job or from a previous run.
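
To make item 2 concrete, here is a rough sketch of what a fourth, internal value could look like. All names here (`RestoreModeSketch`, `JM_FAILOVER`, `RestorePolicySketch`, `mayDiscardRestoredCheckpoint`) are hypothetical and not part of Flink, and the discard rules are simplified assumptions about the user-facing modes rather than the actual implementation.

```java
// Hypothetical sketch only; none of these names exist in Flink.
enum RestoreModeSketch {
    CLAIM,       // user-facing: Flink takes ownership of the restored snapshot
    NO_CLAIM,    // user-facing: Flink must not delete the restored snapshot
    LEGACY,      // user-facing: Flink does not delete the initial snapshot
    JM_FAILOVER  // assumed internal value: same job recovering after a JM failover
}

final class RestorePolicySketch {
    private RestorePolicySketch() {}

    /**
     * Returns true if the restored checkpoint may be discarded once it is
     * subsumed. Simplified assumption: only modes in which the job owns the
     * restored checkpoint allow this.
     */
    static boolean mayDiscardRestoredCheckpoint(RestoreModeSketch mode) {
        switch (mode) {
            case CLAIM:
            case JM_FAILOVER:
                // The checkpoint belongs to this job, so it is subsumed as normal.
                return true;
            case NO_CLAIM:
            case LEGACY:
                // The checkpoint may belong to another job or to the user.
                return false;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

With something like this, the checkpoint stores on the failover path would receive `JM_FAILOVER` instead of `NO_CLAIM`. It does not address item 3, since the mode would still have to be persisted somewhere to survive a second JM crash.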

We can also discuss offline.
