You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by "Roman Khachatryan (Jira)" <ji...@apache.org> on 2022/04/07 09:37:00 UTC

[jira] [Created] (FLINK-27114) On JM restart, the information about the initial checkpoints can be lost

Roman Khachatryan created FLINK-27114:
-----------------------------------------

             Summary: On JM restart, the information about the initial checkpoints can be lost
                 Key: FLINK-27114
                 URL: https://issues.apache.org/jira/browse/FLINK-27114
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.14.4, 1.15.0, 1.16.0
            Reporter: Roman Khachatryan
             Fix For: 1.16.0, 1.14.5, 1.15.1


Scenario (1.14):
 # A job starts from an existing checkpoint 1, with incremental checkpoints enabled
 # Checkpoint 1 is loaded with discardOnSubsume=false by CheckpointCoordinator.restoreSavepoint
 # A new checkpoint 2 completes, it reuses some state from the initial checkpoint
 # At some point, checkpoint 1 is subsumed, but the state is not discarded (thanks to discardOnSubsume=false, ref counts stay 1)
 # JM crashes
 # JM restarts, loads the checkpoints 2..x from ZK (or other store) -   discardOnSubsume=true (as deserialized from handles)
 # At some point, checkpoint 2 is subsumed and the initial shared state is not used anymore; because checkpoint 2 has discardOnSubsume=true, shared state will be erroneously discarded

In 1.15, there were the following changes:
 # RestoreMode was added; only NO_CLAIM and LEGACY  modes are affected
 # SharedStateRegistry was changed from refCounts to highest checkpoint ID
 # In step (7), state will not be discarded; however, because it's impossible to distinguish initial state from the state created by this job, the latter will not be discarded as well, leading to left-over state artifacts.

The proposed solution is to store the initial checkpoint ID (in store such as ZK or in checkpoints) and adjust steps 6 or 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)