Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/01 14:46:35 UTC

[GitHub] [flink] rkhachatryan opened a new pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

rkhachatryan opened a new pull request #19331:
URL: https://github.com/apache/flink/pull/19331


   ## What is the purpose of the change
   
   As described in the ticket, in LEGACY restore mode,
   shared state of incremental checkpoints can be discarded
   regardless of whether they were created by this job or not.
   
   The bug was introduced in FLINK-24611. Before that change, a reference count was maintained
   for each entry;
   "initial" checkpoints did not decrement this count, which prevented their shared state from being discarded.
   
   This PR makes `SharedStateRegistry`:
   1. remember the max checkpoint ID encountered during recovery
   2. associate each state entry with the ID of the checkpoint that created it
   3. discard an entry only if its `createdByCheckpointID > highestRetainCheckpointID`
   
   (1) is called from:
   - `CheckpointCoordinator.restoreSavepoint` - to cover initial restore from a checkpoint
   - `SharedStateRegistryFactory`, when building checkpoint store - to cover the failover case
   
   Calling it only from `CheckpointCoordinator` does not seem sufficient, because a new checkpoint
   can be created from which the job later recovers automatically, without calling `restoreSavepoint`.
   
   (see `DefaultExecutionGraphFactory.createAndRestoreExecutionGraph`)
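
The three steps listed above can be illustrated with a minimal sketch. It is not the code of this PR: the class `RegistrySketch`, its method names, and the string keys are hypothetical, and only the field names `highestRetainCheckpointID` and `createdByCheckpointID` are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a shared state registry; not Flink's actual code. */
final class RegistrySketch {
    /** Checkpoint ID at or below which no shared state is discarded; -1 means no bound. */
    private long highestRetainCheckpointID = -1L;

    /** Maps a state key to the ID of the checkpoint that created the entry. */
    private final Map<String, Long> createdByCheckpointID = new HashMap<>();

    /** Step 1: remember the max checkpoint ID encountered during recovery. */
    void onRestored(long restoredCheckpointID) {
        highestRetainCheckpointID = Math.max(highestRetainCheckpointID, restoredCheckpointID);
    }

    /** Step 2: associate the entry with the checkpoint that first registered it. */
    void register(String key, long checkpointID) {
        createdByCheckpointID.putIfAbsent(key, checkpointID);
    }

    /** Step 3: report an entry as discardable only if it was created after the bound. */
    boolean unregisterAndMaybeDiscard(String key) {
        Long createdBy = createdByCheckpointID.remove(key);
        return createdBy != null && createdBy > highestRetainCheckpointID;
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        registry.register("sst-1", 5L); // shared file that belongs to restored checkpoint 5
        registry.onRestored(5L);
        registry.register("sst-2", 6L); // shared file created by this job's checkpoint 6
        System.out.println(registry.unregisterAndMaybeDiscard("sst-1")); // false: kept
        System.out.println(registry.unregisterAndMaybeDiscard("sst-2")); // true: discarded
    }
}
```

In this sketch the comparison is strict, so state created by the restored checkpoint itself is retained, which matches the "inclusive" wording of the field's Javadoc in the diff.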
   
   ## Verifying this change
   
   `ResumeCheckpointManuallyITCase` in `LEGACY` restore mode
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rkhachatryan commented on a change in pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #19331:
URL: https://github.com/apache/flink/pull/19331#discussion_r841065426



##########
File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ResumeCheckpointManuallyITCase.java
##########
@@ -269,17 +317,19 @@ private void testExternalizedCheckpoints(
         try {
             // main test sequence:  start job -> eCP -> restore job -> eCP -> restore job
             String firstExternalCheckpoint =
-                    runJobAndGetExternalizedCheckpoint(backend, checkpointDir, null, client);
+                    runJobAndGetExternalizedCheckpoint(
+                            backend, checkpointDir, null, client, restoreMode);
             assertNotNull(firstExternalCheckpoint);
 
             String secondExternalCheckpoint =
                     runJobAndGetExternalizedCheckpoint(
-                            backend, checkpointDir, firstExternalCheckpoint, client);
+                            backend, checkpointDir, firstExternalCheckpoint, client, restoreMode);
             assertNotNull(secondExternalCheckpoint);
 
             String thirdExternalCheckpoint =
                     runJobAndGetExternalizedCheckpoint(
-                            backend, checkpointDir, secondExternalCheckpoint, client);
+                            // restore from the 1st external checkpoint path
+                            backend, checkpointDir, firstExternalCheckpoint, client, restoreMode);

Review comment:
       You're right, I'll revert this change.







[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2936ad45e82d77b79ce8d9a7f83d8ae972da5d49 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34132) 
   * 3d8a33dad195ea7303270333d05f5449a5ea71bf Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34171) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2936ad45e82d77b79ce8d9a7f83d8ae972da5d49 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34132) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2c5d446d162d8f616820fe3b4fdf0cdb1eada5bf Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34121) 
   * 2936ad45e82d77b79ce8d9a7f83d8ae972da5d49 UNKNOWN
   



[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 3d8a33dad195ea7303270333d05f5449a5ea71bf Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34171) 
   



[GitHub] [flink] Myasuka commented on a change in pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
Myasuka commented on a change in pull request #19331:
URL: https://github.com/apache/flink/pull/19331#discussion_r841003426



##########
File path: flink-tests/src/test/java/org/apache/flink/test/checkpointing/ResumeCheckpointManuallyITCase.java
##########
@@ -269,17 +317,19 @@ private void testExternalizedCheckpoints(
         try {
             // main test sequence:  start job -> eCP -> restore job -> eCP -> restore job
             String firstExternalCheckpoint =
-                    runJobAndGetExternalizedCheckpoint(backend, checkpointDir, null, client);
+                    runJobAndGetExternalizedCheckpoint(
+                            backend, checkpointDir, null, client, restoreMode);
             assertNotNull(firstExternalCheckpoint);
 
             String secondExternalCheckpoint =
                     runJobAndGetExternalizedCheckpoint(
-                            backend, checkpointDir, firstExternalCheckpoint, client);
+                            backend, checkpointDir, firstExternalCheckpoint, client, restoreMode);
             assertNotNull(secondExternalCheckpoint);
 
             String thirdExternalCheckpoint =
                     runJobAndGetExternalizedCheckpoint(
-                            backend, checkpointDir, secondExternalCheckpoint, client);
+                            // restore from the 1st external checkpoint path
+                            backend, checkpointDir, firstExternalCheckpoint, client, restoreMode);

Review comment:
       The previous test actually followed these steps:
   create 1st checkpoint -> restore from 1st checkpoint and then create 2nd checkpoint -> restore from 2nd and then create the 3rd one.
   
   However, this PR would change the original purpose of the test, as the 3rd run would restore from the 1st checkpoint.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java
##########
@@ -51,6 +53,9 @@
     /** Executor for async state deletion */
     private final Executor asyncDisposalExecutor;
 
+    /** Checkpoint ID below which no state is discarded, inclusive. */
+    private long highestRetainCheckpointID = -1L;

Review comment:
       I think `highestRetainCheckpointID` might not be a good choice, as we could still retain multiple checkpoints even in `CLAIM` restore mode. However, the current implementation would still keep `highestRetainCheckpointID` at `-1` in `CLAIM` restore mode.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointRecoveryFactory.java
##########
@@ -37,13 +38,15 @@
      * @param sharedStateRegistryFactory Simple factory to produce {@link SharedStateRegistry}
      *     objects.
      * @param ioExecutor Executor used to run (async) deletes.
+     * @param restoreMode the job in which the job is restoring

Review comment:
       ```suggestion
        * @param restoreMode the restore mode with which the job is restoring.
   ```

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java
##########
@@ -251,13 +267,16 @@ public void run() {
         /** The shared state handle */
         StreamStateHandle stateHandle;
 
+        private final long createdByCheckpointID;

Review comment:
       This field is not included in the `#toString()` method

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java
##########
@@ -174,6 +181,15 @@ public void registerAll(
         }
     }
 
+    @Override
+    public void registerAllAfterRestored(CompletedCheckpoint checkpoint, RestoreMode mode) {
+        registerAll(checkpoint.getOperatorStates().values(), checkpoint.getCheckpointID());
+        if (mode != RestoreMode.CLAIM) {

Review comment:
       Why is `CLAIM` the only mode that does not need to update `highestRetainCheckpointID`? I think this deserves a description.
   







[GitHub] [flink] rkhachatryan commented on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1086620964


   Thanks for the review @Myasuka ,
   
   > I think current solution looks a bit friable as it involves many changes to different APIs.
   
   I'm not sure what exactly you mean. Indeed, there are changes to multiple interfaces, mostly because of the need to pass `RestoreMode` from `CheckpointRecoveryFactory` to `SharedStateRegistryFactory`. However, I think this is unavoidable. Or do you see any alternative?





[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2936ad45e82d77b79ce8d9a7f83d8ae972da5d49 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34132) 
   * 3d8a33dad195ea7303270333d05f5449a5ea71bf UNKNOWN
   



[GitHub] [flink] flinkbot commented on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2c5d446d162d8f616820fe3b4fdf0cdb1eada5bf UNKNOWN
   



[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2c5d446d162d8f616820fe3b4fdf0cdb1eada5bf Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34121) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19331:
URL: https://github.com/apache/flink/pull/19331#issuecomment-1085999920


   ## CI report:
   
   * 2c5d446d162d8f616820fe3b4fdf0cdb1eada5bf Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34121) 
   * 2936ad45e82d77b79ce8d9a7f83d8ae972da5d49 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34132) 
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #19331:
URL: https://github.com/apache/flink/pull/19331#discussion_r841065119



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java
##########
@@ -51,6 +53,9 @@
     /** Executor for async state deletion */
     private final Executor asyncDisposalExecutor;
 
+    /** Checkpoint ID below which no state is discarded, inclusive. */
+    private long highestRetainCheckpointID = -1L;

Review comment:
       Could you explain which case you mean and what the problem would be?
   
   With `CLAIM` mode, the goal is to discard the state, including the retained checkpoints. So the field `highestRetainCheckpointID` must stay at `-1`, which allows this (the condition in the state entry will be met).
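
A minimal sketch of that behavior, under stated assumptions: the class `RestoreModeSketch` and the method `mayDiscard` are hypothetical, the enum lists only the two restore modes named in this thread, and the signature of `registerAllAfterRestored` is simplified to take a checkpoint ID instead of a `CompletedCheckpoint`.

```java
/** Hypothetical sketch of the restore-mode check discussed above; not Flink's actual code. */
final class RestoreModeSketch {
    /** Only the modes named in this thread; a stand-in for Flink's RestoreMode. */
    enum RestoreMode { CLAIM, LEGACY }

    /** Stays at -1 in CLAIM mode, so every entry satisfies the discard condition. */
    private long highestRetainCheckpointID = -1L;

    /** Simplified: takes the restored checkpoint's ID rather than the checkpoint object. */
    void registerAllAfterRestored(long checkpointID, RestoreMode mode) {
        if (mode != RestoreMode.CLAIM) {
            // The job does not own the restored checkpoint, so its state must be kept.
            highestRetainCheckpointID = Math.max(highestRetainCheckpointID, checkpointID);
        }
    }

    /** The condition checked per state entry before its shared state is discarded. */
    boolean mayDiscard(long createdByCheckpointID) {
        return createdByCheckpointID > highestRetainCheckpointID;
    }

    public static void main(String[] args) {
        RestoreModeSketch claim = new RestoreModeSketch();
        claim.registerAllAfterRestored(7L, RestoreMode.CLAIM);
        System.out.println(claim.mayDiscard(7L)); // true: 7 > -1

        RestoreModeSketch legacy = new RestoreModeSketch();
        legacy.registerAllAfterRestored(7L, RestoreMode.LEGACY);
        System.out.println(legacy.mayDiscard(7L)); // false: 7 is not > 7
        System.out.println(legacy.mayDiscard(8L)); // true: created after the restore
    }
}
```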







[GitHub] [flink] rkhachatryan commented on a change in pull request #19331: [FLINK-26985][runtime] Don't discard shared state of restored checkpoints

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #19331:
URL: https://github.com/apache/flink/pull/19331#discussion_r841064655



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/SharedStateRegistryImpl.java
##########
@@ -174,6 +181,15 @@ public void registerAll(
         }
     }
 
+    @Override
+    public void registerAllAfterRestored(CompletedCheckpoint checkpoint, RestoreMode mode) {
+        registerAll(checkpoint.getOperatorStates().values(), checkpoint.getCheckpointID());
+        if (mode != RestoreMode.CLAIM) {

Review comment:
       In `CLAIM` mode, the state of the checkpoints that the job is recovering from can, and in fact must, be deleted.
   I'll add this comment.



