You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/01/05 13:50:38 UTC

[GitHub] [flink] Myasuka opened a new pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Myasuka opened a new pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768
 
 
   
   ## What is the purpose of the change
   
   Even we enable `ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION`, current job level checkpoint directory and its sub-directory `shared` and `taskowned` would still left on job cancellation.
   Introduce shutdown to `CheckpointStorageCoordinatorView` to clean up these checkpoint directories which created by itself.
   
   ## Brief change log
   
     - Introduce new inerfaces of`setCheckpointProperties` and `shutdown` to `CheckpointStorageCoordinatorView`.
     - Make some getters in `CheckpointProperties` as public.
     - Add documentation to warn users not to set savepoint target directory as the same as the checkpoint directory.
   
   ## Verifying this change
   This change added tests and can be verified as follows:
   
     - Added `testShutDownStorageOnNeverRetain`, `testShutDownStorageOnRetainWhenCanceled` and `testShutDownStorageOnRetainWhenFailed` to `AbstractFileCheckpointStorageTestBase`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **yes**
     - If yes, how is the feature documented? docs

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   * 3563e73329fd94305582c244a24d5e8b66293628 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386015007
 
 

 ##########
 File path: flink-runtime/src/test/java/org/apache/flink/runtime/state/filesystem/AbstractFileCheckpointStorageTestBase.java
 ##########
 @@ -229,6 +231,41 @@ public void writeToAlreadyExistingCheckpointFails() throws Exception {
 		catch (IOException ignored) {}
 	}
 
+	@Test
+	public void testShutDownStorageOnNeverRetain() throws IOException {
+		verifyStorageShutDownAsExpected(CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION, JobStatus.CANCELED, false);
+
+		verifyStorageShutDownAsExpected(CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION, JobStatus.FINISHED, false);
+
+		verifyStorageShutDownAsExpected(CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION, JobStatus.FAILED, false);
 
 Review comment:
   Should also cover `JobStatus.SUSPENDED`. The same for tests on other policies.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "PENDING",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "613198860",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "613198860",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [CANCELED](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "PENDING",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   * 3563e73329fd94305582c244a24d5e8b66293628 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/151256469) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "613198860",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "613198860",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [SUCCESS](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [CANCELED](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-616186860
 
 
   @StefanRRichter Thanks for your review and suggestions!
   
   The reason why I still want to introduce the option to let checkpoint directories could be deleted recursively is observed from actual job-running results. We cannot ensure all the files in the checkpoint directory has been removed before we just to remove the directory, FLINK-17073 describes a user case in which those older checkpoints have not been subsumed in time.
   
   I agree to move the configuration into checkpoint storage scope, and I prefer to name that as `state.checkpoints.cleanup.recursive` as checkpoint directories are not only used in fs state backend.
   
   I'll update this PR to make the IO operation asynchronously recently, and thanks for your review again!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-594419275
 
 
   I am not sure whether we should make it the responsibility of the checkpoint storage to understand when checkpoints can be deleted and when not. Meaning, evaluating the combination of JobStatus and CheckpointProperties. I would think that this is responsibility of the checkpoint coordinator, and the storage simply removes directories when told so. Otherwise, each storage would either implement the same logic (duplication) or have leftover data in different scenarios (inconsistency).
   
   *Note:* An indicator of that is that the visibility of `CheckpointProperties` needs to be increased now. This shows that some concept that was only intended for coordinator-internal logic now leaks into other parts. Sometimes, increasing visibility is necessary, but it is always a good indicator that now certain responsibilities/concerns are not separated any more as much as the were before.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386172856
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java
 ##########
 @@ -124,5 +117,5 @@ CheckpointStorageLocation initializeLocationForSavepoint(
 	 *
 	 * @throws IOException Thrown if the storage cannot be shut down well due to an I/O exception.
 	 */
-	void shutDown(JobStatus jobStatus) throws IOException;
+	void shutDown(JobStatus jobStatus, CheckpointProperties checkpointProperties) throws IOException;
 
 Review comment:
   Please add javadoc for the newly added parameter.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386012840
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStorage.java
 ##########
 @@ -134,6 +149,37 @@ public CheckpointStorageLocation initializeLocationForCheckpoint(long checkpoint
 				writeBufferSize);
 	}
 
+	@Override
+	public void shutDown(JobStatus jobStatus) throws IOException {
 
 Review comment:
   The logic to check whether checkpoint folders should be discarded are the same for Fs and MemoryBackend checkpoint storage, while the difference lies in what folders to discard. So we could extract them out into `AbstractFsCheckpointStorage` and introduce a new `doDiscard` abstract method for implementations to override.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-594733676
 
 
   @StephanEwen , I think it could be okay to delete directories with `JobStatus` and `CheckpointProperties` as current implementations of `checkpoint store` and `checkpoint id counter` all have their own responsibility to decide whether to discard checkpoints. In my point of view, `checkpoint storage` should have similar responsibility ownership compared with `checkpoint store`. I prefer each storage could have their own implementation to discard its directories to offer agility.
   
   When talking about cleanup without recursive. I think your assumptions might not happen in real environment as previous job would always have a different `job-id` sub-directory which is definitely not the same as current job. We would only at most clean up the parent directory at `job-id` level and I think we would not cleanup other or previous jobs.
   Moreover, for HDFS (or non-object storage), the directory could only be discarded if empty. However, we often come across problem that checkpoint coordinator deletes files not so fast which lead the sub-directory is not exactly empty. In the end, we cannot ensure the directory could be cleaned up as expected.
   To make cleanup more safer, we could add documentation to not set savepoint target location under the `job-id` sub-directory. What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570914659
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 (Sun Jan 05 13:53:50 UTC 2020)
   
    ✅no warnings
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/143142694 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   -->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386013559
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointProperties.java
 ##########
 @@ -103,7 +103,7 @@ boolean forceCheckpoint() {
 	 *
 	 * @see CompletedCheckpointStore
 	 */
-	boolean discardOnSubsumed() {
+	public boolean discardOnSubsumed() {
 
 Review comment:
   This change is unnecessary.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "PENDING",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3563e73329fd94305582c244a24d5e8b66293628 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151256469) 
   * 31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/151482641) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   -->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151482641) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-592467764
 
 
   @StephanEwen @pnowojski Can some some one take a review? thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386013013
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStorage.java
 ##########
 @@ -134,6 +149,37 @@ public CheckpointStorageLocation initializeLocationForCheckpoint(long checkpoint
 				writeBufferSize);
 	}
 
+	@Override
+	public void shutDown(JobStatus jobStatus) throws IOException {
+		if (checkpointProperties != null) {
+			boolean discardCheckpointFolders = false;
+			switch (jobStatus) {
+				case FAILED:
+					discardCheckpointFolders = checkpointProperties.discardOnJobFailed();
+					break;
+				case CANCELED:
+					discardCheckpointFolders = checkpointProperties.discardOnJobCancelled();
+					break;
+				case FINISHED:
+					discardCheckpointFolders = checkpointProperties.discardOnJobFinished();
+					break;
+				case SUSPENDED:
+					discardCheckpointFolders = checkpointProperties.discardOnJobSuspended();
+					break;
+			}
+			if (discardCheckpointFolders) {
+				fileSystem.delete(taskOwnedStateDirectory, true);
+				fileSystem.delete(sharedStateDirectory, true);
+				fileSystem.delete(checkpointsDirectory, true);
+			}
 
 Review comment:
   ```suggestion
   		if (jobStatus == JobStatus.FINISHED && checkpointProperties.discardOnJobFinished() ||
   			jobStatus == JobStatus.CANCELED && checkpointProperties.discardOnJobCancelled() ||
   			jobStatus == JobStatus.FAILED && checkpointProperties.discardOnJobFailed() ||
   			jobStatus == JobStatus.SUSPENDED && checkpointProperties.discardOnJobSuspended()) {
   
   			doDiscard();
   		}
   ```
   Simplify the logic referring to `CompletedCheckpoint#discardOnShutdown`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [CANCELED](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [CANCELED](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-594419275
 
 
   I am not sure whether we should make it the responsibility of the checkpoint storage to understand when checkpoints can be deleted and when not. Meaning, evaluating the combination of JobStatus and CheckpointProperties. I would think that this is responsibility of the checkpoint coordinator, and the storage simply removes directories when told so. Otherwise, each storage would either implement the same logic (duplication) or have leftover data in different scenarios (inconsistency).
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-612724984
 
 
   @flinkbot run azure

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386012560
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java
 ##########
 @@ -103,4 +106,23 @@
 	CheckpointStorageLocation initializeLocationForSavepoint(
 		long checkpointId,
 		@Nullable String externalLocationPointer) throws IOException;
+
+	/**
+	 * Sets the checkpoint properties which used when shutting down the storage to the checkpoint storage.
+	 *
+	 * @param checkpointProperties The checkpoint properties to forward.
+	 */
+	void setCheckpointProperties(CheckpointProperties checkpointProperties);
 
 Review comment:
   As mentioned, we don't need this interface.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-593119085
 
 
   @carp84 Really thanks for your kind review!
   I have accepted most your comments. After I think for a while, I decided to not discard checkpoint directories on `SUSPENDED` job status which is only local terminal status. We could refer to [job-status doc](https://ci.apache.org/projects/flink/flink-docs-stable/internals/job_scheduling.html#jobmanager-data-structures) to know that:
   
   > Consequently, a job which reaches the suspended state won’t be completely cleaned up.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [CANCELED](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-595176653
 
 
   > @StephanEwen , I think it could be okay to delete directories with JobStatus and CheckpointProperties as current implementations of checkpoint store and checkpoint id counter all have their own responsibility to decide whether to discard checkpoints. In my point of view, checkpoint storage should have similar responsibility ownership compared with checkpoint store. I prefer each storage could have their own implementation to discard its directories to offer agility.
   
   I don't agree here. This really sounds to me like both making abstractions harder for the future and inconsistencies waiting to happen.
   
   Can there be a case in which different storages behave differently? Can it be that a storage would retain data when "Job=Canceled" and "Properties=NeverRetain", but delete them when "Job=Canceled" and "Properties=RetainOnFailure"?
   
   About the "checkpoint id counter": That one has a pretty clear life cycle: exactly that of the job, no ZK data should be left when the job finishes/cancels/fails. No retention policies or so. The JobStatus there is only to differentiate between shutdown-without-delete (suspend = lost leadership) and shutdown-with-delete. Arguable, that should actually be a boolean flag, not passing the JobStatus.
   
   > When talking about cleanup without recursive. I think your assumptions might not happen in real environment as previous job would always have a different job-id sub-directory which is definitely not the same as current job. We would only at most clean up the parent directory at job-id level and I think we would not cleanup other or previous jobs.
   
   That sounds exactly right, we should not try and clean up data from a previous job. That risks data loss in certain setups, when users configure the setup in an "unconventional way".
   
   Moreover, for HDFS (or non-object storage), the directory could only be discarded if empty. However, we often come across problem that checkpoint coordinator deletes files not so fast which lead the sub-directory is not exactly empty. In the end, we cannot ensure the directory could be cleaned up as expected.
   To make cleanup more safer, we could add documentation to not set savepoint target location under the job-id sub-directory. What do you think?
   
   I am not quite convinced here as well. This sounds again like risking data loss when users make a setup error. "Fixing by documentation" is often not preventing these things.
   What about "incremental savepoints" in the future? They would leave data in the "shared" directory.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386012585
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/CheckpointStorageCoordinatorView.java
 ##########
 @@ -103,4 +106,23 @@
 	CheckpointStorageLocation initializeLocationForSavepoint(
 		long checkpointId,
 		@Nullable String externalLocationPointer) throws IOException;
+
+	/**
+	 * Sets the checkpoint properties which used when shutting down the storage to the checkpoint storage.
+	 *
+	 * @param checkpointProperties The checkpoint properties to forward.
+	 */
+	void setCheckpointProperties(CheckpointProperties checkpointProperties);
+
+	/**
+	 * Shuts down the checkpoint storage on the coordinator side with given job status.
+	 *
+	 * <p>The job status is forwarded and used to decide whether checkpoint directory should
+	 * actually be discarded or kept.
+	 *
+	 * @param jobStatus The job status on shut down.
+	 *
+	 * @throws IOException Thrown if the storage cannot be shut down well due to an I/O exception.
+	 */
+	void shutDown(JobStatus jobStatus) throws IOException;
 
 Review comment:
   ```suggestion
   	void shutDown(JobStatus jobStatus, CheckpointProperties checkpointProperties) throws IOException;
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3563e73329fd94305582c244a24d5e8b66293628 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151256469) 
   * 31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-613198860
 
 
   @flinkbot run travis
   
   @flinkbot run azure

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "PENDING",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151482641) 
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-614768345
 
 
   Thanks for making another pass over this.
   
   Making the cleanup behavior configurable is a compromise, I guess. If many others in the community find this a valuable feature, I am okay with having this, even though I am personally a bit skeptical about features that can lead to "tricky surprises".
   
   What I would push for is having very clear docs for this, as part of this PR, so we can be sure we don't forget about them:
     - A section on the checkpointing docs describing the cleanup behavior, with a big warning that recursive cleanup can lead to problems if
       - you have savepoints in the same directory as checkpoints
       - you start a new job with retained checkpoints of another job
     - The description for the config option should link to the documentation.
   
   
   There are a few code changes I would suggest:
   
   (1) Whether cleanup is recursive or now is a feature specific to the file checkpoint storage. The `CheckpointCoordinator` is oblivious to the existence of directories.
   
   The division of responsibilities should be as follows:
     - The `CheckpointCoordinator` descides whether to clean up (compare JobStatus with CheckpointProperties) and calls `storage.shutdownAndCleanup()` if necessary.
     - The `AbstractFsCheckpointStorage` has the notion what that cleanup means, meaning it has the `isCleanupRecursievFlag`.
     - The config option should also be scoped to FS storage, not to checkpoints in general. Something like _'state.backend.fs.cleanup.recursive'_.
   
   (2) The cleanup is an I/O operation and one that can take very long, so we cannot run it in the main RPC threads.
     - Especially recursive cleanup can take long, because for many FileSystems/ObjectStores this does effectively mean enumerating all files/objects under the directory/prefix and issuing a separate delete call. Those can be 100s or 1000s, taking many seconds to even minutes.
     - We could need to run somewhere else, like the I/O executor. Passing an I/O executor to the "shutdown" methods would be a start.
     - We then have the issue that shutdown does not wait for the operation to be finished, so the Checkpoint Coordinator becomes a component that shuts down asynchronously (shutdown method returns a `CompletableFuture<Void>` that completes when shutdown is complete). That needs to be integrated with `JobMaster` shutdown.
     - Alternatively, we can try and move the shutdown of the CheckpointStorageLocation to the shutdown of the `JobMaster` which already supports asynchronous shutdown.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "31d15b255ff5bb329c16e1f1afac36b5ff06d4d8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "PENDING",
       "url" : "https://travis-ci.com/github/flink-ci/flink/builds/159982673",
       "triggerID" : "613198860",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/159982673) Azure: [CANCELED](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=7372) 
   * 31d15b255ff5bb329c16e1f1afac36b5ff06d4d8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-593749284
 
 
   @carp84 , after reading [FLINK-5007](https://issues.apache.org/jira/browse/FLINK-5007) which introduce `discardOnSuspended`, I think we would not allow to discard checkpoint for HA case and could discard them on suspened for non-HA cases. However, current checkpoint storage only know whether it could support high available but not know whether it serves high availability now.
   
   To not discard checkpoints by mistake, I will not let checkpoint storage to discard on job suspended, and I think we would introduce HA related information to checkpoint storage in another story to handle this situation better.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/143142694 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   -->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   Hash:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/143142694 TriggerType:PUSH TriggerID:fdd072ffecb7136e6adbc48b7e85ebe8079c0e70
   -->
   ## CI report:
   
   * fdd072ffecb7136e6adbc48b7e85ebe8079c0e70 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/143142694) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-613216066
 
 
   It seems the flinkbot command triggers build against old commit 2139ea4... Let's issue no more command to see whether the build against latest commit 31d15b2 could be scheduled successfully.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-594422896
 
 
   Should we make the directory cleanup **NON recursive**? That might be the safer option.
   
   Given that we only want the parent/container directories removed, non-recursive delete should be fine.
   
   If something is still in those directories, it may be unsafe to delete them. Maybe a retained checkpoint from a previous job? Maybe an incremental safepoint in the future? It may be dangerous to assume that at shutdown time, all data in these directories can be deleted.
   If we see left-over data because checkpoints failed to clean up properly, then the shutdown part may not be the right place to clean this up, because it has too little information to know if this truly safe to be deleted. For that we need some "cleanup safety nets" that understand which shared files are still referenced, which checkpoints have been discarded (up to which point).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
Myasuka commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-612656015
 
 
   Refactor this PR with two commits:
   * Introduce `shutdown` to CheckpointStorageCoordinatorView to clean up checkpoint directory.
   * Introduce cleanup-storage-directories-recursively option to CheckpointConfig. 
   We would not delete files recursively be default. However, as we might often meet "not empty" exception due to orphan or not-ready-to-discard files, we should provide option for users to remove directories recursively.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
carp84 commented on a change in pull request #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#discussion_r386012480
 
 

 ##########
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ##########
 @@ -277,6 +277,7 @@ public CheckpointCoordinator(
 
 		try {
 			this.checkpointStorage = checkpointStateBackend.createCheckpointStorage(job);
+			checkpointStorage.setCheckpointProperties(checkpointProperties);
 
 Review comment:
   `CheckpointStorage` is used on both JM and TM side, and the `checkpointProperties` is only required by `CheckpointCoordinator`, so it should not be a property in `CheckpointStorage`. Instead, we should pass it in as a parameter in `CheckpointStorageCoordinatorView#shutDown`.
   
   Accordingly, we don't need to introduce the setter interface in `CheckpointStorageCoordinatorView` any more.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     }, {
       "hash" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151482641",
       "triggerID" : "31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2139ea47ee3dac096c47b36efdf077b15e621a19",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 31ba52e04ee5e84ee6c50ebd5c8b972bf326dd07 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151482641) 
   * 2139ea47ee3dac096c47b36efdf077b15e621a19 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-570915760
 
 
   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4101",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "status" : "DELETED",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/143142694",
       "triggerID" : "fdd072ffecb7136e6adbc48b7e85ebe8079c0e70",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3563e73329fd94305582c244a24d5e8b66293628",
       "status" : "SUCCESS",
       "url" : "https://travis-ci.com/flink-ci/flink/builds/151256469",
       "triggerID" : "3563e73329fd94305582c244a24d5e8b66293628",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3563e73329fd94305582c244a24d5e8b66293628 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/151256469) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] StephanEwen edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.

Posted by GitBox <gi...@apache.org>.
StephanEwen edited a comment on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-595176653
 
 
   > @StephanEwen , I think it could be okay to delete directories with JobStatus and CheckpointProperties as current implementations of checkpoint store and checkpoint id counter all have their own responsibility to decide whether to discard checkpoints. In my point of view, checkpoint storage should have similar responsibility ownership compared with checkpoint store. I prefer each storage could have their own implementation to discard its directories to offer agility.
   
   I don't agree here. This really sounds to me like both making abstractions harder for the future and inconsistencies waiting to happen.
   
   Can there be a case in which different storages behave differently? Can it be that a storage would retain data when "Job=Canceled" and "Properties=NeverRetain", but delete them when "Job=Canceled" and "Properties=RetainOnFailure"?
   
   About the "checkpoint id counter": That one has a pretty clear life cycle: exactly that of the job, no ZK data should be left when the job finishes/cancels/fails. No retention policies or so. The JobStatus there is only to differentiate between shutdown-without-delete (suspend = lost leadership) and shutdown-with-delete. Arguable, that should actually be a boolean flag, not passing the JobStatus.
   
   > When talking about cleanup without recursive. I think your assumptions might not happen in real environment as previous job would always have a different job-id sub-directory which is definitely not the same as current job. We would only at most clean up the parent directory at job-id level and I think we would not cleanup other or previous jobs.
   
   That sounds exactly right, we should not try and clean up data from a previous job. That risks data loss in certain setups, when users configure the setup in an "unconventional way".
   
   > Moreover, for HDFS (or non-object storage), the directory could only be discarded if empty. However, we often come across problem that checkpoint coordinator deletes files not so fast which lead the sub-directory is not exactly empty. In the end, we cannot ensure the directory could be cleaned up as expected.
   To make cleanup more safer, we could add documentation to not set savepoint target location under the job-id sub-directory. What do you think?
   
   I am not quite convinced here as well. This sounds again like risking data loss when users make a setup error. "Fixing by documentation" is often not preventing these things.
   What about "incremental savepoints" in the future? They would leave data in the "shared" directory.
   
   Can we instead make sure that we wait for checkpoint delete operations to go through before attempting to delete the parent directory?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services