Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/03/30 03:51:40 UTC

[GitHub] [flink] Thesharing opened a new pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Thesharing opened a new pull request #19275:
URL: https://github.com/apache/flink/pull/19275


   ## Brief change log
   
   With this change, the job won't be terminated until the archiving of its ExecutionGraphInfo finishes or a timeout happens. The timeout is set to be the same as the value of `cluster.services.shutdown-timeout`. Currently, the default value of `cluster.services.shutdown-timeout` is 30s.
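
A minimal sketch of the "wait for archiving or time out" behavior described above, written against the plain JDK rather than Flink's own utilities. The class and method names (`ArchiveWaitSketch`, `waitForArchiving`) are hypothetical and not taken from the PR; `orTimeout` requires JDK 9 or later.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class ArchiveWaitSketch {

    /**
     * Returns a future that completes with true once archiving has finished,
     * or with false if archiving failed or the timeout elapsed first.
     * Job termination would be chained on the returned future.
     */
    static CompletableFuture<Boolean> waitForArchiving(
            CompletableFuture<Void> archiveFuture, Duration timeout) {
        return archiveFuture
                // Work on a dependent future so the timeout does not fail the caller's future.
                .thenApply(ignored -> true)
                .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                // Timeout and archiving failure are both mapped to false so termination can proceed.
                .exceptionally(throwable -> false);
    }
}
```

With a timeout equal to `cluster.services.shutdown-timeout`, a hanging archive call would delay termination by at most that duration in this sketch.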
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - *Added unit tests that validate that the waiting and the timeout work as expected.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   * 5a72b6c1e5b8234587c558e4a631f3145b6a6262 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] Thesharing commented on a change in pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on a change in pull request #19275:
URL: https://github.com/apache/flink/pull/19275#discussion_r838093153



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java
##########
@@ -618,12 +620,16 @@ private void runJob(JobManagerRunner jobManagerRunner, ExecutionType executionTy
                                 getMainThreadExecutor());
 
         final CompletableFuture<Void> jobTerminationFuture =
-                cleanupJobStateFuture.thenCompose(
-                        cleanupJobState ->
-                                removeJob(jobId, cleanupJobState)
-                                        .exceptionally(
-                                                throwable ->
-                                                        logCleanupErrorWarning(jobId, throwable)));
+                cleanupJobStateFuture.thenComposeAsync(
+                        (jobTerminalState) ->

Review comment:
       ```suggestion
                           jobTerminalState ->
   ```







[GitHub] [flink] XComp edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084341166


   Thanks @Thesharing for your contribution. I looked into it and was wondering whether you also considered utilizing the chaining of the `CompletableFutures` within `handleJobManagerRunnerResult` as a possible solution. Right now (on `master`), `jobReachedTerminalState` archives the `ExecutionGraph` on the main thread, triggers the archiving of the `ExecutionGraph` in the history server if terminated globally, and adds the job to the `JobResultEntry` afterwards (in case of a globally terminated state). In your solution, you're passing the result future of the history server archiving through the new class `JobTerminalState` and chaining the history server archiving result later on.
   
   What about making `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>` that, in the case of a globally terminal job state, completes after the history server archiving took place and the JobResultStore entry was written? WDYT?
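
The chaining proposed above might look roughly like the following sketch. Everything here is a stand-in: `CleanupState` only mimics Flink's `CleanupJobState`, and `archiveToHistoryServer` and `writeJobResult` are hypothetical placeholders for the real archiving and JobResultStore calls.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

final class TerminalStateChainingSketch {

    /** Illustrative stand-in for Flink's CleanupJobState. */
    enum CleanupState { LOCAL, GLOBAL }

    /** Records the order in which the steps ran. */
    final List<String> steps = new CopyOnWriteArrayList<>();

    /** Hypothetical placeholder for the history server archiving. */
    CompletableFuture<Void> archiveToHistoryServer() {
        return CompletableFuture.runAsync(() -> steps.add("archive"));
    }

    /** Hypothetical placeholder for writing the JobResultStore entry. */
    CompletableFuture<Void> writeJobResult() {
        return CompletableFuture.runAsync(() -> steps.add("jobResult"));
    }

    /**
     * For a globally terminal job, the returned future completes only after
     * archiving and the job result write have both finished, in that order.
     */
    CompletableFuture<CleanupState> handleTerminalState(boolean globallyTerminal) {
        if (!globallyTerminal) {
            return CompletableFuture.completedFuture(CleanupState.LOCAL);
        }
        return archiveToHistoryServer()
                .thenCompose(ignored -> writeJobResult())
                .thenApply(ignored -> CleanupState.GLOBAL);
    }
}
```

Because the cleanup would be chained on the returned future, no extra carrier class is needed to pass the archiving result along. Error handling for a failed archive is left out of this sketch.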





[GitHub] [flink] XComp edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085493232


   That's a good point. Thanks for the detailed explanation. I was thinking about it: Essentially, the question is whether we consider the `HistoryServer` archiving to be part of the job (which means that we want to finish it before the cleanup phase starts and the `JobManagerRunner` is removed) or whether we want it to run concurrently to the cleanup logic. For the latter case, I even thought of integrating it into the cleanup phase by making it implement the `GloballyCleanableResource` interface. But I'm not a fan of it because, semantically, cleaning up and archiving are two different things (it would have the benefit of getting retries out-of-the-box in case of failure, though). This PR proposes a hybrid approach (i.e. triggering the archiving before the cleanup phase but letting it run concurrently to it), which makes the code more complex. Hence, I'd propose going for one of the options I described, depending on the use case we want to cover.
   
   I still tend to lean towards the first approach (waiting for the archiving before triggering the cleanup) because that's where the job is considered finished and the user should expect a result in the `HistoryServer`. 
   
   That said, keep in mind that if we go for the latter option (i.e. archiving concurrently to the cleanup), it doesn't guarantee that the cluster waits for it to finish. We have a ticket for that: FLINK-26772. The reason is that the `Dispatcher.shutdownCluster` method doesn't wait for the `Dispatcher.jobTerminationFutures` to complete.
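
The two orderings discussed above can be contrasted in a small sketch. The names are hypothetical, and the suppliers stand in for whatever triggers the archiving and the cleanup in the `Dispatcher`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class ArchiveOrderingSketch {

    /** First option: the cleanup is triggered only after archiving has completed. */
    static CompletableFuture<Void> archiveThenCleanup(
            Supplier<CompletableFuture<Void>> archive,
            Supplier<CompletableFuture<Void>> cleanup) {
        return archive.get().thenCompose(ignored -> cleanup.get());
    }

    /** Second option: both are triggered right away and run concurrently. */
    static CompletableFuture<Void> archiveAlongsideCleanup(
            Supplier<CompletableFuture<Void>> archive,
            Supplier<CompletableFuture<Void>> cleanup) {
        CompletableFuture<Void> archiveFuture = archive.get();
        CompletableFuture<Void> cleanupFuture = cleanup.get();
        // Completes once both futures are done; it does not order them.
        return CompletableFuture.allOf(archiveFuture, cleanupFuture);
    }
}
```

In the second variant the combined future only helps if something actually waits on it, which is the gap FLINK-26772 describes for cluster shutdown.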





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * 051c82f77c4c1616e0d57e1ef50f3b8e66585979 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939) 
   * 7767a1c7e6702293553e2e2be3ac980731582434 UNKNOWN
   





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084046966


   cc @zhuzhurk @XComp Would you mind helping me review this pull request if you have free time?





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085618588


   > As for FLINK-26772, I'm wondering whether we could make sure shutdownFuture is not completed until the jobTerminationFutures are all completed or not.
   
   Correct, that's the plan around FLINK-26772





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   * 5a72b6c1e5b8234587c558e4a631f3145b6a6262 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34105) 
   





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085759396


   I've updated the pull request according to your suggestion, @XComp. Would you mind reviewing it again when you have free time? Thank you so much in advance.





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * dd080fe4c62df22212af4b5b75eb8f6755a43f40 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933) 
   





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085598840


   > One other point that came up which I'd like to share: If the user enables the history server, I'd suppose that he/she has a strong desire for the result to be in there. That would be another reason why we would consider the HistoryServer in that case a "first-level" citizen next to the `ExecutionGraphInfoStore` which should be populated before actually triggering the cleanup phase.
   
   I get it. If a user hits the worst case I mentioned before, he/she can just turn off the archiving. Just for my information: would it try to archive the ExecutionGraph once again before it re-triggers another resource cleanup?





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084341166


   Thanks @Thesharing for your contribution. I looked into it and was wondering whether you also considered utilizing the chaining of the `CompletableFutures` as a possible solution. Right now (on `master`), `jobReachedTerminalState` archives the `ExecutionGraph` on the main thread, triggers the archiving of the `ExecutionGraph` in the history server if terminated globally, and adds the job to the `JobResultEntry` afterwards (in case of a globally terminated state).
   
   Essentially, we could just make `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>` that completes in the case of a globally terminal job state after the history server archiving took place and the JobResultStore entry was written. WDYT?





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1083013320


   @flinkbot run azure





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * 051c82f77c4c1616e0d57e1ef50f3b8e66585979 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939) 
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * dd080fe4c62df22212af4b5b75eb8f6755a43f40 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * 5a72b6c1e5b8234587c558e4a631f3145b6a6262 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=34105) 
   





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085589841


   Thank you for the detailed analysis, @XComp. I agree that cleaning up and archiving are two different things. Integrating the archiving into the resource cleanup may confuse users.
   
   I know that if the resource cleanup is not completed, the JobResultEntry will not be marked as clean, and a retry of the resource cleanup would be triggered. Would it try to archive the ExecutionGraph once again before it re-triggers another resource cleanup? If so, I think waiting for the archiving before triggering the cleanup may be better. However, I'm worried about the worst case: the archiving fails over and over again (due to a busy disk or a slow network), which makes the job retry many times. If we run the archiving and the cleanup concurrently, a failure of the archiving won't make the job retry over and over again. We would just lose the archive, the same as happens today.
   
   As for FLINK-26772, I'm wondering whether we could make sure `shutdownFuture` is not completed until the `jobTerminationFutures` are all completed or not.
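
One way to picture this idea is a shutdown future derived from all job termination futures. This is only a sketch of the idea; `ShutdownWaitSketch` and `shutdownAfter` are hypothetical names and do not reflect how the `Dispatcher` is implemented.

```java
import java.util.Collection;
import java.util.concurrent.CompletableFuture;

final class ShutdownWaitSketch {

    /**
     * Returns a future that completes only after every job termination future
     * has completed. If any of them fails, the returned future fails as well.
     */
    static CompletableFuture<Void> shutdownAfter(
            Collection<CompletableFuture<Void>> jobTerminationFutures) {
        return CompletableFuture.allOf(
                jobTerminationFutures.toArray(new CompletableFuture<?>[0]));
    }
}
```

Whether a failed job termination should block or fail the shutdown is a design decision that this sketch leaves open.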





[GitHub] [flink] flinkbot commented on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   ## CI report:
   
   * dd080fe4c62df22212af4b5b75eb8f6755a43f40 UNKNOWN
   





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085588177


   One other point that came up which I'd like to share: If the user enables the history server, I'd suppose that he/she has a strong desire for the result to be in there. That would be another reason why we would consider the HistoryServer in that case a "first-level" citizen next to the `ExecutionGraphInfoStore` which should be populated before actually triggering the cleanup phase.





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085631615


   About FLINK-26976: yes, that might be the easiest solution (I haven't looked into it in more detail). But a test covering this should be added to be sure. (Maybe we should move discussions like that into the corresponding ticket.)





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 051c82f77c4c1616e0d57e1ef50f3b8e66585979 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939) 
   



[GitHub] [flink] Thesharing edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084449612


   Thank you so much for your review and suggestions, @XComp! 😄 
   
   ![Illustration](https://user-images.githubusercontent.com/6576831/161047726-613407d3-114e-4a28-a536-de2b61552576.jpg)
   
   I drew an illustration of the two options. Option 1 chains the result future of archiving and the result future of resource cleanup. Option 2 makes `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>`.
   
   Option 1 could parallelize the two IO operations. Furthermore, if the archiving takes a long time in the worst case, the job may be terminated by users or external resource providers. In this situation, the job still gets cleaned up. Therefore, I think option 1 may be better.





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1086125286


   Sure, I will do another pass on Monday 👍 





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "1083013320",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   



[GitHub] [flink] Thesharing commented on a change in pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on a change in pull request #19275:
URL: https://github.com/apache/flink/pull/19275#discussion_r838093153



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java
##########
@@ -618,12 +620,16 @@ private void runJob(JobManagerRunner jobManagerRunner, ExecutionType executionTy
                                 getMainThreadExecutor());
 
         final CompletableFuture<Void> jobTerminationFuture =
-                cleanupJobStateFuture.thenCompose(
-                        cleanupJobState ->
-                                removeJob(jobId, cleanupJobState)
-                                        .exceptionally(
-                                                throwable ->
-                                                        logCleanupErrorWarning(jobId, throwable)));
+                cleanupJobStateFuture.thenComposeAsync(
+                        (jobTerminalState) ->

Review comment:
       ```suggestion
                           jobTerminalState ->
   ```







[GitHub] [flink] Thesharing edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085628342


   > For the HistoryServer, that's not the case. It will try to trigger the archiving again but would probably find the ExecutionGraph already archived for that job. This will result in a failure, i.e. the archiving is not idempotent, which it actually should be. I created [FLINK-26976](https://issues.apache.org/jira/browse/FLINK-26976) to cover this.
   
   Thank you for your explanation, Matthias! I'm going to fix my pull request to make sure archiving is finished before triggering the cleanup.
   
   For FLINK-26976, changing `org.apache.flink.runtime.history.FsJobArchivist#archiveJob` from
   ```java
   OutputStream out = fs.create(path, FileSystem.WriteMode.NO_OVERWRITE);
   ```
   to
   ```java
   OutputStream out = fs.create(path, FileSystem.WriteMode.OVERWRITE);
   ```
   could solve the problem. Am I right?
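   
   To illustrate why the write mode matters for idempotency, here is a minimal sketch that uses only JDK file APIs as a stand-in. It is not Flink's `FileSystem`, and the class and method names (`ArchiveWriteModeSketch`, `archive`, `archiveTwice`) are hypothetical. `CREATE_NEW` plays the role of a no-overwrite mode, while `CREATE` plus `TRUNCATE_EXISTING` plays the role of an overwrite mode.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.FileAlreadyExistsException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   /** Hypothetical stand-in for an archivist; not Flink code. */
   final class ArchiveWriteModeSketch {

       /** Writes the archive; returns false if it was rejected because the file exists. */
       static boolean archive(Path path, String json, boolean overwrite) {
           byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
           try {
               if (overwrite) {
                   // A repeated call replaces the previous archive.
                   Files.write(path, bytes, StandardOpenOption.CREATE,
                           StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
               } else {
                   // A repeated call fails because the file already exists.
                   Files.write(path, bytes, StandardOpenOption.CREATE_NEW,
                           StandardOpenOption.WRITE);
               }
               return true;
           } catch (FileAlreadyExistsException e) {
               return false;
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }

       /** Archives the same job twice, as a retriggered termination would. */
       static boolean[] archiveTwice(boolean overwrite) {
           try {
               Path dir = Files.createTempDirectory("archive-sketch");
               Path file = dir.resolve("job-1");
               boolean first = archive(file, "{}", overwrite);
               boolean second = archive(file, "{}", overwrite);
               Files.deleteIfExists(file);
               Files.deleteIfExists(dir);
               return new boolean[] {first, second};
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }
   }
   ```
   
   In this sketch only the overwrite variant accepts the second write. Whether silently replacing an existing archive is the right behavior for the HistoryServer is a separate question for the ticket.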





[GitHub] [flink] Thesharing edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084449612


   > Thanks @Thesharing for your contribution. I looked into it and was wondering whether you also considered utilizing the chaining of the `CompletableFutures` within `handleJobManagerRunnerResult` as a possible solution. Right now (on `master`), `jobReachedTerminalState` archives the `ExecutionGraph` on the main thread, triggers the archiving of the `ExecutionGraph` in the history server if terminated globally, and adds the job to the `JobResultEntry` afterwards (in case of a globally terminated state). In your solution you're passing the result future of the history server archiving through this new class `JobTerminalState` and chaining the history server archiving result later on.
   > 
   > What about making `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>` that completes, in the case of a globally terminal job state, after the history server archiving took place and the JobResultStore entry was written? WDYT?
   
   Thank you so much for your review and suggestions, @XComp! 😄 
   
   ![Illustration](https://user-images.githubusercontent.com/6576831/161047726-613407d3-114e-4a28-a536-de2b61552576.jpg)
   
   I drew an illustration of the two options. Option 1 chains the result future of archiving and the result future of resource cleanup. Option 2 makes `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>`.
   
   Option 1 could parallelize the two IO operations. Furthermore, if the archiving takes a long time in the worst case, the job may be terminated by users or external resource providers. In this situation, the job still gets cleaned up. Therefore, I think option 1 may be better.
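   
   The difference between the two options can be sketched with plain `CompletableFuture` chaining. This is only an illustration under simplifying assumptions: `TerminationOrderingSketch`, `option1`, `option2` and `cleanup` are hypothetical names, the cleanup completes immediately, and nothing here is the actual `Dispatcher` code.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.CopyOnWriteArrayList;

   /** Hypothetical sketch; none of these names exist in Flink. */
   final class TerminationOrderingSketch {

       /** Option 1: cleanup starts right away; termination waits for archiving and cleanup. */
       static CompletableFuture<Void> option1(
               CompletableFuture<Void> archiving, List<String> events) {
           CompletableFuture<Void> archived = archiving.thenRun(() -> events.add("archived"));
           CompletableFuture<Void> cleaned = cleanup(events);
           return CompletableFuture.allOf(archived, cleaned);
       }

       /** Option 2: cleanup is only triggered after archiving has completed. */
       static CompletableFuture<Void> option2(
               CompletableFuture<Void> archiving, List<String> events) {
           return archiving
                   .thenRun(() -> events.add("archived"))
                   .thenCompose(ignored -> cleanup(events));
       }

       /** Stand-in for the resource cleanup; records the event and completes immediately. */
       private static CompletableFuture<Void> cleanup(List<String> events) {
           events.add("cleaned");
           return CompletableFuture.completedFuture(null);
       }

       /** Returns the events recorded while the archiving future is still pending. */
       static List<String> eventsWhileArchivingPending(boolean useOption1) {
           List<String> events = new CopyOnWriteArrayList<>();
           CompletableFuture<Void> archiving = new CompletableFuture<>();
           CompletableFuture<Void> termination =
                   useOption1 ? option1(archiving, events) : option2(archiving, events);
           List<String> snapshot = new ArrayList<>(events);
           archiving.complete(null); // the archiving finishes
           termination.join(); // termination completes in both variants
           return snapshot;
       }
   }
   ```
   
   With option 1 the cleanup has already run while the archiving is still pending; with option 2 nothing has happened yet. In both variants the termination future only completes once the archiving has finished.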





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085616997


   The archiving will be retriggered in case of a JobManager failover. Consider that the job finished globally. The following steps would happen:
   1. Archiving of the job to the `ExecutionGraphInfoStore`
   2. (optional) HistoryServer archiving is triggered
   3. JobResult is written as dirty entry to `JobResultStore`
   4. Cleanup of job-related artifacts is triggered in a retryable fashion
   5. JobResult is marked as clean in the JobResultStore
   6. The job termination future completes
   
   In this setup, the archiving only happens once. No retry is triggered. Now, let's assume the JobManager fails over for whatever reason in phase 4. That means that the dirty entry for this job already exists in the JobResultStore. A failover of the JobManager would start a `CleanupJobManagerRunner` that will immediately complete and trigger the termination process (as described above) again. As a consequence, a sparse ArchivedExecutionGraph is archived into the `ExecutionGraphInfoStore`. That is ok for now because the ExecutionGraphInfoStore only lives on the JobManager node and is not shared outside of its scope.
   For the HistoryServer, that's not the case. It will try to trigger the archiving again but would probably find the ExecutionGraph already archived for that job. This will result in a failure, i.e. the archiving is not idempotent, which it actually should be. I created FLINK-26976 to cover this.
   
   Another follow-up issue should be making the archiving retryable as well. This isn't the case yet, but it is desirable. I would suggest fixing that as a separate issue to avoid increasing the PR's scope. Therefore, I created FLINK-26984 to cover the retrying of the archiving.
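   
   The failover scenario above can be replayed with a toy model. This is a rough sketch, not Flink code: `TerminationReplaySketch` and its fields are hypothetical, the shared HistoryServer archive is modelled as a `Set` that survives the failover, and the JobResultStore as a `Map` from job ID to `DIRTY`/`CLEAN`.
   
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   /** Hypothetical toy model of the termination steps; not Flink code. */
   final class TerminationReplaySketch {
       // Shared state that survives a JobManager failover in this model.
       private final Set<String> historyArchive = new HashSet<>();
       private final Map<String, String> jobResultStore = new HashMap<>();
       private final List<String> log = new ArrayList<>();

       /** Runs steps 1 to 6; stops in step 4 if a failover is simulated. */
       void terminate(String jobId, boolean failoverDuringCleanup) {
           log.add("1 archive to ExecutionGraphInfoStore");
           // Set.add returns false if the job was archived before: not idempotent.
           log.add(historyArchive.add(jobId)
                   ? "2 HistoryServer archiving succeeded"
                   : "2 HistoryServer archiving failed: archive exists");
           jobResultStore.putIfAbsent(jobId, "DIRTY");
           log.add("3 JobResult is " + jobResultStore.get(jobId));
           if (failoverDuringCleanup) {
               log.add("4 cleanup interrupted by failover");
               return;
           }
           log.add("4 cleanup finished");
           jobResultStore.put(jobId, "CLEAN");
           log.add("5 JobResult is CLEAN");
           log.add("6 termination future completes");
       }

       /** First run fails over in step 4; the second run replays the whole process. */
       static List<String> replayAfterFailover() {
           TerminationReplaySketch sketch = new TerminationReplaySketch();
           sketch.terminate("job-1", true);
           sketch.terminate("job-1", false);
           return sketch.log;
       }
   }
   ```
   
   In the replayed run, step 2 is the only step that fails in this model, which is the non-idempotent archiving that FLINK-26976 is about.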





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * dd080fe4c62df22212af4b5b75eb8f6755a43f40 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933) 
   * 051c82f77c4c1616e0d57e1ef50f3b8e66585979 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "1083013320",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   



[GitHub] [flink] Thesharing edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084046966


   cc @zhuzhurk @XComp Would you mind helping me review this pull request if you have free time? This change is related to the cleanup of resources, including JobManagerRunnerRegistry and JobResultStore.





[GitHub] [flink] Thesharing edited a comment on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085598840


   > One other point that came up which I'd like to share: if the user enables the history server, I'd assume that they have a strong desire for the result to end up there. That would be another reason to consider the HistoryServer a "first-class" citizen in that case, next to the `ExecutionGraphInfoStore`: both should be populated before the cleanup phase is actually triggered.
   
   I get it. If users hit the worst case I mentioned before, they can just turn off the archiving. Just for my information: would it try to archive the ExecutionGraph once again before it re-triggers another resource cleanup? If so, I think waiting for the archiving before triggering the cleanup is indeed better.





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1084449612


   > Thanks @Thesharing for your contribution. I looked into it and was wondering whether you also considered utilizing the chaining of the `CompletableFutures` within `handleJobManagerRunnerResult` as a possible solution. Right now (on `master`), `jobReachedTerminalState` archives the `ExecutionGraph` on the main thread, triggers the archiving of the `ExecutionGraph` in the history server if terminated globally, and adds the job to the `JobResultEntry` afterwards (in case of a globally terminated state). In your solution you're passing the result future of the history server archiving through this new class `JobTerminalState` and chaining the history server archiving result later on.
   > 
   > What about making `handleJobManagerRunnerResult` and `jobManagerRunnerFailed` return a `CompletableFuture<CleanupJobState>` that completes, in the case of a globally terminal job state, after the history server archiving took place and the JobResultStore entry was written? WDYT?
   
   Thank you so much for your review and suggestions, @XComp! 😄 
   
   Chaining the result future of archiving and the result future of resource cleanup could parallelize the two IO operations. Furthermore, if the archiving takes a long time in the worst case, the job may be terminated by users or external resource providers. In this situation, the job still gets cleaned up. Therefore, I think chaining the two result futures may be better.





[GitHub] [flink] XComp commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
XComp commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085493232


   That's a good point. Thanks for the detailed explanation. I was thinking about it: essentially, the question is whether we consider the `HistoryServer` archiving to be part of the job (which means that we want to finish it before the cleanup phase starts and the `JobManagerRunner` is removed) or whether we want it to run concurrently with the cleanup logic. For the latter case, I even thought of integrating it into the cleanup phase by making it implement the `GloballyCleanableResource` interface. But I'm not a fan of it because semantically, cleaning up and archiving are two different things (it would have the benefit of getting retries out-of-the-box in case of failure, though). This PR proposes a hybrid approach (i.e. triggering the archiving before the cleanup phase but letting it run concurrently with it), which makes the code more complex. Hence, I'd propose going for one of the options I described, depending on the use case we want to cover.
   
   I still tend to lean towards the first approach (waiting for the archiving before triggering the cleanup) because that's where the job is considered finished and the user should expect a result in the `HistoryServer`. 
   
   That said, keep in mind that if we go for the latter option (i.e. archiving concurrently with the cleanup), it doesn't guarantee that the cluster waits for it to finish. We have a ticket for that: FLINK-26772. The reason is that the `Dispatcher.shutdownCluster` method doesn't wait for the `Dispatcher.jobTerminationFutures` to complete.
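   
   As a rough sketch of what waiting for the termination futures could look like (an assumption for illustration, not the actual `Dispatcher` code and not necessarily how FLINK-26772 would be fixed), a shutdown step can be composed on top of all pending futures. `ShutdownWaitSketch` and `shutdownAfter` are hypothetical names.
   
   ```java
   import java.util.Collection;
   import java.util.concurrent.CompletableFuture;

   /** Hypothetical sketch; not the actual Dispatcher code. */
   final class ShutdownWaitSketch {

       /** Completes only after every job termination future has completed. */
       static CompletableFuture<String> shutdownAfter(
               Collection<CompletableFuture<Void>> jobTerminationFutures) {
           CompletableFuture<?>[] pending =
                   jobTerminationFutures.toArray(new CompletableFuture<?>[0]);
           // handle(...) proceeds with the shutdown even if a termination failed.
           return CompletableFuture.allOf(pending)
                   .handle((ignored, failure) -> "cluster shut down");
       }
   }
   ```
   
   The composition is non-blocking, so it would not stall the calling thread. A real implementation would probably also need a timeout so that a hanging archiving cannot block the shutdown forever.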





[GitHub] [flink] Thesharing commented on pull request #19275: [FLINK-24491][runtime] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
Thesharing commented on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1085628342


   > For the HistoryServer, that's not the case. It will try to trigger the archiving again but would probably find the ExecutionGraph already archived for that job. This will result in a failure, i.e. the archiving is not idempotent, which it actually should be. I created [FLINK-26976](https://issues.apache.org/jira/browse/FLINK-26976) to cover this.
   
   Thank you for your explanation, Matthias! I'm going to fix my pull request to make sure archiving is finished before triggering the cleanup.
   
   For FLINK-26976, changing `org.apache.flink.runtime.history.FsJobArchivist#archiveJob` from
   
   ```java
   OutputStream out = fs.create(path, FileSystem.WriteMode.NO_OVERWRITE);
   ```
   to
   ```java
   OutputStream out = fs.create(path, FileSystem.WriteMode.OVERWRITE);
   ```
   could solve the problem. Am I right?





[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * dd080fe4c62df22212af4b5b75eb8f6755a43f40 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933) 
   * 051c82f77c4c1616e0d57e1ef50f3b8e66585979 UNKNOWN
   



[GitHub] [flink] flinkbot edited a comment on pull request #19275: [FLINK-24491] Make the job termination wait until the archiving of ExecutionGraphInfo finishes

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19275:
URL: https://github.com/apache/flink/pull/19275#issuecomment-1082595206


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33933",
       "triggerID" : "dd080fe4c62df22212af4b5b75eb8f6755a43f40",
       "triggerType" : "PUSH"
     }, {
       "hash" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33939",
       "triggerID" : "051c82f77c4c1616e0d57e1ef50f3b8e66585979",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7767a1c7e6702293553e2e2be3ac980731582434",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943",
       "triggerID" : "1083013320",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 7767a1c7e6702293553e2e2be3ac980731582434 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33943) 
   