Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/01/20 08:57:03 UTC

[GitHub] [flink] dmvk opened a new pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

dmvk opened a new pull request #18416:
URL: https://github.com/apache/flink/pull/18416


   https://issues.apache.org/jira/browse/FLINK-25715
   
   
   Currently, in application mode, any exception that happens in the application driver (before an actual job is submitted) leads to a fail-over. These errors are usually not retryable, and we don't have a good way of reporting them to the user.
   
   We'll introduce a new config option, `execution.submit-failed-job-on-application-error`, that instead submits a failed job with the `$internal.pipeline.job-id`.
   
   This is intended to be used in combination with `execution.shutdown-on-application-finish = false` to allow the user to retrieve the information about the failed submission.
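
As a rough illustration of how the two options are meant to interact, here is a minimal sketch in plain Java. It does not use Flink's `Configuration` API; the class `FailedJobOptionSketch`, its helper methods, and the default values applied to missing keys are assumptions made for this example only.

```java
import java.util.Map;

final class FailedJobOptionSketch {
    // Option keys as named in the PR description.
    static final String SUBMIT_FAILED_JOB = "execution.submit-failed-job-on-application-error";
    static final String SHUTDOWN_ON_FINISH = "execution.shutdown-on-application-finish";

    // Hypothetical helper: reads a boolean from a flat key/value config.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        final String raw = conf.get(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }

    // True when a driver error would be reported as a failed job AND the cluster
    // stays up afterwards, so the failure can still be queried.
    // The defaults (false / true) are assumptions of this sketch.
    static boolean failureStaysRetrievable(Map<String, String> conf) {
        final boolean submitFailedJob = getBoolean(conf, SUBMIT_FAILED_JOB, false);
        final boolean shutdownOnFinish = getBoolean(conf, SHUTDOWN_ON_FINISH, true);
        return submitFailedJob && !shutdownOnFinish;
    }
}
```

With both options set as described above, `failureStaysRetrievable` returns `true`; with an empty configuration it returns `false`.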
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] dmvk commented on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1018578216


   I've updated the documentation as per the offline discussion. PTAL





[GitHub] [flink] rmetzger edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
rmetzger edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1023323763


   Thanks for implementing this nice feature.
   
   I have one question: I just tried this out, and it works as expected. However, when I issue a REST `DELETE /cluster` call (once I've retrieved the exception), the JobManager process will exit with code 0.
   For a cancelled job, this is fine, but for a failed Application Mode cluster, I would expect exit code 1 (so that my resource manager can restart the process)
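
The expectation described above could be sketched as follows. The `Outcome` enum and `exitCodeFor` are hypothetical names and do not reflect how Flink actually derives the JobManager's exit code; the sketch only spells out the requested mapping.

```java
final class ExitCodeSketch {
    // Hypothetical terminal outcomes of an Application Mode cluster.
    enum Outcome { SUCCEEDED, CANCELED, FAILED }

    // Requested behaviour: only a failed application yields a non-zero exit code,
    // so that an external resource manager can decide to restart the process.
    static int exitCodeFor(Outcome outcome) {
        return outcome == Outcome.FAILED ? 1 : 0;
    }
}
```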





[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788921137



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/webmonitor/TestingDispatcherGateway.java
##########
@@ -230,6 +234,10 @@ public DispatcherId getFencingToken() {
         private BiFunction<JobID, String, CompletableFuture<String>>
                 stopWithSavepointAndGetLocationFunction;
 
+        public Builder() {

Review comment:
       🤦 







[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788958346



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       Completing the `jobIdsFuture` exceptionally here seems right
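
A minimal sketch of what completing the future exceptionally could look like, using only `java.util.concurrent`. The class `JobIdsFutureSketch`, the method `runEntryPoint`, the string job IDs, and the placeholder success value are assumptions for illustration; only the two flag names and the error message come from the diff above.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class JobIdsFutureSketch {
    static final String OPTION_KEY = "execution.submit-failed-job-on-application-error";

    // Hypothetical stand-in for the entry-point runner; job IDs are plain strings here.
    static CompletableFuture<List<String>> runEntryPoint(
            boolean submitFailedJobOnApplicationError, boolean enforceSingleJobExecution) {
        final CompletableFuture<List<String>> jobIdsFuture = new CompletableFuture<>();
        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
            // Unsupported combination: fail the future instead of submitting a placeholder job.
            jobIdsFuture.completeExceptionally(
                    new IllegalStateException(
                            String.format(
                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",
                                    OPTION_KEY)));
            return jobIdsFuture;
        }
        // Placeholder for the real work of running the user's main method.
        jobIdsFuture.complete(List.of("hypothetical-job-id"));
        return jobIdsFuture;
    }
}
```

Callers that already handle a failed `jobIdsFuture` would then see the unsupported configuration through the same path as any other bootstrap failure.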







[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789500760



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > For this to be useful, the user should know the jobId upfront
   
   I don't think that's really true; outside of application mode users don't know the job ID upfront.
   The job name & stacktrace are the identifiable bits imo.
   
   * the stacktrace provides information on where it failed
   * the job name could be something like "Job # 4" or "Job after \<insert job name of previous successful job>", "UserClass#Line<where execute was called>", anything that is reasonably deterministic.
   
   > I think the current approach should be sufficient for now (failing the whole dispatcher bootstrap)
   
   It's not documented though ;) Neither for users (config docs) nor for devs (comment).
   
   







[GitHub] [flink] flinkbot commented on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017253519


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit d7d71f3d8f1219c69134edd68dd58b18101a0b33 (Thu Jan 20 09:01:17 UTC 2022)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
    * **This pull request references an unassigned [Jira ticket](https://issues.apache.org/jira/browse/FLINK-25715).** According to the [code contribution guide](https://flink.apache.org/contributing/contribute-code.html), tickets need to be assigned before starting with the implementation work.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7d71f3d8f1219c69134edd68dd58b18101a0b33 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772) 
   * d6fd188546d369feeb985da588e9496f0a532d47 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789500760



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > For this to be useful, the user should know the jobId upfront
   
   I don't think that's really true; outside of application mode users don't know the job ID upfront.
   The job name & stacktrace are the identifiable bits imo.
   
   * the stacktrace provides information on where it failed
   * the job name could be something like "Job #4" or "Job after <insert job name of previous successful job>", "UserClass#Line<where execute was called>", anything that is reasonably deterministic.
   
   > I think the current approach should be sufficient for now (failing the whole dispatcher bootstrap)
   
   It's not documented though ;) Neither for users (config docs) nor for devs (comment).
   
   







[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d6fd188546d369feeb985da588e9496f0a532d47 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820) 
   * 9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29903",
       "triggerID" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29903) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7d71f3d8f1219c69134edd68dd58b18101a0b33 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772) 
   * d6fd188546d369feeb985da588e9496f0a532d47 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788642062



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       I'm confused. Submitting a failed job is not supported (for some reason), yet we are doing exactly that right here?

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/webmonitor/TestingDispatcherGateway.java
##########
@@ -230,6 +234,10 @@ public DispatcherId getFencingToken() {
         private BiFunction<JobID, String, CompletableFuture<String>>
                 stopWithSavepointAndGetLocationFunction;
 
+        public Builder() {

Review comment:
       (package-)private so there's only one way to create the builder?
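
The suggestion above, a non-public constructor with a single static factory, could look roughly like this. `GatewaySketch` and its `hostname` field are hypothetical stand-ins and not the actual `TestingDispatcherGateway`; only the `newBuilder()` naming follows the diff.

```java
final class GatewaySketch {
    private final String hostname;

    private GatewaySketch(String hostname) {
        this.hostname = hostname;
    }

    String getHostname() {
        return hostname;
    }

    // The only way for outside code to obtain a builder.
    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        private String hostname = "localhost";

        // Private: code outside GatewaySketch cannot call `new Builder()`.
        private Builder() {}

        Builder setHostname(String hostname) {
            this.hostname = hostname;
            return this;
        }

        GatewaySketch build() {
            return new GatewaySketch(hostname);
        }
    }
}
```

Call sites then read `GatewaySketch.newBuilder().setHostname("jm").build()`, and a stray `new GatewaySketch.Builder()` elsewhere no longer compiles.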

##########
File path: flink-clients/src/test/java/org/apache/flink/client/testjar/FailingJob.java
##########
@@ -0,0 +1,35 @@
+package org.apache.flink.client.testjar;

Review comment:
       missing license header

##########
File path: flink-clients/src/test/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrapTest.java
##########
@@ -92,7 +94,7 @@ public void cleanup() {
     @Test
     public void testExceptionThrownWhenApplicationContainsNoJobs() throws Throwable {
         final TestingDispatcherGateway.Builder dispatcherBuilder =
-                new TestingDispatcherGateway.Builder()
+                TestingDispatcherGateway.newBuilder()

Review comment:
       can we move these to the previous commit?

##########
File path: flink-clients/src/test/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrapITCase.java
##########
@@ -147,6 +153,57 @@ public void testDispatcherRecoversAfterLosingAndRegainingLeadership() throws Exc
         }
     }
 
+    @Test
+    public void testSubmitFailedJobOnApplicationError() throws Exception {
+        final Deadline deadline = Deadline.fromNow(TIMEOUT);
+        final JobID jobId = new JobID();
+        final Configuration configuration = new Configuration();
+        configuration.set(HighAvailabilityOptions.HA_MODE, HighAvailabilityMode.ZOOKEEPER.name());
+        configuration.set(DeploymentOptions.TARGET, EmbeddedExecutor.NAME);
+        configuration.set(ClientOptions.CLIENT_RETRY_PERIOD, Duration.ofMillis(100));
+        configuration.set(DeploymentOptions.SHUTDOWN_ON_APPLICATION_FINISH, false);
+        configuration.set(DeploymentOptions.SUBMIT_FAILED_JOB_ON_APPLICATION_ERROR, true);
+        configuration.set(PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID, jobId.toHexString());
+        final TestingMiniClusterConfiguration clusterConfiguration =
+                TestingMiniClusterConfiguration.newBuilder()
+                        .setConfiguration(configuration)
+                        .build();
+        final EmbeddedHaServicesWithLeadershipControl haServices =
+                new EmbeddedHaServicesWithLeadershipControl(TestingUtils.defaultExecutor());
+        final TestingMiniCluster.Builder clusterBuilder =
+                TestingMiniCluster.newBuilder(clusterConfiguration)
+                        .setHighAvailabilityServicesSupplier(() -> haServices)
+                        .setDispatcherResourceManagerComponentFactorySupplier(
+                                createApplicationModeDispatcherResourceManagerComponentFactorySupplier(
+                                        clusterConfiguration.getConfiguration(),
+                                        FailingJob.getProgram()));
+        try (final MiniCluster cluster = clusterBuilder.build()) {
+
+            // start mini cluster and submit the job
+            cluster.start();
+
+            // wait until job is running

Review comment:
       `// wait until job was submitted`







[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789500760



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > For this to be useful, the user should know the jobId upfront
   
   I don't think that's really true; outside of application mode users don't know the job ID upfront.
   The job name & stacktrace are the identifiable bits imo.
   
   * the stacktrace provides information on where it failed
   * the job name could be something like "Job # 4" or "Job after <insert job name of previous successful job>", "UserClass#Line<where execute was called>", anything that is reasonably deterministic.
   
   > I think the current approach should be sufficient for now (failing the whole dispatcher bootstrap)
   
   It's not documented though ;) Neither for users (config docs) nor for devs (comment).
   
   







[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     }, {
       "hash" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29903",
       "triggerID" : "9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d6fd188546d369feeb985da588e9496f0a532d47 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820) 
   * 9b8c8e8ace0ea4922f12c3014fa8d9b8c1acf11c Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29903) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rmetzger commented on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
rmetzger commented on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1023323763


   Thanks for implementing this nice feature.
   
   I have one question: I just tried this out, and it works as expected. However, when I issue a REST `DELETE /cluster` call (once I've retrieved the exception), the JobManager process exits with code 0.
   For a cancelled job, this is fine, but for a failed Application Mode cluster, I would expect exit code 1 (so that my resource manager can restart the process).
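
To make the expectation above concrete, here is a minimal sketch of an exit-code mapping that a resource manager could rely on. The `FinalStatus` enum and the `ExitCodes.exitCodeFor` method are hypothetical names used only for illustration, and returning 0 for a successful run is an assumption. This is not Flink's actual shutdown logic, nor its real exit codes.

```java
/** Hypothetical final states of an application cluster (illustration only, not a Flink type). */
enum FinalStatus {
    SUCCEEDED,
    CANCELED,
    FAILED
}

final class ExitCodes {
    private ExitCodes() {}

    /**
     * Maps a final status to the process exit code described in the comment above:
     * 0 is fine for a cancelled job, while a failed application should exit with 1
     * so that an external resource manager can restart the process.
     */
    static int exitCodeFor(FinalStatus status) {
        switch (status) {
            case SUCCEEDED: // assumption: success also exits with 0
            case CANCELED:
                return 0;
            case FAILED:
                return 1;
            default:
                throw new IllegalArgumentException("Unknown status: " + status);
        }
    }
}
```

With such a mapping, `ExitCodes.exitCodeFor(FinalStatus.FAILED)` yields a non-zero code even when the shutdown is triggered through `DELETE /cluster`.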





[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7d71f3d8f1219c69134edd68dd58b18101a0b33 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788920768



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       The main reason was not having to think about scenarios when the driver can actually submit more than one job.
   
   For example:
   - If the exception happens between first and second submission (first one has already completed). What job id do we submit the job with?
   - Choosing the submission id is tricky here in general, as we can't really use the `$internal.pipeline.job-id` (which implies the "single job mode")
   
   Any thoughts on this?













[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820",
       "triggerID" : "d6fd188546d369feeb985da588e9496f0a532d47",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d6fd188546d369feeb985da588e9496f0a532d47 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29820) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788923721



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       The intention was actually to have a way to communicate this limitation to the user, but I see that using ZERO_JOB_ID is wrong 🤔 I think failing the future here might actually be a better option 👍








[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r788923721



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       The intention was actually to have a way to communicate this limitation to the user, but I see that using ZERO_JOB_ID might not be correct. I need to think about it.







[GitHub] [flink] dmvk commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789487585



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > If the job had run the job ID would be random as well, right? Couldn't we use that then?
   
   For this to be useful, the user should know the jobId upfront (that's one of the reasons for supporting the single execution mode only). Also this is not really an exception from the "application driver", but just an unsupported combination of configurations.
   
   I think the current approach should be sufficient for now (failing the whole dispatcher bootstrap). Also, it's an experimental feature, so we can iterate on this later if we find it confusing.
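
As a rough sketch of what "failing the whole dispatcher bootstrap" for this unsupported combination could look like, the snippet below fails a future instead of submitting a placeholder job. The class `BootstrapConfigCheck`, the method `validate`, and the wording of the error message are hypothetical and only illustrate the idea; the two boolean parameters mirror the ones in the diff above, and this is not the code that was merged.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical helper illustrating the "fail the future" approach (not part of Flink). */
final class BootstrapConfigCheck {
    private BootstrapConfigCheck() {}

    /**
     * Returns an exceptionally completed future when the failed-job submission option
     * is enabled without single-job enforcement; otherwise an already completed future.
     */
    static CompletableFuture<Void> validate(
            boolean submitFailedJobOnApplicationError, boolean enforceSingleJobExecution) {
        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
            CompletableFuture<Void> failed = new CompletableFuture<>();
            // Illustrative message only; the real wording is up to the implementation.
            failed.completeExceptionally(
                    new IllegalStateException(
                            "Submitting a failed job on application error requires "
                                    + "single job execution to be enforced."));
            return failed;
        }
        return CompletableFuture.completedFuture(null);
    }
}
```

A caller would chain the actual application run after `validate(...)`, so the misconfiguration surfaces as a failed bootstrap rather than as a job registered under `ZERO_JOB_ID`.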







[GitHub] [flink] flinkbot edited a comment on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7d71f3d8f1219c69134edd68dd58b18101a0b33 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29772) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot commented on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017255537


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d7d71f3d8f1219c69134edd68dd58b18101a0b33",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7d71f3d8f1219c69134edd68dd58b18101a0b33 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789479364



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > If the exception happens between first and second submission (first one has already completed). What job id do we submit the job with?
   
   If the job had run the job ID would be random as well, right? Couldn't we use that then?
   
   > using ZERO_JOB_ID might not be correct
   
   We should try to reduce this usage as much as possible, because it is quite problematic (e.g., it breaks archiving).
   (Ideally we find a way to have a proper user-facing job ID)







[GitHub] [flink] zentol commented on a change in pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #18416:
URL: https://github.com/apache/flink/pull/18416#discussion_r789500760



##########
File path: flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java
##########
@@ -266,10 +271,22 @@ private void runApplicationEntryPoint(
             final Set<JobID> tolerateMissingResult,
             final DispatcherGateway dispatcherGateway,
             final ScheduledExecutor scheduledExecutor,
-            final boolean enforceSingleJobExecution) {
+            final boolean enforceSingleJobExecution,
+            final boolean submitFailedJobOnApplicationError) {
+        if (submitFailedJobOnApplicationError && !enforceSingleJobExecution) {
+            dispatcherGateway.submitFailedJob(
+                    ZERO_JOB_ID,
+                    FAILED_JOB_NAME,
+                    new IllegalStateException(
+                            String.format(
+                                    "Submission of failed job in case of an application error ('%s') is not supported in non-HA setups.",

Review comment:
       > For this to be useful, the user should know the jobId upfront
   
   I don't think that's really true; outside of application mode users don't know the job ID upfront.
   The job name & stacktrace are the identifiable bits imo.
   
   * the stacktrace provides information on where it failed
   * the job name could be something like "Job # 4" or "Job after \<insert job name of previous successful job>", "UserClass#Line\<where execute was called>", anything that is reasonably deterministic.
   
   > I think the current approach should be sufficient for now (failing the whole dispatcher bootstrap)
   
   It's not documented though ;) Neither for users (config docs) nor for devs (comment).
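
The deterministic naming idea from the bullet list above could be sketched roughly as follows. `FailedJobNames` and its two methods are hypothetical helpers invented for this illustration. They only show that a name can be derived from information the driver already has (the submission counter or the previous job's name) and are not an existing Flink API.

```java
/** Hypothetical helper deriving deterministic names for failed submissions (illustration only). */
final class FailedJobNames {
    private FailedJobNames() {}

    /** Names the failed job by its 1-based position in the driver's submission order. */
    static String bySubmissionIndex(int index) {
        if (index < 1) {
            throw new IllegalArgumentException("Submission index must be >= 1, got " + index);
        }
        return "Job #" + index;
    }

    /** Names the failed job relative to the last job that was submitted successfully. */
    static String afterPreviousJob(String previousJobName) {
        if (previousJobName == null || previousJobName.isEmpty()) {
            throw new IllegalArgumentException("Previous job name must not be empty");
        }
        return "Job after " + previousJobName;
    }
}
```

Because both names depend only on the driver's own control flow, re-running the same application would produce the same name for the same failure point, which is what makes them usable as identifiers alongside the stacktrace.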
   
   







[GitHub] [flink] dmvk commented on pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
dmvk commented on pull request #18416:
URL: https://github.com/apache/flink/pull/18416#issuecomment-1017709954


   Thanks for the review @zentol, I've addressed your comments. Ready for the 2nd pass.





[GitHub] [flink] zentol merged pull request #18416: [FLINK-25715][clients] Add deployment option (`execution.submit-failed-job-on-application-error`) for submitting a failed job when there is an error in application driver.

Posted by GitBox <gi...@apache.org>.
zentol merged pull request #18416:
URL: https://github.com/apache/flink/pull/18416


   

