You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/10/05 08:14:16 UTC

[GitHub] [flink] rmetzger opened a new pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

rmetzger opened a new pull request #13540:
URL: https://github.com/apache/flink/pull/13540


   
   ## What is the purpose of the change
   
   This is removing a test-instability, where sometimes, the DispatcherResourceCleanupTest. testJobSubmissionUnderSameJobId() would report a `DuplicateJobSubmissionException`. 
   The test is initialized with a dispatcher, recovering the testing job graph. Once the TestingJobManagerRunner for this job graph has been created, the result future is completed, and a new job gets submitted.
   The problem is that the `DispatcherJob.isDuplicateJob()` method might still find the job in the `runningJobs` list, because the cleanup from that list happens asynchronously.
   
   The test instability is resolved by waiting until TestingJobManagerRunner has been closed.
   
   
   
   ## Verifying this change
   
   I have executed this patch 4000 times on CI, without a failure: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8421&view=results
   
   ## Does this pull request potentially affect one of the following parts:
   
   The change only affects tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6650579aa244c53d76237be84b861224c451b4ba Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7247",
       "triggerID" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8842d1927613cdcd09f4e6c671c9cb1dc671269e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7247) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6650579aa244c53d76237be84b861224c451b4ba Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199) 
   * 8842d1927613cdcd09f4e6c671c9cb1dc671269e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
rmetzger commented on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-704719240


   Thanks for your review. Merging ...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13540:
URL: https://github.com/apache/flink/pull/13540#discussion_r499630797



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/TestingJobManagerRunner.java
##########
@@ -39,6 +39,8 @@
 
 	private final CompletableFuture<ArchivedExecutionGraph> resultFuture;
 
+	private final CompletableFuture<Void> closeAsyncCalledFuture = new CompletableFuture<>();

Review comment:
       I would suggest to use a `OneShotLatch`.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws Exception {
 		final TestingJobManagerRunner testingJobManagerRunner = jobManagerRunnerFactory.takeCreatedJobManagerRunner();
 		testingJobManagerRunner.completeResultFutureExceptionally(new JobNotFinishedException(jobId));
 
+		// wait until termination JobManagerRunner closeAsync has been called.
+		// this is necessary to avoid race conditions with completion of the 1st job and the submission of the 2nd job (DuplicateJobSubmissionException).
+		testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       I agree that relying on the fact that the testing main thread will enqueue the handleAsync payload before the `submitJob` is not a safe assumption. Hence, the fix should be fine. Still I would like to understand what exactly is going wrong here.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws Exception {
 		final TestingJobManagerRunner testingJobManagerRunner = jobManagerRunnerFactory.takeCreatedJobManagerRunner();
 		testingJobManagerRunner.completeResultFutureExceptionally(new JobNotFinishedException(jobId));
 
+		// wait until termination JobManagerRunner closeAsync has been called.
+		// this is necessary to avoid race conditions with completion of the 1st job and the submission of the 2nd job (DuplicateJobSubmissionException).
+		testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       I don't fully understand why this additional synchronization step is necessary. If I am not mistaken, then `testingJobManagerRunner.completeResultFutureExceptionally` won't trigger `Dispatcher.jobNotFinished` directly but at least it will enqueue the `RunAsync` message which will run this task into the mailbox of the `Dispatcher`. `dispatcherGateway.submitJob` should do the same just that the submit message is enqueued after the `RunAsync` message.
   
   Could you show me an execution order in which the `submitJob` RPC call is executed before the `handleAsync` (https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L374)?
   
   Could you reproduce the problem locally?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13540:
URL: https://github.com/apache/flink/pull/13540#discussion_r500275816



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws Exception {
 		final TestingJobManagerRunner testingJobManagerRunner = jobManagerRunnerFactory.takeCreatedJobManagerRunner();
 		testingJobManagerRunner.completeResultFutureExceptionally(new JobNotFinishedException(jobId));
 
+		// wait until termination JobManagerRunner closeAsync has been called.
+		// this is necessary to avoid race conditions with completion of the 1st job and the submission of the 2nd job (DuplicateJobSubmissionException).
+		testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       Thanks for the clarification @rmetzger. I believe that your analysis is correct.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
rmetzger commented on a change in pull request #13540:
URL: https://github.com/apache/flink/pull/13540#discussion_r500200594



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws Exception {
 		final TestingJobManagerRunner testingJobManagerRunner = jobManagerRunnerFactory.takeCreatedJobManagerRunner();
 		testingJobManagerRunner.completeResultFutureExceptionally(new JobNotFinishedException(jobId));
 
+		// wait until termination JobManagerRunner closeAsync has been called.
+		// this is necessary to avoid race conditions with completion of the 1st job and the submission of the 2nd job (DuplicateJobSubmissionException).
+		testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       Thanks a lot for your review.
   
   
   The `DuplicateJobSubmissionException` exception is thrown, because the same job is already in the  `runningJobs` list. 
   
   I believe there is a race condition between the main thread executing the test and the dispatcher main thread: 
   (dispatcher thread) As soon as the `TestingJobManagerRunner` has been created, it is offered to a queue in the `TestingJobManagerRunnerFactory`. The (main test thread) is waiting on the queue. The next step for this thread will be updating the DispatcherJob's status, and setting up a forward of the result future from the jobManagerRunner.
   (main test thread) the `TestingJobManagerRunner` is available now, so the thread can continue. The thread uses the `TestingJobManagerRunner` to complete the result future. The result future will be forwarded in `DispatcherJob`. However, if (dispatcher thread) is not fast enough in registering the forward, the main thread might manage to submit the 2nd job before (dispatcher thread) forwards the completion.
   
   I can reproduce the problem locally by introducing a sleep for 10 milliseconds after offering the `TestingJobManagerRunner` in the `TestingJobManagerRunnerFactory`. This will guarantee that the result forward setup is always happening too late.
   
   In my opinion this test instability is purely an issue of the test, not the Dispatcher implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6650579aa244c53d76237be84b861224c451b4ba Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703478380


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 6650579aa244c53d76237be84b861224c451b4ba (Mon Oct 05 08:16:53 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     }, {
       "hash" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7247",
       "triggerID" : "8842d1927613cdcd09f4e6c671c9cb1dc671269e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6650579aa244c53d76237be84b861224c451b4ba Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7199) 
   * 8842d1927613cdcd09f4e6c671c9cb1dc671269e Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7247) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger merged pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
rmetzger merged pull request #13540:
URL: https://github.com/apache/flink/pull/13540


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13540:
URL: https://github.com/apache/flink/pull/13540#issuecomment-703484500


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6650579aa244c53d76237be84b861224c451b4ba",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6650579aa244c53d76237be84b861224c451b4ba",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6650579aa244c53d76237be84b861224c451b4ba UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org