You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/10/06 11:28:34 UTC

[GitHub] [flink] rmetzger commented on a change in pull request #13540: [FLINK-19344] Fix DispatcherResourceCleanupTest race condition

rmetzger commented on a change in pull request #13540:
URL: https://github.com/apache/flink/pull/13540#discussion_r500200594



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherResourceCleanupTest.java
##########
@@ -350,6 +350,10 @@ public void testJobSubmissionUnderSameJobId() throws Exception {
 		final TestingJobManagerRunner testingJobManagerRunner = jobManagerRunnerFactory.takeCreatedJobManagerRunner();
 		testingJobManagerRunner.completeResultFutureExceptionally(new JobNotFinishedException(jobId));
 
+		// wait until termination JobManagerRunner closeAsync has been called.
+		// this is necessary to avoid race conditions with completion of the 1st job and the submission of the 2nd job (DuplicateJobSubmissionException).
+		testingJobManagerRunner.getCloseAsyncCalledFuture().get();
+

Review comment:
       Thanks a lot for your review.
   
   
   The `DuplicateJobSubmissionException` exception is thrown, because the same job is already in the  `runningJobs` list. 
   
   I believe there is a race condition between the main thread executing the test and the dispatcher main thread: 
   (dispatcher thread) As soon as the `TestingJobManagerRunner` has been created, it is offered to a queue in the `TestingJobManagerRunnerFactory`. The (main test thread) is waiting on the queue. The next step for this thread will be updating the DispatcherJob's status, and setting up a forward of the result future from the jobManagerRunner.
   (main test thread) the `TestingJobManagerRunner` is available now, so the thread can continue. The thread uses the `TestingJobManagerRunner` to complete the result future. The result future will be forwarded in `DispatcherJob`. However, if (dispatcher thread) is not fast enough in registering the forward, the main thread might manage to submit the 2nd job before (dispatcher thread) forwards the completion.
   
   I can reproduce the problem locally by introducing a sleep for 10 milliseconds after offering the `TestingJobManagerRunner` in the `TestingJobManagerRunnerFactory`. This will guarantee that the result forward setup is always happening too late.
   
   In my opinion this test instability is purely an issue of the test, not the Dispatcher implementation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org