You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/10/14 14:37:12 UTC

[GitHub] [flink] rmetzger opened a new pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

rmetzger opened a new pull request #13639:
URL: https://github.com/apache/flink/pull/13639


   ## What is the purpose of the change
   
   Fix rejected slot offer bug in JobMaster. The bug surfaced as a test instability in `LeaderChangeClusterComponentsTest.testReelectionOfJobMaster`.
   
   
   ## Brief change log
   
   - Move disconnect of TaskManagers to the Dispatcher.suspend() method.
   - Add a test
   
   
   ## Verifying this change
   
   Covered by a dedicated test.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / *no*)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / *no*)
     - The serializers: (yes / *no* / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / *no* / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (*yes* / no / don't know)
     - The S3 file system connector: (yes / *no* / don't know)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708447990


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 4bac1d3f88930c269cab61688034af64b0f20a10 (Wed Oct 14 14:39:27 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708981296


   Addressed & rebased. Will merge once CI has passed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on a change in pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger commented on a change in pull request #13639:
URL: https://github.com/apache/flink/pull/13639#discussion_r504731335



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
##########
@@ -280,6 +286,65 @@ public static void teardownClass() {
 		}
 	}
 
+	/**
+	 * This test ensures that the bookkeeping of TaskExecutors in the JobMaster handles cases where TaskExecutors with the same
+	 * ID re-register properly. FLINK-19237 was a bug where the TaskExecutors and the SlotPool got out of sync, and
+	 * slot offers were rejected.
+	 */
+	@Test
+	public void testAcceptSlotOfferAfterLeaderchange() throws Exception {
+
+		final JobManagerSharedServices jobManagerSharedServices = new TestingJobManagerSharedServicesBuilder().build();
+		final JobMasterConfiguration jobMasterConfiguration = JobMasterConfiguration.fromConfiguration(configuration);
+
+		final SchedulerNGFactory schedulerNGFactory = SchedulerNGFactoryFactory.createSchedulerNGFactory(configuration);
+
+		final TestingJobMaster jobMaster = new TestingJobMaster(
+			rpcService,
+			jobMasterConfiguration,
+			jmResourceId,
+			jobGraph,
+			haServices,
+			SlotPoolFactory.fromConfiguration(configuration),
+			jobManagerSharedServices,
+			heartbeatServices,
+			UnregisteredJobManagerJobMetricGroupFactory.INSTANCE,
+			new JobMasterBuilder.TestingOnCompletionActions(),
+			testingFatalErrorHandler,
+			JobMasterTest.class.getClassLoader(),
+			schedulerNGFactory,
+			NettyShuffleMaster.INSTANCE,
+			NoOpJobMasterPartitionTracker.FACTORY,
+			new DefaultExecutionDeploymentTracker(),
+			DefaultExecutionDeploymentReconciler::new,
+			System.currentTimeMillis());
+
+		jobMaster.start(jobMasterId).get();
+
+		log.info("Register TaskManager");
+
+		String testingTaskManagerAddress = "fake";
+		UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new LocalUnresolvedTaskManagerLocation();
+		TestingTaskExecutorGateway testingTaskExecutorGateway = new TestingTaskExecutorGatewayBuilder().createTestingTaskExecutorGateway();
+		rpcService.registerGateway(testingTaskManagerAddress, testingTaskExecutorGateway);
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Revoke leadership & re-grant leadership");
+		jobMaster.suspend(new FlinkException("Lost leadership")).get();
+
+		jobMaster.start(JobMasterId.generate()).get();
+
+		log.info("re-register same TaskManager");
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Ensure JobMaster accepts slot offer");
+		final SlotOffer slotOffer = new SlotOffer(new AllocationID(), 0, ResourceProfile.ANY);
+
+		Collection<SlotOffer> acceptedSlots = jobMaster.executeInMainThreadExecutor(() -> jobMaster.offerSlots(unresolvedTaskManagerLocation.getResourceID(), Collections.singleton(slotOffer), testingTimeout).get()).get();
+		Assert.assertThat(acceptedSlots.size(), is(1));
+

Review comment:
       I will remove this empty line before merging or when addressing comments.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674",
       "triggerID" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "triggerType" : "PUSH"
     }, {
       "hash" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2876ce46b800db5d0fe10f12ebf55efd1ab22ace Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674) 
   * edc867bcfe18df71043801cb8a38eb0ed2d4d502 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4bac1d3f88930c269cab61688034af64b0f20a10 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on a change in pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger commented on a change in pull request #13639:
URL: https://github.com/apache/flink/pull/13639#discussion_r505317363



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
##########
@@ -280,6 +286,65 @@ public static void teardownClass() {
 		}
 	}
 
+	/**
+	 * This test ensures that the bookkeeping of TaskExecutors in the JobMaster handles cases where TaskExecutors with the same
+	 * ID re-register properly. FLINK-19237 was a bug where the TaskExecutors and the SlotPool got out of sync, and
+	 * slot offers were rejected.
+	 */
+	@Test
+	public void testAcceptSlotOfferAfterLeaderchange() throws Exception {
+
+		final JobManagerSharedServices jobManagerSharedServices = new TestingJobManagerSharedServicesBuilder().build();
+		final JobMasterConfiguration jobMasterConfiguration = JobMasterConfiguration.fromConfiguration(configuration);
+
+		final SchedulerNGFactory schedulerNGFactory = SchedulerNGFactoryFactory.createSchedulerNGFactory(configuration);
+
+		final TestingJobMaster jobMaster = new TestingJobMaster(
+			rpcService,
+			jobMasterConfiguration,
+			jmResourceId,
+			jobGraph,
+			haServices,
+			SlotPoolFactory.fromConfiguration(configuration),
+			jobManagerSharedServices,
+			heartbeatServices,
+			UnregisteredJobManagerJobMetricGroupFactory.INSTANCE,
+			new JobMasterBuilder.TestingOnCompletionActions(),
+			testingFatalErrorHandler,
+			JobMasterTest.class.getClassLoader(),
+			schedulerNGFactory,
+			NettyShuffleMaster.INSTANCE,
+			NoOpJobMasterPartitionTracker.FACTORY,
+			new DefaultExecutionDeploymentTracker(),
+			DefaultExecutionDeploymentReconciler::new,
+			System.currentTimeMillis());
+
+		jobMaster.start(jobMasterId).get();
+
+		log.info("Register TaskManager");
+
+		String testingTaskManagerAddress = "fake";
+		UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new LocalUnresolvedTaskManagerLocation();
+		TestingTaskExecutorGateway testingTaskExecutorGateway = new TestingTaskExecutorGatewayBuilder().createTestingTaskExecutorGateway();
+		rpcService.registerGateway(testingTaskManagerAddress, testingTaskExecutorGateway);
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Revoke leadership & re-grant leadership");
+		jobMaster.suspend(new FlinkException("Lost leadership")).get();
+
+		jobMaster.start(JobMasterId.generate()).get();
+
+		log.info("re-register same TaskManager");
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Ensure JobMaster accepts slot offer");
+		final SlotOffer slotOffer = new SlotOffer(new AllocationID(), 0, ResourceProfile.ANY);
+
+		Collection<SlotOffer> acceptedSlots = jobMaster.executeInMainThreadExecutor(() -> jobMaster.offerSlots(unresolvedTaskManagerLocation.getResourceID(), Collections.singleton(slotOffer), testingTimeout).get()).get();
+		Assert.assertThat(acceptedSlots.size(), is(1));

Review comment:
       Ha! Thanks. The old rule applied here again: I was doing something hacky -- which is a clear indicator of missing something obvious ;) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4bac1d3f88930c269cab61688034af64b0f20a10 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623) 
   * e5b605b27d5f0f75a623b31065c3062894c97725 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-709544783


   CI has passed. Merging ...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674",
       "triggerID" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "triggerType" : "PUSH"
     }, {
       "hash" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7698",
       "triggerID" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2876ce46b800db5d0fe10f12ebf55efd1ab22ace Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674) 
   * edc867bcfe18df71043801cb8a38eb0ed2d4d502 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7698) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e5b605b27d5f0f75a623b31065c3062894c97725 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662) 
   * 2876ce46b800db5d0fe10f12ebf55efd1ab22ace UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708970927


   Thank you for your review. I'll address your comments.
   
   According to my experiments, this PR fixes the issue reported in FLINK-19237, but the other problem you mentioned started surfacing when running the test repeatedly : https://issues.apache.org/jira/browse/FLINK-18293?focusedCommentId=17213165&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17213165
   I would risk merging the PR w/o disabling the test. I anticipate a very low failure rate, if at all.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674",
       "triggerID" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "triggerType" : "PUSH"
     }, {
       "hash" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7698",
       "triggerID" : "edc867bcfe18df71043801cb8a38eb0ed2d4d502",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * edc867bcfe18df71043801cb8a38eb0ed2d4d502 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7698) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4bac1d3f88930c269cab61688034af64b0f20a10 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rmetzger merged pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
rmetzger merged pull request #13639:
URL: https://github.com/apache/flink/pull/13639


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708950024


   Does this PR fully fixes FLINK-19237 or is there still the problem of offering not fully freed slots?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] tillrohrmann commented on a change in pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #13639:
URL: https://github.com/apache/flink/pull/13639#discussion_r505238512



##########
File path: flink-core/src/test/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializerConcurrencyCheckInactiveITCase.java
##########
@@ -47,6 +50,7 @@
 	 */
 	@Test
 	public void testWithNoConcurrencyCheck() throws Exception {
+		assumeFalse(log.isDebugEnabled()); // this test will fail on DEBUG log level.

Review comment:
       I think it would be good to explain why the test fails.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
##########
@@ -280,6 +286,65 @@ public static void teardownClass() {
 		}
 	}
 
+	/**
+	 * This test ensures that the bookkeeping of TaskExecutors in the JobMaster handles cases where TaskExecutors with the same
+	 * ID re-register properly. FLINK-19237 was a bug where the TaskExecutors and the SlotPool got out of sync, and
+	 * slot offers were rejected.
+	 */
+	@Test
+	public void testAcceptSlotOfferAfterLeaderchange() throws Exception {

Review comment:
       ```suggestion
   	public void testAcceptSlotOfferAfterLeaderChange() throws Exception {
   ```

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
##########
@@ -2177,4 +2242,58 @@ public void disposeStorageLocation() throws IOException {
 
 		}
 	}
+
+
+	private static class TestingJobMaster extends JobMaster {
+
+		public TestingJobMaster(RpcService rpcService,
+				JobMasterConfiguration jobMasterConfiguration,
+				ResourceID resourceId,
+				JobGraph jobGraph,
+				HighAvailabilityServices highAvailabilityService,
+				SlotPoolFactory slotPoolFactory,
+				JobManagerSharedServices jobManagerSharedServices,
+				HeartbeatServices heartbeatServices,
+				JobManagerJobMetricGroupFactory jobMetricGroupFactory,
+				OnCompletionActions jobCompletionActions,
+				FatalErrorHandler fatalErrorHandler,
+				ClassLoader userCodeLoader,
+				SchedulerNGFactory schedulerNGFactory,
+				ShuffleMaster<?> shuffleMaster,
+				PartitionTrackerFactory partitionTrackerFactory,
+				ExecutionDeploymentTracker executionDeploymentTracker,
+				ExecutionDeploymentReconciler.Factory executionDeploymentReconcilerFactory,
+				long initializationTimestamp) throws Exception {
+			super(rpcService,
+				jobMasterConfiguration,
+				resourceId,
+				jobGraph,
+				highAvailabilityService,
+				slotPoolFactory,
+				jobManagerSharedServices,
+				heartbeatServices,
+				jobMetricGroupFactory,
+				jobCompletionActions,
+				fatalErrorHandler,
+				userCodeLoader,
+				schedulerNGFactory,
+				shuffleMaster,
+				partitionTrackerFactory,
+				executionDeploymentTracker,
+				executionDeploymentReconcilerFactory,
+				initializationTimestamp);
+		}
+
+		public <T> CompletableFuture<T> executeInMainThreadExecutor(SupplierWithException<T, Throwable> runnable) {
+			CompletableFuture<T> result = new CompletableFuture<>();
+			getMainThreadExecutor().execute(() -> {
+				try {
+					result.complete(runnable.get());
+				} catch (Throwable throwable) {
+					result.completeExceptionally(throwable);
+				}
+			});
+			return result;
+		}
+	}

Review comment:
       This class should not be needed.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
##########
@@ -280,6 +286,65 @@ public static void teardownClass() {
 		}
 	}
 
+	/**
+	 * This test ensures that the bookkeeping of TaskExecutors in the JobMaster handles cases where TaskExecutors with the same
+	 * ID re-register properly. FLINK-19237 was a bug where the TaskExecutors and the SlotPool got out of sync, and
+	 * slot offers were rejected.
+	 */
+	@Test
+	public void testAcceptSlotOfferAfterLeaderchange() throws Exception {
+
+		final JobManagerSharedServices jobManagerSharedServices = new TestingJobManagerSharedServicesBuilder().build();
+		final JobMasterConfiguration jobMasterConfiguration = JobMasterConfiguration.fromConfiguration(configuration);
+
+		final SchedulerNGFactory schedulerNGFactory = SchedulerNGFactoryFactory.createSchedulerNGFactory(configuration);
+
+		final TestingJobMaster jobMaster = new TestingJobMaster(
+			rpcService,
+			jobMasterConfiguration,
+			jmResourceId,
+			jobGraph,
+			haServices,
+			SlotPoolFactory.fromConfiguration(configuration),
+			jobManagerSharedServices,
+			heartbeatServices,
+			UnregisteredJobManagerJobMetricGroupFactory.INSTANCE,
+			new JobMasterBuilder.TestingOnCompletionActions(),
+			testingFatalErrorHandler,
+			JobMasterTest.class.getClassLoader(),
+			schedulerNGFactory,
+			NettyShuffleMaster.INSTANCE,
+			NoOpJobMasterPartitionTracker.FACTORY,
+			new DefaultExecutionDeploymentTracker(),
+			DefaultExecutionDeploymentReconciler::new,
+			System.currentTimeMillis());
+
+		jobMaster.start(jobMasterId).get();
+
+		log.info("Register TaskManager");
+
+		String testingTaskManagerAddress = "fake";
+		UnresolvedTaskManagerLocation unresolvedTaskManagerLocation = new LocalUnresolvedTaskManagerLocation();
+		TestingTaskExecutorGateway testingTaskExecutorGateway = new TestingTaskExecutorGatewayBuilder().createTestingTaskExecutorGateway();
+		rpcService.registerGateway(testingTaskManagerAddress, testingTaskExecutorGateway);
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Revoke leadership & re-grant leadership");
+		jobMaster.suspend(new FlinkException("Lost leadership")).get();
+
+		jobMaster.start(JobMasterId.generate()).get();
+
+		log.info("re-register same TaskManager");
+		Assert.assertThat(jobMaster.registerTaskManager(testingTaskManagerAddress, unresolvedTaskManagerLocation, testingTimeout).get(), instanceOf(RegistrationResponse.Success.class));
+
+		log.info("Ensure JobMaster accepts slot offer");
+		final SlotOffer slotOffer = new SlotOffer(new AllocationID(), 0, ResourceProfile.ANY);
+
+		Collection<SlotOffer> acceptedSlots = jobMaster.executeInMainThreadExecutor(() -> jobMaster.offerSlots(unresolvedTaskManagerLocation.getResourceID(), Collections.singleton(slotOffer), testingTimeout).get()).get();
+		Assert.assertThat(acceptedSlots.size(), is(1));

Review comment:
       Instead of calling methods directly on the `JobMaster` I suggest to retrieve the self gateway via `jobMaster.getSelfGateway(JobMasterGateway.class)` and then to use this interface to send RPCs. That way, we also don't need `executeInMainThreadExecutor`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 4bac1d3f88930c269cab61688034af64b0f20a10 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13639: [FLINK-19237] Fix rejected slot offer bug in JobMaster

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13639:
URL: https://github.com/apache/flink/pull/13639#issuecomment-708476237


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7623",
       "triggerID" : "4bac1d3f88930c269cab61688034af64b0f20a10",
       "triggerType" : "PUSH"
     }, {
       "hash" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7662",
       "triggerID" : "e5b605b27d5f0f75a623b31065c3062894c97725",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674",
       "triggerID" : "2876ce46b800db5d0fe10f12ebf55efd1ab22ace",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2876ce46b800db5d0fe10f12ebf55efd1ab22ace Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=7674) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org