Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/01/17 14:23:12 UTC

[GitHub] [flink] zentol opened a new pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

zentol opened a new pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887
 
 
   Fixes an instability in the `ZooKeeperLeaderElectionITCase` where the shutdown of the Dispatcher caused a slot allocation to fail. As a result the job failed, reached a terminal state, and was subsequently removed from ZooKeeper.
   
   We now prevent the job from reaching a terminal state by enabling a fixed-delay restart strategy. Should the allocation fail, the JM will retry until the JM itself is shut down. On shutdown the JM will suspend the job, allowing it to be recovered by other Dispatchers.
   
   The exact behavior for what happens to running jobs when the Dispatcher is shut down in an orderly fashion is currently undefined, and this PR makes no attempt to remedy this.
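
To illustrate the reasoning in this description, here is a minimal toy model in plain Java. It is not Flink code: the class `RestartBudgetSketch`, the enum `JobState`, and the method `statesAfterFailures` are hypothetical names introduced only for this sketch. It assumes that a job without any restart budget goes straight to a terminal `FAILED` state on its first failure, while a job with a restart budget moves to a non-terminal `RESTARTING` state until the budget is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Flink code) of how a restart budget decides whether a failure is terminal. */
final class RestartBudgetSketch {

    /** Simplified job states; only FAILED is treated as terminal in this sketch. */
    enum JobState { RUNNING, RESTARTING, FAILED }

    /**
     * Returns the state entered after each consecutive failure.
     * The list ends early once the job reaches the terminal FAILED state.
     */
    static List<JobState> statesAfterFailures(int maxRestarts, int failures) {
        List<JobState> states = new ArrayList<>();
        int restartsUsed = 0;
        for (int i = 0; i < failures; i++) {
            if (restartsUsed < maxRestarts) {
                // Budget left: the failure is absorbed and the job stays recoverable.
                restartsUsed++;
                states.add(JobState.RESTARTING);
            } else {
                // Budget exhausted (or never granted): the failure is terminal.
                states.add(JobState.FAILED);
                break;
            }
        }
        return states;
    }

    public static void main(String[] args) {
        // One allocation failure during shutdown, without and with a restart budget.
        System.out.println("no restarts: " + statesAfterFailures(0, 1));
        System.out.println("10 restarts: " + statesAfterFailures(10, 1));
    }
}
```

In this model a single failure with a budget of 0 ends in `FAILED`, whereas the same failure with a budget of 10 leaves the job in `RESTARTING`, which is the property the test relies on so that the job can still be recovered by another Dispatcher.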

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-575653696
 
 
   <!--
   Meta data
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144944259 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   -->
   ## CI report:
   
   * 7f91b3855bca5e7e2d2d9abf196cca861e122312 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144944259) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


[GitHub] [flink] flinkbot commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-575647657
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 7f91b3855bca5e7e2d2d9abf196cca861e122312 (Fri Jan 17 14:27:53 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.
   
   The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


[GitHub] [flink] zentol commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#discussion_r369171452
 
 

 ##########
 File path: flink-tests/src/test/java/org/apache/flink/test/runtime/leaderelection/ZooKeeperLeaderElectionITCase.java
 ##########
 @@ -141,13 +144,23 @@ private DispatcherGateway getNextLeadingDispatcherGateway(TestingMiniCluster min
 		return miniCluster.getDispatcherGatewayFuture().get();
 	}
 
-	private JobGraph createJobGraph(int parallelism) {
+	private JobGraph createJobGraph(int parallelism) throws IOException {
 		BlockingOperator.isBlocking = true;
 		final JobVertex vertex = new JobVertex("blocking operator");
 		vertex.setParallelism(parallelism);
 		vertex.setInvokableClass(BlockingOperator.class);
 
-		return new JobGraph("Blocking test job", vertex);
+		JobGraph jobGraph = new JobGraph("Blocking test job", vertex);
+
+		// explicitly allow restarts; this is necessary since the shutdown may result in the job failing and hence being
+		// removed from ZooKeeper. What happens to running jobs if the Dispatcher shuts down in an orderly fashion
+		// is undefined behavior. By allowing restarts we prevent the job from reaching a globally terminal state,
+		// causing it to be recovered by the next Dispatcher.
+		ExecutionConfig executionConfig = new ExecutionConfig();
+		executionConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Duration.ofSeconds(10).toMillis()));
 
 Review comment:
   My idea here was to not have the job actually restart since it shouldn't be relevant to the test whether the job is running/restarting (just that it's not failed), and these state transitions add additional noise to the logs.
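
The trade-off in this comment can be made concrete with some rough arithmetic. The sketch below is plain Java, not Flink code; `RestartNoiseSketch`, `restartsWithin`, and `coveredTime` are hypothetical names, and the 2-second shutdown window is an assumed figure chosen only for illustration. It further assumes that every restart attempt fails immediately, so the only time that passes is the configured delay. The two configurations compared are the one in the diff (10 attempts, 10 s delay) and the reviewer's suggested alternative from this thread (100 attempts, 100 ms delay).

```java
import java.time.Duration;

/** Back-of-the-envelope comparison (not Flink code) of two fixed-delay restart settings. */
final class RestartNoiseSketch {

    /**
     * Upper bound on restarts that can start within the window, assuming each
     * attempt fails immediately and only the delay consumes time.
     */
    static long restartsWithin(Duration window, int maxAttempts, Duration delay) {
        long delaysThatFit = window.toMillis() / delay.toMillis();
        return Math.min(maxAttempts, delaysThatFit);
    }

    /** Total time the restart budget can keep the job out of a terminal state. */
    static Duration coveredTime(int maxAttempts, Duration delay) {
        return delay.multipliedBy(maxAttempts);
    }

    public static void main(String[] args) {
        Duration window = Duration.ofSeconds(2); // assumed shutdown window

        long slow = restartsWithin(window, 10, Duration.ofSeconds(10));
        long fast = restartsWithin(window, 100, Duration.ofMillis(100));

        System.out.println("10 x 10s:    restarts in window = " + slow
                + ", covered seconds = " + coveredTime(10, Duration.ofSeconds(10)).getSeconds());
        System.out.println("100 x 100ms: restarts in window = " + fast
                + ", covered seconds = " + coveredTime(100, Duration.ofMillis(100)).getSeconds());
    }
}
```

Under these assumptions the long delay yields no actual restart inside the window, while the short delay could yield up to 20, each producing extra state transitions in the logs.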


[GitHub] [flink] flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-575653696
 
 
   <!--
   Meta data
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/144944259 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   -->
   ## CI report:
   
   * 7f91b3855bca5e7e2d2d9abf196cca861e122312 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/144944259) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440) 
   


[GitHub] [flink] zentol merged pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
zentol merged pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887
 
 
   


[GitHub] [flink] flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-575653696
 
 
   <!--
   Meta data
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/144944259 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440 TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   -->
   ## CI report:
   
   * 7f91b3855bca5e7e2d2d9abf196cca861e122312 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/144944259) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4440) 
   


[GitHub] [flink] tillrohrmann commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#discussion_r369546055
 
 

 ##########
 File path: flink-tests/src/test/java/org/apache/flink/test/runtime/leaderelection/ZooKeeperLeaderElectionITCase.java
 ##########
 @@ -141,13 +144,23 @@ private DispatcherGateway getNextLeadingDispatcherGateway(TestingMiniCluster min
 		return miniCluster.getDispatcherGatewayFuture().get();
 	}
 
-	private JobGraph createJobGraph(int parallelism) {
+	private JobGraph createJobGraph(int parallelism) throws IOException {
 		BlockingOperator.isBlocking = true;
 		final JobVertex vertex = new JobVertex("blocking operator");
 		vertex.setParallelism(parallelism);
 		vertex.setInvokableClass(BlockingOperator.class);
 
-		return new JobGraph("Blocking test job", vertex);
+		JobGraph jobGraph = new JobGraph("Blocking test job", vertex);
+
+		// explicitly allow restarts; this is necessary since the shutdown may result in the job failing and hence being
+		// removed from ZooKeeper. What happens to running jobs if the Dispatcher shuts down in an orderly fashion
+		// is undefined behavior. By allowing restarts we prevent the job from reaching a globally terminal state,
+		// causing it to be recovered by the next Dispatcher.
+		ExecutionConfig executionConfig = new ExecutionConfig();
+		executionConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Duration.ofSeconds(10).toMillis()));
 
 Review comment:
   Good point. Then ignore my comment.


[GitHub] [flink] zentol commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
zentol commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-576822678
 
 
   Ensuring that the RM shuts down after the Dispatcher conceptually makes sense (assuming the RM is responsible for managing the Dispatcher component); what I'm wondering is whether we can maintain this contract (if we even want to define it as such) should we ever move the RM into a separate process.


[GitHub] [flink] tillrohrmann commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
tillrohrmann commented on a change in pull request #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#discussion_r369163346
 
 

 ##########
 File path: flink-tests/src/test/java/org/apache/flink/test/runtime/leaderelection/ZooKeeperLeaderElectionITCase.java
 ##########
 @@ -141,13 +144,23 @@ private DispatcherGateway getNextLeadingDispatcherGateway(TestingMiniCluster min
 		return miniCluster.getDispatcherGatewayFuture().get();
 	}
 
-	private JobGraph createJobGraph(int parallelism) {
+	private JobGraph createJobGraph(int parallelism) throws IOException {
 		BlockingOperator.isBlocking = true;
 		final JobVertex vertex = new JobVertex("blocking operator");
 		vertex.setParallelism(parallelism);
 		vertex.setInvokableClass(BlockingOperator.class);
 
-		return new JobGraph("Blocking test job", vertex);
+		JobGraph jobGraph = new JobGraph("Blocking test job", vertex);
+
+		// explicitly allow restarts; this is necessary since the shutdown may result in the job failing and hence being
+		// removed from ZooKeeper. What happens to running jobs if the Dispatcher shuts down in an orderly fashion
+		// is undefined behavior. By allowing restarts we prevent the job from reaching a globally terminal state,
+		// causing it to be recovered by the next Dispatcher.
+		ExecutionConfig executionConfig = new ExecutionConfig();
+		executionConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, Duration.ofSeconds(10).toMillis()));
 
 Review comment:
   ```suggestion
   		executionConfig.setRestartStrategy(RestartStrategies.fixedDelayRestart(100, Duration.ofMillis(100).toMillis()));
   ```


[GitHub] [flink] flinkbot commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10887: [FLINK-15150][tests] Prevent job from reaching terminal state
URL: https://github.com/apache/flink/pull/10887#issuecomment-575653696
 
 
   <!--
   Meta data
   Hash:7f91b3855bca5e7e2d2d9abf196cca861e122312 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:7f91b3855bca5e7e2d2d9abf196cca861e122312
   -->
   ## CI report:
   
   * 7f91b3855bca5e7e2d2d9abf196cca861e122312 UNKNOWN
   
