You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/10/22 13:20:57 UTC

[GitHub] [flink] azagrebin opened a new pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

azagrebin opened a new pull request #13749:
URL: https://github.com/apache/flink/pull/13749


   ## What is the purpose of the change
   
   When a task fails and it is `RestartPipelinedRegionFailoverStrategy`, all tasks in the region of the failed task and in the downstream regions will be canceled for later re-scheduling. However, these tasks can be still in `CREATED` state so that there is no need to cancel these tasks.
   
   The PR skips canceling these tasks which can speed up the failover and reduce a lot of unnecessary CANCELING logs.ELING logs.
   
   ## Brief change log
   
     - refactor builder for `TestingSchedulingExecutionVertex`
     - add state to executions in `RestartPipelinedRegionFailoverStrategyTest`
     - refactor/deduplicate verification logic in `RestartPipelinedRegionFailoverStrategyTest`
     - add test that executions in `CREATED` state do not get restarted
     - fix expected restarts in `BatchFineGrainedRecoveryITCase` because subsequent mappers do not get restarted anymore if their parent mapper fails
   
   ## Verifying this change
   
   unit tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d1a4068273d54499e90fb54f00044d4955dc5789 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319) 
   * a87afea927d59ffab8d757ae57929f50241141c9 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501) 
   * ee082f24469b5fe9bf6472883c199e6f04e8cb6c UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e35da3721be8a766e339dec86cf858cde021e690 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r510547615



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -260,6 +260,41 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)

Review comment:
       seems overly complex; wouldn't it be sufficient to have `v2` and `v3`?

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -369,35 +252,85 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 
 		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
 
-		// when v3 fails due to internal error, {v3,v4,v5,v6} should be restarted
-		HashSet<ExecutionVertexID> expectedResult = new HashSet<>();
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(), new Exception("Test failure")));
-
-		// when v3 fails to consume from v2, {v1,v2,v3,v4,v5,v6} should be restarted
-		expectedResult.clear();
-		expectedResult.add(v1.getId());
-		expectedResult.add(v2.getId());
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(),
-				new PartitionConnectionException(
-					new ResultPartitionID(
-						v3.getConsumedResults().iterator().next().getId(),
-						new ExecutionAttemptID()),
-					new Exception("Test failure"))));
+		verifyThatFailedExecution(strategy, v3).restarts(v3, v4, v5, v6);
+
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3)
+				.partitionConnectionCause(v2out)
+				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
-	// ------------------------------------------------------------------------
-	//  utilities
-	// ------------------------------------------------------------------------
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)
+	 *
+	 *           ^          ^          ^          ^          ^
+	 *           |          |          |          |          |
+	 *     (pipelined) (blocking) (pipelined) (blocking) (pipelined)
+	 * </pre>
+	 * Component 1: 1,2; component 2: 3,4; component 3: 5,6
+	 */
+	@Test
+	public void testRegionFailoverDoesNotRestartCreatedExecutions() {
+		TestingSchedulingTopology topology = new TestingSchedulingTopology();
+
+		TestingSchedulingExecutionVertex v1 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v2 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v3 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v4 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v5 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v6 = topology.newExecutionVertex(ExecutionState.CREATED);
+
+		topology.connect(v1, v2, ResultPartitionType.PIPELINED);
+		topology.connect(v2, v3, ResultPartitionType.BLOCKING);
+		topology.connect(v3, v4, ResultPartitionType.PIPELINED);
+		topology.connect(v4, v5, ResultPartitionType.BLOCKING);
+		topology.connect(v5, v6, ResultPartitionType.PIPELINED);
+
+		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
+
+		verifyThatFailedExecution(strategy, v3).restarts();
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3).partitionConnectionCause(v2out).restarts();
+	}
+
+	private static VerificationContext verifyThatFailedExecution(
+			FailoverStrategy strategy,
+			SchedulingExecutionVertex executionVertex) {
+		return new VerificationContext(strategy, executionVertex);
+	}
+
+	private static class VerificationContext {
+		private final FailoverStrategy strategy;
+		private final SchedulingExecutionVertex executionVertex;
+		private Throwable cause = new Exception("Test failure");
+
+		private VerificationContext(
+				FailoverStrategy strategy,
+				SchedulingExecutionVertex executionVertex) {
+			this.strategy = strategy;
+			this.executionVertex = executionVertex;
+		}
+
+		private VerificationContext partitionConnectionCause(

Review comment:
       Do we even need this method at all? We could just overload `verifyThatFailedExecution`.

##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -369,35 +252,85 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 
 		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
 
-		// when v3 fails due to internal error, {v3,v4,v5,v6} should be restarted
-		HashSet<ExecutionVertexID> expectedResult = new HashSet<>();
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(), new Exception("Test failure")));
-
-		// when v3 fails to consume from v2, {v1,v2,v3,v4,v5,v6} should be restarted
-		expectedResult.clear();
-		expectedResult.add(v1.getId());
-		expectedResult.add(v2.getId());
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(),
-				new PartitionConnectionException(
-					new ResultPartitionID(
-						v3.getConsumedResults().iterator().next().getId(),
-						new ExecutionAttemptID()),
-					new Exception("Test failure"))));
+		verifyThatFailedExecution(strategy, v3).restarts(v3, v4, v5, v6);
+
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3)
+				.partitionConnectionCause(v2out)
+				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
-	// ------------------------------------------------------------------------
-	//  utilities
-	// ------------------------------------------------------------------------
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)
+	 *
+	 *           ^          ^          ^          ^          ^
+	 *           |          |          |          |          |
+	 *     (pipelined) (blocking) (pipelined) (blocking) (pipelined)
+	 * </pre>
+	 * Component 1: 1,2; component 2: 3,4; component 3: 5,6
+	 */
+	@Test
+	public void testRegionFailoverDoesNotRestartCreatedExecutions() {
+		TestingSchedulingTopology topology = new TestingSchedulingTopology();
+
+		TestingSchedulingExecutionVertex v1 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v2 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v3 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v4 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v5 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v6 = topology.newExecutionVertex(ExecutionState.CREATED);
+
+		topology.connect(v1, v2, ResultPartitionType.PIPELINED);
+		topology.connect(v2, v3, ResultPartitionType.BLOCKING);
+		topology.connect(v3, v4, ResultPartitionType.PIPELINED);
+		topology.connect(v4, v5, ResultPartitionType.BLOCKING);
+		topology.connect(v5, v6, ResultPartitionType.PIPELINED);
+
+		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
+
+		verifyThatFailedExecution(strategy, v3).restarts();
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3).partitionConnectionCause(v2out).restarts();
+	}
+
+	private static VerificationContext verifyThatFailedExecution(
+			FailoverStrategy strategy,
+			SchedulingExecutionVertex executionVertex) {
+		return new VerificationContext(strategy, executionVertex);
+	}
+
+	private static class VerificationContext {
+		private final FailoverStrategy strategy;
+		private final SchedulingExecutionVertex executionVertex;
+		private Throwable cause = new Exception("Test failure");
+
+		private VerificationContext(
+				FailoverStrategy strategy,
+				SchedulingExecutionVertex executionVertex) {
+			this.strategy = strategy;
+			this.executionVertex = executionVertex;
+		}
+
+		private VerificationContext partitionConnectionCause(

Review comment:
       ```suggestion
   		private VerificationContext withPartitionConnectionCause(
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "",
       "status" : "CANCELED",
       "url" : "TBD",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572",
       "triggerID" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 89cbfaa91442b74532ed657a0eb3c558c0215247 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e35da3721be8a766e339dec86cf858cde021e690 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117) 
   * d1a4068273d54499e90fb54f00044d4955dc5789 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-718439759


   @flinkbot run azure


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e35da3721be8a766e339dec86cf858cde021e690 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ee082f24469b5fe9bf6472883c199e6f04e8cb6c Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d1a4068273d54499e90fb54f00044d4955dc5789 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319) 
   * a87afea927d59ffab8d757ae57929f50241141c9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r513530929



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -260,6 +260,41 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)

Review comment:
       I simplified the test




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r512003909



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -369,35 +252,85 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 
 		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
 
-		// when v3 fails due to internal error, {v3,v4,v5,v6} should be restarted
-		HashSet<ExecutionVertexID> expectedResult = new HashSet<>();
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(), new Exception("Test failure")));
-
-		// when v3 fails to consume from v2, {v1,v2,v3,v4,v5,v6} should be restarted
-		expectedResult.clear();
-		expectedResult.add(v1.getId());
-		expectedResult.add(v2.getId());
-		expectedResult.add(v3.getId());
-		expectedResult.add(v4.getId());
-		expectedResult.add(v5.getId());
-		expectedResult.add(v6.getId());
-		assertEquals(expectedResult,
-			strategy.getTasksNeedingRestart(v3.getId(),
-				new PartitionConnectionException(
-					new ResultPartitionID(
-						v3.getConsumedResults().iterator().next().getId(),
-						new ExecutionAttemptID()),
-					new Exception("Test failure"))));
+		verifyThatFailedExecution(strategy, v3).restarts(v3, v4, v5, v6);
+
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3)
+				.partitionConnectionCause(v2out)
+				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
-	// ------------------------------------------------------------------------
-	//  utilities
-	// ------------------------------------------------------------------------
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)
+	 *
+	 *           ^          ^          ^          ^          ^
+	 *           |          |          |          |          |
+	 *     (pipelined) (blocking) (pipelined) (blocking) (pipelined)
+	 * </pre>
+	 * Component 1: 1,2; component 2: 3,4; component 3: 5,6
+	 */
+	@Test
+	public void testRegionFailoverDoesNotRestartCreatedExecutions() {
+		TestingSchedulingTopology topology = new TestingSchedulingTopology();
+
+		TestingSchedulingExecutionVertex v1 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v2 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v3 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v4 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v5 = topology.newExecutionVertex(ExecutionState.CREATED);
+		TestingSchedulingExecutionVertex v6 = topology.newExecutionVertex(ExecutionState.CREATED);
+
+		topology.connect(v1, v2, ResultPartitionType.PIPELINED);
+		topology.connect(v2, v3, ResultPartitionType.BLOCKING);
+		topology.connect(v3, v4, ResultPartitionType.PIPELINED);
+		topology.connect(v4, v5, ResultPartitionType.BLOCKING);
+		topology.connect(v5, v6, ResultPartitionType.PIPELINED);
+
+		RestartPipelinedRegionFailoverStrategy strategy = new RestartPipelinedRegionFailoverStrategy(topology);
+
+		verifyThatFailedExecution(strategy, v3).restarts();
+		TestingSchedulingResultPartition v2out = v3.getConsumedResults().iterator().next();
+		verifyThatFailedExecution(strategy, v3).partitionConnectionCause(v2out).restarts();
+	}
+
+	private static VerificationContext verifyThatFailedExecution(
+			FailoverStrategy strategy,
+			SchedulingExecutionVertex executionVertex) {
+		return new VerificationContext(strategy, executionVertex);
+	}
+
+	private static class VerificationContext {
+		private final FailoverStrategy strategy;
+		private final SchedulingExecutionVertex executionVertex;
+		private Throwable cause = new Exception("Test failure");
+
+		private VerificationContext(
+				FailoverStrategy strategy,
+				SchedulingExecutionVertex executionVertex) {
+			this.strategy = strategy;
+			this.executionVertex = executionVertex;
+		}
+
+		private VerificationContext partitionConnectionCause(

Review comment:
       true, but I think this improves readability




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r512036774



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategyTest.java
##########
@@ -260,6 +260,41 @@ public void testRegionFailoverForMultipleVerticesRegions() throws Exception {
 				.restarts(v1, v2, v3, v4, v5, v6);
 	}
 
+	/**
+	 * Tests region failover does not restart vertexes which are already in initial CREATED state.
+	 * <pre>
+	 *     (v1) ---> (v2) --|--> (v3) ---> (v4) --|--> (v5) ---> (v6)

Review comment:
       true, although if we agree on the next @zhuzhurk's suggestion, I would leave it to include all states which do not require restart.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714491588


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit e35da3721be8a766e339dec86cf858cde021e690 (Thu Oct 22 13:24:18 UTC 2020)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d1a4068273d54499e90fb54f00044d4955dc5789 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319) 
   * a87afea927d59ffab8d757ae57929f50241141c9 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e35da3721be8a766e339dec86cf858cde021e690 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d1a4068273d54499e90fb54f00044d4955dc5789 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin closed pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin closed pull request #13749:
URL: https://github.com/apache/flink/pull/13749


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a87afea927d59ffab8d757ae57929f50241141c9 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501) 
   * ee082f24469b5fe9bf6472883c199e6f04e8cb6c Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * e35da3721be8a766e339dec86cf858cde021e690 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117) 
   * d1a4068273d54499e90fb54f00044d4955dc5789 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "",
       "status" : "CANCELED",
       "url" : "TBD",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 89cbfaa91442b74532ed657a0eb3c558c0215247 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r510545111



##########
File path: flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/strategy/TestingSchedulingTopology.java
##########
@@ -350,12 +352,21 @@ public SchedulingExecutionVerticesBuilder withInputDependencyConstraint(final In
 		public List<TestingSchedulingExecutionVertex> finish() {
 			final List<TestingSchedulingExecutionVertex> vertices = new ArrayList<>();
 			for (int subtaskIndex = 0; subtaskIndex < parallelism; subtaskIndex++) {
-				vertices.add(new TestingSchedulingExecutionVertex(jobVertexId, subtaskIndex, inputDependencyConstraint));
+				vertices.add(createTestingSchedulingExecutionVertex(subtaskIndex));
 			}
 
 			TestingSchedulingTopology.this.addSchedulingExecutionVertices(vertices);
 
 			return vertices;
 		}
+
+		private TestingSchedulingExecutionVertex createTestingSchedulingExecutionVertex(
+					final int subtaskIndex) {

Review comment:
       This wrapping caused by the updated .editorconfig is seriously annoying 💢 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-716612595


   Thanks for the reviews @zentol @zhuzhurk 
   @zhuzhurk I tried what you suggested in the last fixup commit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-718751026


   The CI failure looks unrelated and already filed in [FLINK-19699](https://issues.apache.org/jira/browse/FLINK-19699).
   Passed previously in my [CI](https://dev.azure.com/azagrebin/azagrebin/_build/results?buildId=332&view=results), merging the PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] azagrebin commented on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
azagrebin commented on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-718004619


   After offline discussion, the decision is that further suggested optimisations need more thinking and testing as all consequences are not 100% clear for various types of scheduling. The optimisations can potentially improve processing of huge execution graphs where the runtime matters. However, they will also make the judgement about the process more complicated.
   
   I created a follow-up ticket for further discussions ([FLINK-19860](https://issues.apache.org/jira/browse/FLINK-19860)) and I am merging this PR with its initial intent.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #13749:
URL: https://github.com/apache/flink/pull/13749#issuecomment-714539634


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "e35da3721be8a766e339dec86cf858cde021e690",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8117",
       "triggerID" : "e35da3721be8a766e339dec86cf858cde021e690",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8319",
       "triggerID" : "d1a4068273d54499e90fb54f00044d4955dc5789",
       "triggerType" : "PUSH"
     }, {
       "hash" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8501",
       "triggerID" : "a87afea927d59ffab8d757ae57929f50241141c9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8523",
       "triggerID" : "ee082f24469b5fe9bf6472883c199e6f04e8cb6c",
       "triggerType" : "PUSH"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572",
       "triggerID" : "89cbfaa91442b74532ed657a0eb3c558c0215247",
       "triggerType" : "PUSH"
     }, {
       "hash" : "",
       "status" : "DELETED",
       "url" : "TBD",
       "triggerID" : "718439759",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 89cbfaa91442b74532ed657a0eb3c558c0215247 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=8572) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zhuzhurk commented on a change in pull request #13749: [FLINK-19712][Coordination] Do not restart CREATED executions in RestartPipelinedRegionFailoverStrategy

Posted by GitBox <gi...@apache.org>.
zhuzhurk commented on a change in pull request #13749:
URL: https://github.com/apache/flink/pull/13749#discussion_r511632430



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartPipelinedRegionFailoverStrategy.java
##########
@@ -121,7 +122,12 @@ public RestartPipelinedRegionFailoverStrategy(
 		// calculate the tasks to restart based on the result of regions to restart
 		Set<ExecutionVertexID> tasksToRestart = new HashSet<>();
 		for (SchedulingPipelinedRegion region : getRegionsToRestart(failedRegion)) {
-			region.getVertices().forEach(vertex -> tasksToRestart.add(vertex.getId()));
+			for (SchedulingExecutionVertex vertex : region.getVertices()) {
+				// we do not need to restart tasks which are already in the initial state
+				if (vertex.getState() != ExecutionState.CREATED) {

Review comment:
       How about we skip visiting regions in which all vertices are `CREATED` in `getRegionsToRestart()`?
   This may significantly reduce the computing complexity for most batch job, because we may only need to check N vertices and N edges, instead of KN vertices and K*N^2 edges. (N is the parallelism and K is the JobVertex count) (Exceptional cases are `PartitionException` cases and jobs using `LazyFromSourcesSchedulingStrategy` with `InputDependencyConstraint=ANY` but they may also benefit from it)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org