Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/07/10 12:16:51 UTC

[GitHub] [flink] zentol commented on a change in pull request #9038: [FLINK-13169][tests][coordination] IT test for fine-grained recovery (task executor failures)

URL: https://github.com/apache/flink/pull/9038#discussion_r302032425
 
 

 ##########
 File path: flink-tests/src/test/java/org/apache/flink/test/recovery/BatchFineGrainedRecoveryITCase.java
 ##########
 @@ -59,19 +68,35 @@
  * the next mapper starts when the previous is done. The mappers are not chained into one task which makes them
  * separate fail-over regions.
  *
- * <p>The test verifies that fine-grained recovery works by randomly incuding failures in any of the mappers.
- * Since all mappers are connected via blocking partitions, which should be re-used on failure, and the consumer
- * of the mapper wasn't deployed yet, as the consumed partition was not fully produced yet, only the failed mapper
- * should actually restart.
+ * <p>The test verifies that fine-grained recovery works by randomly including failures in any of the mappers.
+ * There are multiple failure strategies:
+ *
+ * <ul>
+ *   <li> The {@link RandomExceptionFailureStrategy} throws an exception in the user function code.
+ *   Since all mappers are connected via blocking partitions, which should be re-used on failure, and the consumer
+ *   of the mapper wasn't deployed yet, as the consumed partition was not fully produced yet, only the failed mapper
+ *   should actually restart.
+ *   <li> The {@link RandomTaskExecutorFailureStrategy} abruptly shuts down the task executor. This leads to the loss
+ *   of all previously completed and the in-progress mapper result partitions. The fail-over strategy should restart
+ *   the current in-progress mapper which will get the {@link PartitionNotFoundException} because the previous result
+ *   becomes unavailable and the previous mapper has to be restarted as well. The same should happen subsequently with
+ *   all previous mappers. When the source is recomputed, all mappers have to be restarted again to recalculate their
+ *   lost results.
+ * </ul>
  */
 public class BatchFineGrainedRecoveryITCase extends TestLogger {
+	private static final Logger LOG = LoggerFactory.getLogger(BatchFineGrainedRecoveryITCase.class);
+
 	private static final int EMITTED_RECORD_NUMBER = 1000;
-	private static final int MAX_FAILURE_NUMBER = 10;
 	private static final int MAP_NUMBER = 3;
+	private static final int MAX_MAP_FAILURES = 4;
+	private static final int MAX_JOB_RESTART_ATTEMPTS = MAP_NUMBER * (MAP_NUMBER + 1) * MAX_MAP_FAILURES / 2;
 
 Review comment:
   please add a comment for how you arrived at this formula.
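One plausible derivation of the formula, sketched below as an assumption on my part rather than the PR author's confirmed reasoning: with the task-executor failure strategy, a failure during mapper i (1-based) can force mappers 1..i to be restarted, i.e. i restarts. Summing the worst case over all MAP_NUMBER mappers gives the triangular number MAP_NUMBER * (MAP_NUMBER + 1) / 2 restarts per round of failures, and each mapper may fail up to MAX_MAP_FAILURES times:

```java
public class RestartBudgetSketch {
    public static void main(String[] args) {
        final int MAP_NUMBER = 3;
        final int MAX_MAP_FAILURES = 4;

        // Worst case per failure round: a failure in mapper i cascades back
        // through mappers i, i-1, ..., 1, i.e. i restarts. Summed over all
        // mappers this is 1 + 2 + ... + MAP_NUMBER, the triangular number
        // MAP_NUMBER * (MAP_NUMBER + 1) / 2.
        int perFailureRound = 0;
        for (int i = 1; i <= MAP_NUMBER; i++) {
            perFailureRound += i;
        }

        // Each mapper may fail up to MAX_MAP_FAILURES times, so the budget is
        // the per-round worst case multiplied by the failure count.
        int maxRestarts = perFailureRound * MAX_MAP_FAILURES;
        System.out.println(maxRestarts); // prints 24 (= 3 * 4 * 4 / 2)
    }
}
```

If this matches the author's intent, a one-line comment such as "worst case: each of MAX_MAP_FAILURES failures cascades through 1 + 2 + ... + MAP_NUMBER mappers" would answer the review question.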

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services