You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/07/23 14:10:52 UTC
[GitHub] [flink] tillrohrmann commented on a change in pull request #9113: [FLINK-13222] [runtime] Add documentation for failover strategy option

tillrohrmann commented on a change in pull request #9113: [FLINK-13222] [runtime] Add documentation for failover strategy option
URL: https://github.com/apache/flink/pull/9113#discussion_r306339896
 
 

 ##########
 File path: docs/dev/task_failure_recovery.md
 ##########
 @@ -264,4 +267,54 @@ The cluster defined restart strategy is used.
 This is helpful for streaming programs which enable checkpointing.
 By default, a fixed delay restart strategy is chosen if there is no other restart strategy defined.
 
+## Failover Strategies
+
+Flink supports different failover strategies which can be configured via the configuration parameter
+*jobmanager.execution.failover-strategy* in Flink's configuration file `flink-conf.yaml`.
+
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 50%">Failover Strategy</th>
+      <th class="text-left">Value for jobmanager.execution.failover-strategy</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+        <td>Restart all</td>
+        <td>full</td>
+    </tr>
+    <tr>
+        <td>Restart pipelined region</td>
+        <td>region</td>
+    </tr>
+  </tbody>
+</table>
+
+### Restart All Failover Strategy
+
+This strategy restarts all tasks in the job to recover from a task failure.
+
+### Restart Pipelined Region Failover Strategy
+
+This strategy groups tasks into disjoint regions. When a task failure is detected, 
+this strategy computes the smallest set of regions that must be restarted to recover from the failure. 
+For some jobs this can result in fewer tasks that will be restarted compared to the Restart All Failover Strategy.
+
+A region is a set of tasks that communicate via pipelined data exchanges. 
+That is, batch data exchanges denote the boundaries of a region.
+- All data exchanges in a DataStream job or Streaming Table/SQL job are pipelined.
+- All data exchanges in a Batch Table/SQL job are batched by default.
+- The data exchange types in a DataSet job are determined by the 
+  [ExecutionMode]({{ site.javadocs_baseurl }}/api/java/org/apache/flink/api/common/ExecutionMode.html) 
+  which can be set through [ExecutionConfig]({{ site.baseurl }}/dev/execution_configuration.html).
+
+The regions to restart are decided as below:
+1. The region containing the failed task will be restarted.
+2. If a result partition is not available while it is required by a region that will be restarted,
+   the region producing the result partition will be restarted as well.
+3. If a region is to be restarted, all of its consumer regions will also be restarted. This is to guarantee
+   data consistency because nondeterministic processing or partitioning can result in that a partition
 
 Review comment:
   nit: maybe replace `in that a partition ...` with `in different result partitions`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services