Posted to commits@flink.apache.org by se...@apache.org on 2015/08/14 16:59:48 UTC
flink git commit: [FLINK-2393] [docs] Updates docs to describe switch between exactly-once and at-least-once

Repository: flink
Updated Branches:
  refs/heads/master 51872d73b -> 21a0c94ba


[FLINK-2393] [docs] Updates docs to describe switch between exactly-once and at-least-once


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/21a0c94b
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/21a0c94b
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/21a0c94b

Branch: refs/heads/master
Commit: 21a0c94baafd77297c8eb88367fc8caaac43d8ee
Parents: 51872d7
Author: Stephan Ewen <se...@apache.org>
Authored: Fri Aug 14 14:13:03 2015 +0200
Committer: Stephan Ewen <se...@apache.org>
Committed: Fri Aug 14 14:13:03 2015 +0200

----------------------------------------------------------------------
 docs/apis/streaming_guide.md           | 27 +++++++++++++++++++++++++--
 docs/internals/stream_checkpointing.md | 19 ++++++++++++++++++-
 2 files changed, 43 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/21a0c94b/docs/apis/streaming_guide.md
----------------------------------------------------------------------
diff --git a/docs/apis/streaming_guide.md b/docs/apis/streaming_guide.md
index e375dab..35f6147 100644
--- a/docs/apis/streaming_guide.md
+++ b/docs/apis/streaming_guide.md
@@ -1274,6 +1274,29 @@ Rich functions provide, in addition to the user-defined function (`map()`, `redu
 
 [Back to top](#top)
 
+
+
+Fault Tolerance
+----------------
+
+Flink has a checkpointing mechanism that recovers streaming jobs after failures. The checkpointing mechanism requires a *persistent* or *durable* source that
+can be asked for prior records again (Apache Kafka is a good example of a durable source).
+
+The checkpointing mechanism stores the progress in the source as well as the user-defined state (see [Stateful Computation](#Stateful_computation))
+consistently to provide *exactly once* processing guarantees.
+
+To enable checkpointing, call `enableCheckpointing(n)` on the `StreamExecutionEnvironment`, where *n* is the checkpoint interval, in milliseconds.
+
+Other parameters for checkpointing include:
+
+  - *Number of retries*: The `setNumberOfExecutionRetries()` method defines how many times the job is restarted after a failure. When checkpointing is activated, but this value is not explicitly set, the job is restarted infinitely often.
+  - *exactly-once vs. at-least-once*: You can optionally pass a mode to the `enableCheckpointing(n)` method to choose between the two guarantee levels. Exactly-once is preferable for most applications. At-least-once may be relevant for certain super-low-latency (consistently a few milliseconds) applications.
+
+The [docs on streaming fault tolerance](../internals/stream_checkpointing.html) describe in detail the technique behind Flink's streaming fault tolerance mechanism.
+
+[Back to top](#top)
+
+
 Stateful computation
 ------------
 
@@ -1488,9 +1511,9 @@ env.genereateSequence(1,10).map(myMap).setBufferTimeout(timeoutMillis)
 To maximise the throughput the user can call `setBufferTimeout(-1)` which will remove the timeout and buffers will only be flushed when they are full.
 To minimise latency, set the timeout to a value close to 0 (for example 5 or 10 ms). Theoretically, a buffer timeout of 0 will cause all output to be flushed when produced, but this setting should be avoided, because it can cause severe performance degradation.
 
-
 [Back to top](#top)
-    
+
+
 Stream connectors
 ----------------
 

http://git-wip-us.apache.org/repos/asf/flink/blob/21a0c94b/docs/internals/stream_checkpointing.md
----------------------------------------------------------------------
diff --git a/docs/internals/stream_checkpointing.md b/docs/internals/stream_checkpointing.md
index 8428911..27eae6b 100644
--- a/docs/internals/stream_checkpointing.md
+++ b/docs/internals/stream_checkpointing.md
@@ -30,7 +30,8 @@ This document describes Flink' fault tolerance mechanism for streaming data flow
 
 Apache Flink offers a fault tolerance mechanism to consistently recover the state of data streaming applications.
 The mechanism ensures that even in the presence of failures, the program's state will eventually reflect every
-record from the data stream **exactly once**.
+record from the data stream **exactly once**. Note that there is a switch to *downgrade* the guarantees to *at least once*
+(described below).
 
 The fault tolerance mechanism continuously draws snapshots of the distributed streaming data flow. For streaming applications
 with small state, these snapshots are very light-weight and can be drawn frequently without impacting the performance much.
@@ -113,6 +114,22 @@ The resulting snapshot now contains:
   <img src="{{ site.baseurl }}/internals/fig/checkpointing.svg" alt="Illustration of the Checkpointing Mechanism" style="width:100%; padding-top:10px; padding-bottom:10px;" />
 </div>
 
+
+### Exactly Once vs. At Least Once
+
+The alignment step may add latency to the streaming program. Usually, this extra latency is on the order of a few milliseconds, but we have seen cases where the latency
+of some outliers increased noticeably. For applications that require consistently super low latencies (a few milliseconds) for all records, Flink has a switch to skip the
+stream alignment during a checkpoint. Checkpoint snapshots are still drawn as soon as an operator has seen the checkpoint barrier from each input.
+
+When the alignment is skipped, an operator keeps processing all inputs, even after some checkpoint barriers for checkpoint *n* have arrived. That way, the operator also processes
+elements that belong to checkpoint *n+1* before the state snapshot for checkpoint *n* is taken.
+On a restore, these records will occur as duplicates, because they are both included in the state snapshot of checkpoint *n* and replayed as part
+of the data after checkpoint *n*.
+
+*NOTE*: Alignment happens only for operators with multiple predecessors (joins) as well as operators with multiple senders (after a stream repartitioning/shuffle).
+Because of that, dataflows with only embarrassingly parallel streaming operations (`map()`, `flatMap()`, `filter()`, ...) actually give *exactly once* guarantees even
+in *at least once* mode.
+
 <!--
 
 ### Asynchronous State Snapshots