You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@flink.apache.org by se...@apache.org on 2018/05/03 13:03:41 UTC

[3/3] flink git commit: [FLINK-9288] [docs] clarify the event time / watermark docs

[FLINK-9288] [docs] clarify the event time / watermark docs

This closes #5949


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/90855b63
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/90855b63
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/90855b63

Branch: refs/heads/master
Commit: 90855b638caebd1032e699528a3e0bd232b7c95a
Parents: a904964
Author: David Anderson <da...@alpinegizmo.com>
Authored: Wed May 2 15:50:48 2018 +0400
Committer: Stephan Ewen <se...@apache.org>
Committed: Thu May 3 11:19:36 2018 +0200

----------------------------------------------------------------------
 docs/dev/event_time.md                  | 39 ++++++++++++++++------------
 docs/dev/event_timestamps_watermarks.md |  6 ++---
 2 files changed, 25 insertions(+), 20 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/90855b63/docs/dev/event_time.md
----------------------------------------------------------------------
diff --git a/docs/dev/event_time.md b/docs/dev/event_time.md
index a3e697d..886bf22 100644
--- a/docs/dev/event_time.md
+++ b/docs/dev/event_time.md
@@ -35,31 +35,34 @@ Flink supports different notions of *time* in streaming programs.
     respective operation.
 
     When a streaming program runs on processing time, all time-based operations (like time windows) will
-    use the system clock of the machines that run the respective operator. For example, an hourly
+    use the system clock of the machines that run the respective operator. An hourly
     processing time window will include all records that arrived at a specific operator between the
-    times when the system clock indicated the full hour.
+    times when the system clock indicated the full hour. For example, if an application
+    begins running at 9:15am, the first hourly processing time window will include events
+    processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
 
     Processing time is the simplest notion of time and requires no coordination between streams and machines.
     It provides the best performance and the lowest latency. However, in distributed and asynchronous
     environments processing time does not provide determinism, because it is susceptible to the speed at which
-    records arrive in the system (for example from the message queue), and to the speed at which the
-    records flow between operators inside the system.
+    records arrive in the system (for example from the message queue), to the speed at which the
+    records flow between operators inside the system, and to outages (scheduled, or otherwise).
 
 - **Event time:** Event time is the time that each individual event occurred on its producing device.
-    This time is typically embedded within the records before they enter Flink and that *event timestamp*
-    can be extracted from the record. An hourly event time window will contain all records that carry an
-    event timestamp that falls into that hour, regardless of when the records arrive, and in what order
-    they arrive.
-
-    Event time gives correct results even on out-of-order events, late events, or on replays
-    of data from backups or persistent logs. In event time, the progress of time depends on the data,
+    This time is typically embedded within the records before they enter Flink, and that *event timestamp*
+    can be extracted from each record. In event time, the progress of time depends on the data,
     not on any wall clocks. Event time programs must specify how to generate *Event Time Watermarks*,
-    which is the mechanism that signals progress in event time. The mechanism is
-    described below.
+    which is the mechanism that signals progress in event time. This watermarking mechanism is
+    described in a later section, [below](#event-time-and-watermarks).
+
+    In a perfect world, event time processing would yield completely consistent and deterministic results, regardless of when events arrive, or their ordering.
+    However, unless the events are known to arrive in-order (by timestamp), event time processing incurs some latency while waiting for out-of-order events. As it is only possible to wait for a finite period of time, this places a limit on how deterministic event time applications can be.
+
+    Assuming all of the data has arrived, event time operations will behave as expected, and produce correct and consistent results even when working with out-of-order or late events, or when reprocessing historic data. For example, an hourly event time window will contain all records
+    that carry an event timestamp that falls into that hour, regardless of the order in which they arrive, or when they are processed. (See the section on [late events](#late-elements) for more information.)
 
-    Event time processing often incurs a certain latency, due to its nature of waiting a certain time for
-    late events and out-of-order events. Because of that, event time programs are often combined with
-    *processing time* operations.
+
+
+    Note that sometimes when event time programs are processing live data in real-time, they will use some *processing time* operations in order to guarantee that they are progressing in a timely fashion.
 
 - **Ingestion time:** Ingestion time is the time that events enter Flink. At the source operator each
     record gets the source's current time as a timestamp, and time-based operations (like time windows)
@@ -194,6 +197,8 @@ The figure below shows an example of events and watermarks flowing through paral
 
 <img src="{{ site.baseurl }}/fig/parallel_streams_watermarks.svg" alt="Parallel data streams and operators with events and watermarks" class="center" width="80%" />
 
+Note that the Kafka source supports per-partition watermarking, which you can read more about [here]({{ site.baseurl }}/dev/event_timestamps_watermarks.html#timestamps-per-kafka-partition).
+
 
 ## Late Elements
 
@@ -201,7 +206,7 @@ It is possible that certain elements will violate the watermark condition, meani
 more elements with timestamp *t' <= t* will occur. In fact, in many real world setups, certain elements can be arbitrarily
 delayed, making it impossible to specify a time by which all elements of a certain event timestamp will have occurred.
 Furthermore, even if the lateness can be bounded, delaying the watermarks by too much is often not desirable, because it
-causes too much delay in the evaluation of the event time windows.
+causes too much delay in the evaluation of event time windows.
 
 For this reason, streaming programs may explicitly expect some *late* elements. Late elements are elements that
 arrive after the system's event time clock (as signaled by the watermarks) has already passed the time of the late element's

http://git-wip-us.apache.org/repos/asf/flink/blob/90855b63/docs/dev/event_timestamps_watermarks.md
----------------------------------------------------------------------
diff --git a/docs/dev/event_timestamps_watermarks.md b/docs/dev/event_timestamps_watermarks.md
index 44ecf54..79ac953 100644
--- a/docs/dev/event_timestamps_watermarks.md
+++ b/docs/dev/event_timestamps_watermarks.md
@@ -58,14 +58,14 @@ progress in event time.
 There are two ways to assign timestamps and generate watermarks:
 
   1. Directly in the data stream source
-  2. Via a timestamp assigner / watermark generator: in Flink timestamp assigners also define the watermarks to be emitted
+  2. Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted
 
 <span class="label label-danger">Attention</span> Both timestamps and watermarks are specified as
 milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
 
 ### Source Functions with Timestamps and Watermarks
 
-Stream sources can also directly assign timestamps to the elements they produce, and they can also emit watermarks.
+Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks.
 When this is done, no timestamp assigner is needed.
 Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.
 
@@ -180,7 +180,7 @@ The interval (every *n* milliseconds) in which the watermark will be generated i
 called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous
 watermark.
 
-Two simple examples of timestamp assigners with periodic watermark generation are below.
+Here we show two simple examples of timestamp assigners that use periodic watermark generation. Note that Flink ships with a `BoundedOutOfOrdernessTimestampExtractor` similar to the `BoundedOutOfOrdernessGenerator` shown below, which you can read about [here]({{ site.baseurl }}/dev/event_timestamp_extractors.html#assigners-allowing-a-fixed-amount-of-lateness).
 
 <div class="codetabs" markdown="1">
 <div data-lang="java" markdown="1">