You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by da...@apache.org on 2016/06/14 20:34:41 UTC

[1/2] incubator-beam-site git commit: Add Flink Batch Runner Blog Post

Repository: incubator-beam-site
Updated Branches:
  refs/heads/asf-site 0c5c6471b -> 4f3d0f9fc


Add Flink Batch Runner Blog Post


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/3e35c5fb
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/3e35c5fb
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/3e35c5fb

Branch: refs/heads/asf-site
Commit: 3e35c5fbee413b5ea0dfc5885400b5b8b8d2aef6
Parents: 0c5c647
Author: Aljoscha Krettek <al...@gmail.com>
Authored: Wed Jun 1 17:57:45 2016 +0200
Committer: Davor Bonaci <da...@google.com>
Committed: Tue Jun 14 13:34:12 2016 -0700

----------------------------------------------------------------------
 _data/authors.yml                               |  4 +++
 .../2016-06-15-flink-batch-runner-milestone.md  | 32 ++++++++++++++++++++
 2 files changed, 36 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/3e35c5fb/_data/authors.yml
----------------------------------------------------------------------
diff --git a/_data/authors.yml b/_data/authors.yml
index 930a514..034a2ee 100644
--- a/_data/authors.yml
+++ b/_data/authors.yml
@@ -18,3 +18,7 @@ dhalperi:
     name: Dan Halperin
     email: dhalperi@apache.org
     twitter:
+aljoscha:
+    name: Aljoscha Krettek
+    email: aljoscha@apache.org
+    twitter: aljoscha

http://git-wip-us.apache.org/repos/asf/incubator-beam-site/blob/3e35c5fb/_posts/2016-06-15-flink-batch-runner-milestone.md
----------------------------------------------------------------------
diff --git a/_posts/2016-06-15-flink-batch-runner-milestone.md b/_posts/2016-06-15-flink-batch-runner-milestone.md
new file mode 100644
index 0000000..5b98d31
--- /dev/null
+++ b/_posts/2016-06-15-flink-batch-runner-milestone.md
@@ -0,0 +1,32 @@
+---
+layout: post
+title:  "How We Added Windowing to the Apache Flink Batch Runner"
+date:   2016-06-15 09:00:00 -0700
+excerpt_separator: <!--more-->
+categories: blog
+authors:
+  - aljoscha
+---
+We recently achieved a major milestone by adding support for windowing to the [Apache Flink](http://flink.apache.org) Batch runner. In this post we would like to explain what this means for users of Apache Beam and highlight some of the implementation details.
+
+<!--more-->
+
+Before we start, though, let\u2019s quickly talk about the execution of Beam programs and how this is relevant to today\u2019s post. A Beam pipeline can contain bounded and unbounded sources. If the pipeline only contains bounded sources it can be executed in a batch fashion, if it contains some unbounded sources it must be executed in a streaming fashion. When executing a Beam pipeline on Flink, you don\u2019t have to choose the execution mode. Internally, the Flink runner either translates the pipeline to a Flink `DataSet` program or a `DataStream` program, depending on whether unbounded sources are used in the pipeline. In the following, when we say \u201cBatch runner\u201d what we are really talking about is the Flink runner being in batch execution mode.
+
+## What does this mean for users?
+
+Support for windowing was the last missing puzzle piece for making the Flink Batch runner compatible with the Beam model. With the latest change to the Batch runner users can now run any pipeline that only contains bounded sources and be certain that the results match those of the original reference-implementation runners that were provided by Google as part of the initial code drop coming from the Google Dataflow SDK.
+
+The most obvious part of the change is that windows can now be assigned to elements and that the runner respects these windows for the `GroupByKey` and `Combine` operations. A not-so-obvious change concerns side-inputs. In the Beam model, side inputs respect windows; when a value of the main input is being processed only the side input that corresponds to the correct window is available to the processing function, the `DoFn`.
+
+Getting side-input semantics right is an important milestone in it\u2019s own because it allows to use a big suite of unit tests for verifying the correctness of a runner implementation. These tests exercise every obscure detail of the Beam programming model and verify that the results produced by a runner match what you would expect from a correct implementation. In the suite, side inputs are used to compare the expected result to the actual result. With these tests being executed regularly we can now be more confident that the implementation produces correct results for user-specified pipelines.
+
+## Under the Hood
+The basis for the changes is the introduction of `WindowedValue` in the generated Flink transformations. Before, a Beam `PCollection<T>` would be transformed to a `DataSet<T>`. Now, we instead create a `DataSet<WindowedValue<T>>`. The `WindowedValue<T>` stores meta data about the value, such as the timestamp and the windows to which it was assigned.
+
+With this basic change out of the way we just had to make sure that windows were respected for side inputs and that `Combine` and `GroupByKey` correctly handled windows. The tricky part there is the handling of merging windows such as session windows. For these we essentially emulate the behavior of a merging `WindowFn` in our own code.
+
+After we got side inputs working we could enable the aforementioned suite of tests to check how well the runner behaves with respect to the Beam model. As can be expected there were quite some discrepancies but we managed to resolve them all. In the process, we also slimmed down the runner implementation. For example, we removed all custom translations for sources and sinks and are now relying only on Beam code for these, thereby greatly reducing the maintenance overhead.
+
+## Summary
+We reached a major milestone in adding windowing support to the Flink Batch runner, thereby making it compatible with the Beam model. Because of the large suite of tests that can now be executed on the runner we are also confident about the correctness of the implementation and about it staying that way in the future.


[2/2] incubator-beam-site git commit: This closes #20

Posted by da...@apache.org.
This closes #20


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/commit/4f3d0f9f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/tree/4f3d0f9f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam-site/diff/4f3d0f9f

Branch: refs/heads/asf-site
Commit: 4f3d0f9fccac1b325f4fb89d2eca79de68629ff0
Parents: 0c5c647 3e35c5f
Author: Davor Bonaci <da...@google.com>
Authored: Tue Jun 14 13:34:27 2016 -0700
Committer: Davor Bonaci <da...@google.com>
Committed: Tue Jun 14 13:34:27 2016 -0700

----------------------------------------------------------------------
 _data/authors.yml                               |  4 +++
 .../2016-06-15-flink-batch-runner-milestone.md  | 32 ++++++++++++++++++++
 2 files changed, 36 insertions(+)
----------------------------------------------------------------------