You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by me...@apache.org on 2017/11/06 18:42:18 UTC

[beam-site] 01/02: Add page for the portability framework

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 147e38088f424ef7e69f95d421d16a1a490f2203
Author: Henning Rohde <he...@google.com>
AuthorDate: Wed Nov 1 10:02:45 2017 -0700

    Add page for the portability framework
---
 src/contribute/portability.md | 167 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)

diff --git a/src/contribute/portability.md b/src/contribute/portability.md
new file mode 100644
index 0000000..f8e00fd
--- /dev/null
+++ b/src/contribute/portability.md
@@ -0,0 +1,167 @@
+---
+layout: section
+title: "Portability Framework"
+permalink: /contribute/portability/
+section_menu: section-menu/contribute.html
+redirect_from: /portability/
+---
+
+# Portability Framework
+
+* TOC
+{:toc}
+
+## Overview
+
+Interoperability between SDKs and runners is a key aspect of Apache
+Beam. So far, however, the reality is that most runners support the
+Java SDK only, because each SDK-runner combination requires non-trivial
+work on both sides. All runners are also currently written in Java,
+which makes support of non-Java SDKs far more expensive. The
+_portability framework_ aims to rectify this situation and provide
+full interoperability across the Beam ecosystem.
+
+The portability framework introduces well-defined, language-neutral
+data structures and protocols between the SDK and runner. This interop
+layer -- called the _portability API_ -- ensures that SDKs and runners
+can work with each other uniformly, reducing the interoperability
+burden for both SDKs and runners to a constant effort.  It notably
+ensures that _new_ SDKs automatically work with existing runners and
+vice versa.  The framework introduces a new runner, the _Universal
+Local Runner (ULR)_, as a practical reference implementation that
+complements the direct runners. Finally, it enables cross-language
+pipelines (sharing I/O or transformations across SDKs) and
+user-customized execution environments ("custom containers").
+
+The portability API consists of a set of smaller contracts that
+isolate SDKs and runners for job submission, management and
+execution. These contracts use protobufs and gRPC for broad language
+support.
+
+ * **Job submission and management**: The _Runner API_ defines a
+   language-neutral pipeline representation with transformations
+   specifying the execution environment as a docker container
+   image. The latter both allows the execution side to set up the
+   right environment as well as opens the door for custom containers
+   and cross-environment pipelines. The _Job API_ allows pipeline
+   execution and configuration to be managed uniformly.
+
+ * **Job execution**: The _SDK harness_ is a SDK-provided
+   program responsible for executing user code and is run separately
+   from the runner.  The _Fn API_ defines an execution-time binary
+   contract between the SDK harness and the runner that describes how
+   execution tasks are managed and how data is transferred. In
+   addition, the runner needs to handle progress and monitoring in an
+   efficient and language-neutral way. SDK harness initialization
+   relies on the _Provision_ and _Artifact APIs_ for obtaining staged
+   files, pipeline options and environment information. Docker
+   provides isolation between the runner and SDK/user environments to
+   the benefit of both as defined by the _container contract_. The
+   containerization of the SDK gives it (and the user, unless the SDK
+   is closed) full control over its own environment without risk of
+   dependency conflicts. The runner has significant freedom regarding
+   how it manages the SDK harness containers.
+
+The goal is that all (non-direct) runners and SDKs eventually support
+the portability API, perhaps exclusively.
+
+## Design
+
+The [model protos](https://github.com/apache/beam/tree/master/model)
+contain all aspects of the portability API and is the truth on the
+ground. The proto definitions supercede any design documents. The main
+design documents are the following:
+
+ * [Runner API](https://s.apache.org/beam-runner-api). Pipeline
+   representation and discussion on primitive/composite transforms and
+   optimizations.
+   
+ * [Job API](https://s.apache.org/beam-job-api). Job submission and
+   management protocol.
+
+ * [Fn API](https://s.apache.org/beam-fn-api). Execution-side control
+   and data protocols and overview.
+
+ * [Container
+   contract](https://s.apache.org/beam-fn-api-container-contract).
+   Execution-side docker container invocation and provisioning
+   protocols. See
+   [CONTAINERS.md](https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md)
+   for how to build container images.
+
+In discussion:
+
+ * [Cross
+   language](https://s.apache.org/beam-mixed-language-pipelines). Options
+   and tradeoffs for how to handle various kinds of
+   multi-language/multi-SDK pipelines.
+
+## Development
+
+The portability framework is a substantial effort that touches every
+Beam component. In addition to the sheer magnitude, a major challenge
+is engineering an interop layer that does not significantly compromise
+performance due to the additional serialization overhead of a
+language-neutral protocol.
+
+### Roadmap
+
+The proposed project phases are roughly as follows and are not
+strictly sequential, as various components will likely move at
+different speeds. Additionally, there have been (and continues to be)
+supporting refactorings that are not always tracked as part of the
+portability effort. Work already done is not tracked here either.
+
+ * **P1 [MVP]**: Implement the fundamental plumbing for portable SDKs
+   and runners for batch and streaming, including containers and the
+   ULR
+   [[BEAM-2899](https://issues.apache.org/jira/browse/BEAM-2899)]. Each
+   SDK and runner should use the portability framework at least to the
+   extent that wordcount
+   [[BEAM-2896](https://issues.apache.org/jira/browse/BEAM-2896)] and
+   windowed wordcount
+   [[BEAM-2941](https://issues.apache.org/jira/browse/BEAM-2941)] run
+   portably.
+    
+ * **P2 [Feature complete]**: Design and implement portability support
+   for remaining execution-side features, so that any pipeline from
+   any SDK can run portably on any runner. These features include side
+   inputs
+   [[BEAM-2863](https://issues.apache.org/jira/browse/BEAM-2863)], User
+   timers
+   [[BEAM-2925](https://issues.apache.org/jira/browse/BEAM-2925)],
+   Splittable DoFn
+   [[BEAM-2896](https://issues.apache.org/jira/browse/BEAM-2896)] and
+   more.  Each SDK and runner should use the portability framework at
+   least to the extent that the mobile gaming examples
+   [[BEAM-2940](https://issues.apache.org/jira/browse/BEAM-2940)] run
+   portably.
+    
+ * **P3 [Performance]**: Measure and tune performance of portable
+   pipelines using benchmarks such as Nexmark. Features such as
+   progress reporting
+   [[BEAM-2940](https://issues.apache.org/jira/browse/BEAM-2940)],
+   combiner lifting
+   [[BEAM-2937](https://issues.apache.org/jira/browse/BEAM-2937)] and
+   fusion are expected to be needed.
+
+ * **P4 [Cross language]**: Design and implement cross-language
+   pipeline support, including how the ecosystem of shared transforms
+   should work.
+
+### Issues
+
+The portability effort touches every component, so the "portability"
+label is used to identify all portability-related issues. Pure
+design or proto definitions should use the "beam-model" component. A
+common pattern for new portability features is that the overall
+feature is in "beam-model" with subtasks for each SDK and runner in
+their respective components.
+
+**JIRA:** [query](https://issues.apache.org/jira/issues/?filter=12341256)
+
+### Status
+
+MVP in progress. No SDK or runner supports the full portability API
+yet, but once that happens a more detailed progress table will be
+added here.

-- 
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" <co...@beam.apache.org>.