You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Bill Neubauer <wc...@google.com.INVALID> on 2016/06/21 14:36:41 UTC

Fn API - Overview and Motivation

Hi! I’m Bill. I’m working with Lukasz on defining the Fn API, which will
provide guidance on the boundary between runner-specific code, and
constructs the SDK implements.
Here's our high level framing of the problem, and the type of problems we
want to address.

Concept:

Decouple runner developers from SDK authors by having a clean separation
between the execution of user definable functions (DoFns and others)
marshaled by SDK, and the responsibilities of the runner in providing a
distributed execution framework for the parallel execution of those
functions.

The vision that compels us is providing a minimal standard that enables
SDKs in all languages, and provide non-binding implementation guidance for
runner authors to help provide additional seams between the SDK and runner,
trading off increased implementation complexity for better performance
opportunities.

The concept characterizes the interface in 3 dimensions: the data format
describing function invocations and their results and side effects, which
bridges the runner and SDK; the transport of bytes that moves this data
between runner and SDK, and a specification of the execution environment in
which the SDK code runs.

Data Format: A UDF is called with arguments containing the data of its
input arguments. The UDF returns the data that is associated with the named
outputs, and additional data for counter values and TBD.

Transport: As the SDK implementation language may differ from the runner’s
language, there are different options for moving the function invocation
data between the two. SWIG is a common technique for cross-language
binding. We are investigating using GRPC as a transport, since servers and
clients exist for it in many languages, providing a useful set of potential
targets for SDK development that would not be reachable via SWIG.

Execution environment: While less of a concern for certain environments,
other runners execute in dynamically instantiated environments that may
require configuration/provisioning to support the execution of arbitrary
user code. Examples include Python code expecting dependencies to be
installed, user code expecting files staged on the local filesystem, and so
on. We are exploring using Docker containers as a mechanism for specifying
these execution environments. They are advantageous in that they solve the
deployment issues described above, and are admissible to the various
implementations of the Beam model today.

I’ll be sharing my thinking as it moves forward, but wanted to touch base
and invite comments and thoughts.

Re: Fn API - Overview and Motivation

Posted by Ismaël Mejía <ie...@gmail.com>.

Hello,

I have been intrigued by the Fn API since the first time I heard about it,
really interesting indeed, maybe you can put your ideas in a google doc, so
it can be part of the Beam Technical Docs dir (and commented by others).

Ismaël


On Tue, Jun 21, 2016 at 4:36 PM, Bill Neubauer <wc...@google.com.invalid>
wrote:

> Hi! I’m Bill. I’m working with Lukasz on defining the Fn API, which will
> provide guidance on the boundary between runner-specific code, and
> constructs the SDK implements.
> Here's our high level framing of the problem, and the type of problems we
> want to address.
>
> Concept:
>
> Decouple runner developers from SDK authors by having a clean separation
> between the execution of user definable functions (DoFns and others)
> marshaled by SDK, and the responsibilities of the runner in providing a
> distributed execution framework for the parallel execution of those
> functions.
>
> The vision that compels us is providing a minimal standard that enables
> SDKs in all languages, and provide non-binding implementation guidance for
> runner authors to help provide additional seams between the SDK and runner,
> trading off increased implementation complexity for better performance
> opportunities.
>
> The concept characterizes the interface in 3 dimensions: the data format
> describing function invocations and their results and side effects, which
> bridges the runner and SDK; the transport of bytes that moves this data
> between runner and SDK, and a specification of the execution environment in
> which the SDK code runs.
>
> Data Format: A UDF is called with arguments containing the data of its
> input arguments. The UDF returns the data that is associated with the named
> outputs, and additional data for counter values and TBD.
>
> Transport: As the SDK implementation language may differ from the runner’s
> language, there are different options for moving the function invocation
> data between the two. SWIG is a common technique for cross-language
> binding. We are investigating using GRPC as a transport, since servers and
> clients exist for it in many languages, providing a useful set of potential
> targets for SDK development that would not be reachable via SWIG.
>
> Execution environment: While less of a concern for certain environments,
> other runners execute in dynamically instantiated environments that may
> require configuration/provisioning to support the execution of arbitrary
> user code. Examples include Python code expecting dependencies to be
> installed, user code expecting files staged on the local filesystem, and so
> on. We are exploring using Docker containers as a mechanism for specifying
> these execution environments. They are advantageous in that they solve the
> deployment issues described above, and are admissible to the various
> implementations of the Beam model today.
>
> I’ll be sharing my thinking as it moves forward, but wanted to touch base
> and invite comments and thoughts.
>