Posted to dev@beam.apache.org by Charles Chen <cc...@google.com> on 2018/02/02 01:21:11 UTC

Replacing Python DirectRunner apply_* hooks with PTransformOverrides

In the Python DirectRunner, we currently use apply_* hooks to override the
default .expand() implementation of certain transforms. For example,
GroupByKey has a special implementation in the DirectRunner, so we use an
apply_* hook to replace the implementation of GroupByKey.expand().

However, this strategy has drawbacks. Because this override operation
happens eagerly during graph construction, the pipeline graph is
specialized and modified before a specific runner is bound to the
pipeline's execution. This makes the pipeline graph non-portable and blocks
full migration to using the Runner API pipeline representation in the
DirectRunner.

By contrast, the SDK's PTransformOverride mechanism allows the expression
of matchers that operate on the unspecialized graph, replacing PTransforms
as necessary to produce a DirectRunner-specialized pipeline graph for
execution.

I therefore propose to replace these eager apply_* overrides with
PTransformOverrides that operate on the completely constructed graph.
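
As a rough illustration of the idea, the sketch below models a
post-construction override pass on a toy graph. Node, Override,
DirectGroupByKeyOverride, and apply_overrides are hypothetical names made up
for this sketch. They are not the Beam SDK's PTransformOverride API and not
the contents of the candidate patch.

```python
# Toy model only: every class and function here is hypothetical, not Beam's API.
from dataclasses import dataclass, field


@dataclass
class Node:
    """One applied transform in a toy pipeline graph."""
    label: str
    kind: str  # e.g. "Read" or "GroupByKey"
    inputs: list = field(default_factory=list)


class Override:
    """A matcher plus a replacement, applied after graph construction."""

    def matches(self, node):
        raise NotImplementedError

    def replacement_kind(self, node):
        raise NotImplementedError


class DirectGroupByKeyOverride(Override):
    """Swaps the generic GroupByKey for a runner-specific stand-in."""

    def matches(self, node):
        return node.kind == "GroupByKey"

    def replacement_kind(self, node):
        return "DirectRunnerGroupByKey"


def apply_overrides(graph, overrides):
    """Return a runner-specialized copy; the input graph is left untouched."""
    specialized = []
    for node in graph:
        kind = node.kind
        for override in overrides:
            if override.matches(node):
                kind = override.replacement_kind(node)
                break  # first matching override wins in this sketch
        specialized.append(Node(node.label, kind, list(node.inputs)))
    return specialized


# The graph is built once, with no runner-specific knowledge.
portable_graph = [
    Node("read", "Read"),
    Node("group", "GroupByKey", inputs=["read"]),
]

# Specialization happens later, only when a runner is chosen.
direct_graph = apply_overrides(portable_graph, [DirectGroupByKeyOverride()])
print([n.kind for n in portable_graph])
print([n.kind for n in direct_graph])
```

The point of the sketch is that portable_graph stays unspecialized, so the
same constructed graph could be handed to a different runner with a different
list of overrides.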

The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and I've
prepared a candidate patch at
https://github.com/apache/incubator-beam/pull/4529.

Best,
Charles

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

Posted by Ahmet Altay <al...@google.com>.

+1 to this change.

Thank you, Charles, for improving the DirectRunner, sharing your progress,
and seeking feedback. This change would allow us to migrate to a faster
DirectRunner for Python, a long-requested feature and an important part of
the first-use experience for new users trying out Beam.

Ahmet

On Fri, Feb 2, 2018 at 11:00 AM, Charles Chen <cc...@google.com> wrote:

> Thanks, Kenn. We already do the Runner API round-tripping (I believe Robert
> implemented this). With this change, we would start doing exactly what
> you're suggesting: applying overrides to a post-deserialization pipeline.

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

Posted by Kenneth Knowles <kl...@google.com>.

Awesome, nice!

On Fri, Feb 2, 2018 at 11:00 AM, Charles Chen <cc...@google.com> wrote:

> Thanks, Kenn. We already do the Runner API round-tripping (I believe Robert
> implemented this). With this change, we would start doing exactly what
> you're suggesting: applying overrides to a post-deserialization pipeline.

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

Posted by Charles Chen <cc...@google.com>.

Thanks, Kenn. We already do the Runner API round-tripping (I believe Robert
implemented this). With this change, we would start doing exactly what
you're suggesting: applying overrides to a post-deserialization pipeline.

On Thu, Feb 1, 2018 at 6:45 PM Kenneth Knowles <kl...@google.com> wrote:

> +1 for removing apply_*
>
> For the Java SDK, removing specialized intercepts was an important first
> step towards the portability framework. I wonder if there is a way for the
> Python SDK to leapfrog, taking advantage of some of the lessons that Java
> learned a bit more painfully. Most pertinent, I think, is that if an SDK's
> role is to construct a pipeline and ship the proto to a runner (service),
> then overrides apply to a post-deserialization pipeline. The Java
> DirectRunner does a proto round-trip to avoid accidentally depending on
> things that are not really part of the pipeline. I would think this crisp
> abstraction enforcement would add even more value to Python.
>
> Kenn

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

Posted by Kenneth Knowles <kl...@google.com>.

+1 for removing apply_*

For the Java SDK, removing specialized intercepts was an important first
step towards the portability framework. I wonder if there is a way for the
Python SDK to leapfrog, taking advantage of some of the lessons that Java
learned a bit more painfully. Most pertinent, I think, is that if an SDK's
role is to construct a pipeline and ship the proto to a runner (service),
then overrides apply to a post-deserialization pipeline. The Java
DirectRunner does a proto round-trip to avoid accidentally depending on
things that are not really part of the pipeline. I would think this crisp
abstraction enforcement would add even more value to Python.
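
To make the round-trip point concrete, here is a minimal sketch that uses
JSON as a stand-in for the pipeline proto. Transform, serialize, deserialize,
and specialize are hypothetical helpers invented for this illustration. None
of them are part of Beam or the Runner API.

```python
# Toy model only: JSON stands in for the pipeline proto; names are hypothetical.
import json


class Transform:
    """An SDK-side object that carries state the wire format never sees."""

    def __init__(self, label, kind):
        self.label = label
        self.kind = kind
        self.sdk_only_state = object()  # in-memory detail, never serialized


def serialize(transforms):
    """Keep only the fields that belong to the portable representation."""
    return json.dumps([{"label": t.label, "kind": t.kind} for t in transforms])


def deserialize(payload):
    """Rebuild plain records; nothing outside the payload can leak through."""
    return [dict(entry) for entry in json.loads(payload)]


def specialize(entries, replacements):
    """Apply runner overrides to the post-deserialization records."""
    return [dict(e, kind=replacements.get(e["kind"], e["kind"])) for e in entries]


constructed = [Transform("read", "Read"), Transform("group", "GroupByKey")]
round_tripped = deserialize(serialize(constructed))
specialized = specialize(round_tripped, {"GroupByKey": "DirectRunnerGroupByKey"})

print(round_tripped)
print(specialized)
```

Because the overrides only ever see round_tripped, anything that exists
solely on the in-memory SDK objects (sdk_only_state here) cannot be relied on
by the runner, even by accident.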

Kenn
