You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Daniel Oliveira <da...@google.com> on 2020/01/11 00:54:55 UTC

Go SplittableDoFn prototype and proposed changes

Hey Beam devs,

So several months ago I posted my Go SDF proposal and got a lot of good
feedback (thread
<https://lists.apache.org/thread.html/327bc72a0b30e18c6152b562bac2613c0edc942465d67b215830819e%40%3Cdev.beam.apache.org%3E>,
doc <https://s.apache.org/beam-go-sdf>). Since then I've been working on
implementing it and I've got an initial prototype ready to show off! It
works with initial splitting on Flink, and has a decently documented API.
Also in the second part of the email I'll also be proposing changes to the
original doc, based on my experience working on this prototype.

To be clear, this is *not* ready to officially go into Beam yet; the API is
still likely to go through changes. Rather, I'm showing this off to show
that progress is being made on SDFs, and to provide some context to the
changes I'll be proposing below.

Here's a link to the repo and branch so you can download it, and a link to
the changes specifically:
Repo: https://github.com/youngoli/beam/tree/gosdf
Changes:
https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831

If you give it a try and have any thoughts, please let me know! I'm open to
any and all feedback.

==================================

Proposed Changes
Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
history.)

For anyone reading this who hasn't already read the doc above, I suggest
reading it first, since I'll be referring to concepts from it.

After working on the prototype I've changed my mind on the original
decisions to go with an interface approach and a combined restriction +
tracker. But I don't want to go all in and create another doc with a
detailed proposal, so I've laid out a brief summary of the changes to get
some initial feedback before I go ahead and start working on these changes
in detail. Please let me know what you think!

*1. Change from native Go interfaces to dynamic reflection-based API.*

Instead of the native Go interfaces (SplittableDoFn, RProvider, and
RTracker) described in the doc and implemented in the prototype, use the
same dynamic approach that the Go SDK already uses for DoFns: Use the
reflection system to examine the names and signatures of methods in the
user's DoFn, RProvider, and RTracker.

Original approach reasoning:

   - Simpler, so faster to implement and less bug-prone.
   - The extra burden on the user to keep types consistent is ok since most
   users of SDFs are more advanced

Change reasoning:

   - In the prototype, I found interfaces to require too much extra
   boilerplate which added more complexity than expected. (Examples: Constant
   casting,
   - More consistent API: Inconsistency between regular DoFns (dynamic) and
   SDF API (interfaces) was jarring and unintuitive when implementing SDFs as
   a user.

Implementation: Full details are up for discussion, but the goal is to make
the RProvider and  RTracker interfaces dynamic, so we can replace all
instances of interface{} in the methods with the actual element types (i.e.
fake generics). Also uses of the RProvider and RTracker interfaces in
signatures can be replaced with the implementations of those
providers/trackers. This will require a good amount of additional work in
the DoFn validation codebase and the code generator. Plus a fair amount of
additional user code validation will be needed and more testing since the
new code is more complex.

*2. Seperate the restriction tracker and restriction.*

Currently the API has the restriction combined with the tracker. In most
other SDKs and within the SDF model, the two are usually separate concepts,
and this change is to follow that approach and split the two.

Original approach reasoning:

   - It was considered simpler to avoid another level of type casting in
   the API with the interface approach.

Change reasoning:

   - We are no longer going with the interface approach. With "fake
   generics", it is simpler to keep the two concepts separate.
   - Requiring users to specify custom coders in order to only encode the
   restriction and not the tracker ended up adding additional complexity
   anyway.

Implementation: In the API have the restriction tracker initialized with a
restriction object accessible via a getter. The restriction itself will be
the only thing serialized, so it will be wrapped and unwrapped with the
tracker before the user code is invoked. This wouldn't add very little work
as it would mostly be bundled with the interface->dynamic approach change.


Thanks,
Daniel Oliveira

Re: Go SplittableDoFn prototype and proposed changes

Posted by Daniel Oliveira <da...@google.com>.
As a follow-up to the proposed changes from my first email, I've worked on
a doc with a more detailed changelist, including details still up for
discussion:
https://docs.google.com/document/d/1UeG5uNO00xCByGEZzDXk0m0LghX6HBWlMfRbMv_Xiyc/edit?usp=sharing

The doc is mostly full of my brainstorming on what the next version of the
user-facing Go SDF API will look like, so it's not too polished. But if
anyone's interested in this, I welcome any and all feedback!

On Mon, Jan 13, 2020 at 2:22 PM Luke Cwik <lc...@google.com> wrote:

> Thanks for the update and I agree with the points that you have made.
>
> On Fri, Jan 10, 2020 at 5:58 PM Robert Burke <ro...@frantil.com> wrote:
>
>> Thank you for sharing Daniel!
>>
>> Resolving SplittableDoFns for the Go SDK even just as far as initial
>> splitting will take the SDK that much closer to exiting its experimental
>> status.
>>
>> It's especially exciting seeing this work on Flink and on the Python
>> direct runner!
>>
>> On Fri, Jan 10, 2020, 5:36 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> Hey Beam devs,
>>>
>>> So several months ago I posted my Go SDF proposal and got a lot of good
>>> feedback (thread
>>> <https://lists.apache.org/thread.html/327bc72a0b30e18c6152b562bac2613c0edc942465d67b215830819e%40%3Cdev.beam.apache.org%3E>,
>>> doc <https://s.apache.org/beam-go-sdf>). Since then I've been working
>>> on implementing it and I've got an initial prototype ready to show off! It
>>> works with initial splitting on Flink, and has a decently documented API.
>>> Also in the second part of the email I'll also be proposing changes to the
>>> original doc, based on my experience working on this prototype.
>>>
>>> To be clear, this is *not* ready to officially go into Beam yet; the
>>> API is still likely to go through changes. Rather, I'm showing this off to
>>> show that progress is being made on SDFs, and to provide some context to
>>> the changes I'll be proposing below.
>>>
>>> Here's a link to the repo and branch so you can download it, and a link
>>> to the changes specifically:
>>> Repo: https://github.com/youngoli/beam/tree/gosdf
>>> Changes:
>>> https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831
>>>
>>> If you give it a try and have any thoughts, please let me know! I'm open
>>> to any and all feedback.
>>>
>>> ==================================
>>>
>>> Proposed Changes
>>> Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
>>> history.)
>>>
>>> For anyone reading this who hasn't already read the doc above, I suggest
>>> reading it first, since I'll be referring to concepts from it.
>>>
>>> After working on the prototype I've changed my mind on the original
>>> decisions to go with an interface approach and a combined restriction +
>>> tracker. But I don't want to go all in and create another doc with a
>>> detailed proposal, so I've laid out a brief summary of the changes to get
>>> some initial feedback before I go ahead and start working on these changes
>>> in detail. Please let me know what you think!
>>>
>>> *1. Change from native Go interfaces to dynamic reflection-based API.*
>>>
>>> Instead of the native Go interfaces (SplittableDoFn, RProvider, and
>>> RTracker) described in the doc and implemented in the prototype, use the
>>> same dynamic approach that the Go SDK already uses for DoFns: Use the
>>> reflection system to examine the names and signatures of methods in the
>>> user's DoFn, RProvider, and RTracker.
>>>
>>> Original approach reasoning:
>>>
>>>    - Simpler, so faster to implement and less bug-prone.
>>>    - The extra burden on the user to keep types consistent is ok since
>>>    most users of SDFs are more advanced
>>>
>>> Change reasoning:
>>>
>>>    - In the prototype, I found interfaces to require too much extra
>>>    boilerplate which added more complexity than expected. (Examples: Constant
>>>    casting,
>>>    - More consistent API: Inconsistency between regular DoFns (dynamic)
>>>    and SDF API (interfaces) was jarring and unintuitive when implementing SDFs
>>>    as a user.
>>>
>>> Implementation: Full details are up for discussion, but the goal is to
>>> make the RProvider and  RTracker interfaces dynamic, so we can replace all
>>> instances of interface{} in the methods with the actual element types
>>> (i.e. fake generics). Also uses of the RProvider and RTracker interfaces in
>>> signatures can be replaced with the implementations of those
>>> providers/trackers. This will require a good amount of additional work in
>>> the DoFn validation codebase and the code generator. Plus a fair amount of
>>> additional user code validation will be needed and more testing since the
>>> new code is more complex.
>>>
>>> *2. Seperate the restriction tracker and restriction.*
>>>
>>> Currently the API has the restriction combined with the tracker. In most
>>> other SDKs and within the SDF model, the two are usually separate concepts,
>>> and this change is to follow that approach and split the two.
>>>
>>> Original approach reasoning:
>>>
>>>    - It was considered simpler to avoid another level of type casting
>>>    in the API with the interface approach.
>>>
>>> Change reasoning:
>>>
>>>    - We are no longer going with the interface approach. With "fake
>>>    generics", it is simpler to keep the two concepts separate.
>>>    - Requiring users to specify custom coders in order to only encode
>>>    the restriction and not the tracker ended up adding additional complexity
>>>    anyway.
>>>
>>> Implementation: In the API have the restriction tracker initialized with
>>> a restriction object accessible via a getter. The restriction itself will
>>> be the only thing serialized, so it will be wrapped and unwrapped with the
>>> tracker before the user code is invoked. This wouldn't add very little work
>>> as it would mostly be bundled with the interface->dynamic approach change.
>>>
>>>
>>> Thanks,
>>> Daniel Oliveira
>>>
>>

Re: Go SplittableDoFn prototype and proposed changes

Posted by Luke Cwik <lc...@google.com>.
Thanks for the update and I agree with the points that you have made.

On Fri, Jan 10, 2020 at 5:58 PM Robert Burke <ro...@frantil.com> wrote:

> Thank you for sharing Daniel!
>
> Resolving SplittableDoFns for the Go SDK even just as far as initial
> splitting will take the SDK that much closer to exiting its experimental
> status.
>
> It's especially exciting seeing this work on Flink and on the Python
> direct runner!
>
> On Fri, Jan 10, 2020, 5:36 PM Daniel Oliveira <da...@google.com>
> wrote:
>
>> Hey Beam devs,
>>
>> So several months ago I posted my Go SDF proposal and got a lot of good
>> feedback (thread
>> <https://lists.apache.org/thread.html/327bc72a0b30e18c6152b562bac2613c0edc942465d67b215830819e%40%3Cdev.beam.apache.org%3E>,
>> doc <https://s.apache.org/beam-go-sdf>). Since then I've been working on
>> implementing it and I've got an initial prototype ready to show off! It
>> works with initial splitting on Flink, and has a decently documented API.
>> Also in the second part of the email I'll also be proposing changes to the
>> original doc, based on my experience working on this prototype.
>>
>> To be clear, this is *not* ready to officially go into Beam yet; the API
>> is still likely to go through changes. Rather, I'm showing this off to show
>> that progress is being made on SDFs, and to provide some context to the
>> changes I'll be proposing below.
>>
>> Here's a link to the repo and branch so you can download it, and a link
>> to the changes specifically:
>> Repo: https://github.com/youngoli/beam/tree/gosdf
>> Changes:
>> https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831
>>
>> If you give it a try and have any thoughts, please let me know! I'm open
>> to any and all feedback.
>>
>> ==================================
>>
>> Proposed Changes
>> Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
>> history.)
>>
>> For anyone reading this who hasn't already read the doc above, I suggest
>> reading it first, since I'll be referring to concepts from it.
>>
>> After working on the prototype I've changed my mind on the original
>> decisions to go with an interface approach and a combined restriction +
>> tracker. But I don't want to go all in and create another doc with a
>> detailed proposal, so I've laid out a brief summary of the changes to get
>> some initial feedback before I go ahead and start working on these changes
>> in detail. Please let me know what you think!
>>
>> *1. Change from native Go interfaces to dynamic reflection-based API.*
>>
>> Instead of the native Go interfaces (SplittableDoFn, RProvider, and
>> RTracker) described in the doc and implemented in the prototype, use the
>> same dynamic approach that the Go SDK already uses for DoFns: Use the
>> reflection system to examine the names and signatures of methods in the
>> user's DoFn, RProvider, and RTracker.
>>
>> Original approach reasoning:
>>
>>    - Simpler, so faster to implement and less bug-prone.
>>    - The extra burden on the user to keep types consistent is ok since
>>    most users of SDFs are more advanced
>>
>> Change reasoning:
>>
>>    - In the prototype, I found interfaces to require too much extra
>>    boilerplate which added more complexity than expected. (Examples: Constant
>>    casting,
>>    - More consistent API: Inconsistency between regular DoFns (dynamic)
>>    and SDF API (interfaces) was jarring and unintuitive when implementing SDFs
>>    as a user.
>>
>> Implementation: Full details are up for discussion, but the goal is to
>> make the RProvider and  RTracker interfaces dynamic, so we can replace all
>> instances of interface{} in the methods with the actual element types
>> (i.e. fake generics). Also uses of the RProvider and RTracker interfaces in
>> signatures can be replaced with the implementations of those
>> providers/trackers. This will require a good amount of additional work in
>> the DoFn validation codebase and the code generator. Plus a fair amount of
>> additional user code validation will be needed and more testing since the
>> new code is more complex.
>>
>> *2. Seperate the restriction tracker and restriction.*
>>
>> Currently the API has the restriction combined with the tracker. In most
>> other SDKs and within the SDF model, the two are usually separate concepts,
>> and this change is to follow that approach and split the two.
>>
>> Original approach reasoning:
>>
>>    - It was considered simpler to avoid another level of type casting in
>>    the API with the interface approach.
>>
>> Change reasoning:
>>
>>    - We are no longer going with the interface approach. With "fake
>>    generics", it is simpler to keep the two concepts separate.
>>    - Requiring users to specify custom coders in order to only encode
>>    the restriction and not the tracker ended up adding additional complexity
>>    anyway.
>>
>> Implementation: In the API have the restriction tracker initialized with
>> a restriction object accessible via a getter. The restriction itself will
>> be the only thing serialized, so it will be wrapped and unwrapped with the
>> tracker before the user code is invoked. This wouldn't add very little work
>> as it would mostly be bundled with the interface->dynamic approach change.
>>
>>
>> Thanks,
>> Daniel Oliveira
>>
>

Re: Go SplittableDoFn prototype and proposed changes

Posted by Robert Burke <ro...@frantil.com>.
Thank you for sharing Daniel!

Resolving SplittableDoFns for the Go SDK even just as far as initial
splitting will take the SDK that much closer to exiting its experimental
status.

It's especially exciting seeing this work on Flink and on the Python direct
runner!

On Fri, Jan 10, 2020, 5:36 PM Daniel Oliveira <da...@google.com>
wrote:

> Hey Beam devs,
>
> So several months ago I posted my Go SDF proposal and got a lot of good
> feedback (thread
> <https://lists.apache.org/thread.html/327bc72a0b30e18c6152b562bac2613c0edc942465d67b215830819e%40%3Cdev.beam.apache.org%3E>,
> doc <https://s.apache.org/beam-go-sdf>). Since then I've been working on
> implementing it and I've got an initial prototype ready to show off! It
> works with initial splitting on Flink, and has a decently documented API.
> Also in the second part of the email I'll also be proposing changes to the
> original doc, based on my experience working on this prototype.
>
> To be clear, this is *not* ready to officially go into Beam yet; the API
> is still likely to go through changes. Rather, I'm showing this off to show
> that progress is being made on SDFs, and to provide some context to the
> changes I'll be proposing below.
>
> Here's a link to the repo and branch so you can download it, and a link to
> the changes specifically:
> Repo: https://github.com/youngoli/beam/tree/gosdf
> Changes:
> https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831
>
> If you give it a try and have any thoughts, please let me know! I'm open
> to any and all feedback.
>
> ==================================
>
> Proposed Changes
> Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
> history.)
>
> For anyone reading this who hasn't already read the doc above, I suggest
> reading it first, since I'll be referring to concepts from it.
>
> After working on the prototype I've changed my mind on the original
> decisions to go with an interface approach and a combined restriction +
> tracker. But I don't want to go all in and create another doc with a
> detailed proposal, so I've laid out a brief summary of the changes to get
> some initial feedback before I go ahead and start working on these changes
> in detail. Please let me know what you think!
>
> *1. Change from native Go interfaces to dynamic reflection-based API.*
>
> Instead of the native Go interfaces (SplittableDoFn, RProvider, and
> RTracker) described in the doc and implemented in the prototype, use the
> same dynamic approach that the Go SDK already uses for DoFns: Use the
> reflection system to examine the names and signatures of methods in the
> user's DoFn, RProvider, and RTracker.
>
> Original approach reasoning:
>
>    - Simpler, so faster to implement and less bug-prone.
>    - The extra burden on the user to keep types consistent is ok since
>    most users of SDFs are more advanced
>
> Change reasoning:
>
>    - In the prototype, I found interfaces to require too much extra
>    boilerplate which added more complexity than expected. (Examples: Constant
>    casting,
>    - More consistent API: Inconsistency between regular DoFns (dynamic)
>    and SDF API (interfaces) was jarring and unintuitive when implementing SDFs
>    as a user.
>
> Implementation: Full details are up for discussion, but the goal is to
> make the RProvider and  RTracker interfaces dynamic, so we can replace all
> instances of interface{} in the methods with the actual element types
> (i.e. fake generics). Also uses of the RProvider and RTracker interfaces in
> signatures can be replaced with the implementations of those
> providers/trackers. This will require a good amount of additional work in
> the DoFn validation codebase and the code generator. Plus a fair amount of
> additional user code validation will be needed and more testing since the
> new code is more complex.
>
> *2. Seperate the restriction tracker and restriction.*
>
> Currently the API has the restriction combined with the tracker. In most
> other SDKs and within the SDF model, the two are usually separate concepts,
> and this change is to follow that approach and split the two.
>
> Original approach reasoning:
>
>    - It was considered simpler to avoid another level of type casting in
>    the API with the interface approach.
>
> Change reasoning:
>
>    - We are no longer going with the interface approach. With "fake
>    generics", it is simpler to keep the two concepts separate.
>    - Requiring users to specify custom coders in order to only encode the
>    restriction and not the tracker ended up adding additional complexity
>    anyway.
>
> Implementation: In the API have the restriction tracker initialized with a
> restriction object accessible via a getter. The restriction itself will be
> the only thing serialized, so it will be wrapped and unwrapped with the
> tracker before the user code is invoked. This wouldn't add very little work
> as it would mostly be bundled with the interface->dynamic approach change.
>
>
> Thanks,
> Daniel Oliveira
>