You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Henning Rohde <he...@google.com> on 2017/12/20 00:18:56 UTC

[DISCUSS] Guidelines for merging new runners and SDKs into master

Hi everyone,

 As part of the Go SDK development, I was looking at the guidelines for
merging new runners and SDKs into master [1] and I think they would benefit
from being updated to reflect the emerging portability framework. Specific
suggestions:

(1) Both runners and SDKs should support the portability framework (to the
extent the model is supported by the runner/SDK). It would be
counter-productive at this time for the ecosystem to go against that effort
without a compelling reason. Direct runners not included.

(2) What are the minimal set of IO connectors a new SDK must support? Given
the upcoming cross-language feature in the portability framework, can we
rely on that to meet the requirement
without implementing any native IO connectors?

(3) Similarly to new runners, new SDKs should handle at least a useful
subset of the model, but not necessarily the whole model (at the time of
merge). A global-window-batch-only SDK targeting the portability framework,
for example, could be as useful a contribution in master as a full model
SDK that is supported by a direct runner only. Of course, this is not to
say that SDKs should not strive to support the full model, but rather --
like Python streaming -- that it's fine to pursue that goal in master
beyond a certain point. That said, I'm curious as to why this guideline for
SDKs was set that specifically originally.

Finally, while portability support for various features -- such as side
input, cross-language I/O and the reference runner -- is still underway,
what should the guidelines be? For the Go SDK specifically, if in master,
it would bring the additional utility of helping test the portability
framework as it's being developed. On the other hand, it can't support
features that do not yet exist.

What do you all think?

Thanks,
 Henning

[1] https://beam.apache.org/contribute/feature-branches/

Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

Posted by Lukasz Cwik <lc...@google.com>.

I would add that we should really be looking for test suites and test
utilities when a new runner/SDK is merged. The tests really drive the
majority of our validation and really help reduce maintenance burden. I'm
thinking of the suite of ValidatesRunner tests that live within the Java
and Python SDKs. The validation really goes both ways, it finds bugs in
execution assumptions on the Runner and SDK sides. Also concepts like
PAssert, TestStream, DoFnTester, source testing utilities, coder testing
utilities, ... are very useful as well.

On Wed, Dec 20, 2017 at 9:48 PM, Robert Bradshaw <ro...@google.com>
wrote:

> On Wed, Dec 20, 2017 at 6:45 PM, Henning Rohde <he...@google.com> wrote:
> >> > (3) Similarly to new runners, new SDKs should handle at least a useful
> >> > subset of the model, but not necessarily the whole model (at the time
> of
> >> > merge). A global-window-batch-only SDK targeting the portability
> >> > framework,
> >> > for example, could be as useful a contribution in master as a full
> model
> >> > SDK
> >> > that is supported by a direct runner only. Of course, this is not to
> say
> >> > that SDKs should not strive to support the full model, but rather --
> >> > like
> >> > Python streaming -- that it's fine to pursue that goal in master
> beyond
> >> > a
> >> > certain point. That said, I'm curious as to why this guideline for
> SDKs
> >> > was
> >> > set that specifically originally.
> >>
> >> While I don't think full model completeness is a feasible goal for
> >> merging into master (e.g. metadata driven triggers and retractions
> >> aren't even fully fleshed out yet), there are certain core pieces of
> >> the model that must be present, the notion of windowing among them. In
> >> addition, event-time driven windowing is one of the distinguishing
> >> features of Beam so it seems a regression to have SDKs without it, and
> >> may affect design choices like how windows and timestamps are assigned
> >> or inspected. Also, from a pragmatic point of view, accounting and
> >> tracking the fact that each element has an associated window and
> >> timestamp that must flow through the pipeline and taken into account
> >> during grouping is not something that is easily bolted on later to a
> >> global-window-batch-only SDK, and should be built in from the start,
> >> not offered as a vague promise someone will get to
> >> post-merge-to-master.
> >>
> >> I'd be OK supporting only a subset of WindowFns, but more than
> >> GlobalWindows (equivalent to "we'll just ignore the window
> >> everywhere") only.
> >>
> >> FWIW, streaming vs. batch is not part of the "model" per se, it's an
> >> operational difference. The full model was present in the SDK before
> >> any streaming backends were ready.
> >
> > These are good points. I am more thinking about it from a viewpoint of
> what
> > makes a useful contribution. For runners, for example, the guidelines
> allow
> > for a narrower focus -- traditional batch is called out -- and I think
> that
> > makes sense for SDKs as well. In both cases, one focus or another might
> make
> > it harder to support the full model later, but that seems beyond the
> scope
> > of general guideline. The portability framework hopefully also makes it
> less
> > likely that any particular design choice is too expensive to change
> later by
> > the added isolation.
>
> Some runners are batch only, some runners are streaming only, but all
> support windowing. Fortunately the portability framework makes it
> easier to support this. To not have windowing at all is a too sever
> omission in model in my book, and reflects a significant omission in
> the API and likely implementation as well.
>
> If it's easy to add, it's not a big hurdle. If it's difficult to
> retrofit after the fact, all the more reason to get it in early :).
>
> >> One more point of consideration, we should probably have at least one
> >> committer committed to (and in a position to) support it.
> >
> > Makes sense, although I hadn't really thought about it for Go until now.
> > What would you suggest for new committer-less SDKs/runners, where a fair
> > chunk of the code would almost by construction be unfamiliar to
> committers?
> > A cool part of Beam portability IMO is that it opens the door for
> > non-mainstream languages to participate.
>
> That's a good question. I would hope that by the time an SDK becomes
> mature enough to merge, at least some of the participants have become
> involved enough with the community to merit being committers.
>
> - Robert
>

Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

Posted by Robert Bradshaw <ro...@google.com>.

On Wed, Dec 20, 2017 at 6:45 PM, Henning Rohde <he...@google.com> wrote:
>> > (3) Similarly to new runners, new SDKs should handle at least a useful
>> > subset of the model, but not necessarily the whole model (at the time of
>> > merge). A global-window-batch-only SDK targeting the portability
>> > framework,
>> > for example, could be as useful a contribution in master as a full model
>> > SDK
>> > that is supported by a direct runner only. Of course, this is not to say
>> > that SDKs should not strive to support the full model, but rather --
>> > like
>> > Python streaming -- that it's fine to pursue that goal in master beyond
>> > a
>> > certain point. That said, I'm curious as to why this guideline for SDKs
>> > was
>> > set that specifically originally.
>>
>> While I don't think full model completeness is a feasible goal for
>> merging into master (e.g. metadata driven triggers and retractions
>> aren't even fully fleshed out yet), there are certain core pieces of
>> the model that must be present, the notion of windowing among them. In
>> addition, event-time driven windowing is one of the distinguishing
>> features of Beam so it seems a regression to have SDKs without it, and
>> may affect design choices like how windows and timestamps are assigned
>> or inspected. Also, from a pragmatic point of view, accounting and
>> tracking the fact that each element has an associated window and
>> timestamp that must flow through the pipeline and taken into account
>> during grouping is not something that is easily bolted on later to a
>> global-window-batch-only SDK, and should be built in from the start,
>> not offered as a vague promise someone will get to
>> post-merge-to-master.
>>
>> I'd be OK supporting only a subset of WindowFns, but more than
>> GlobalWindows (equivalent to "we'll just ignore the window
>> everywhere") only.
>>
>> FWIW, streaming vs. batch is not part of the "model" per se, it's an
>> operational difference. The full model was present in the SDK before
>> any streaming backends were ready.
>
> These are good points. I am more thinking about it from a viewpoint of what
> makes a useful contribution. For runners, for example, the guidelines allow
> for a narrower focus -- traditional batch is called out -- and I think that
> makes sense for SDKs as well. In both cases, one focus or another might make
> it harder to support the full model later, but that seems beyond the scope
> of general guideline. The portability framework hopefully also makes it less
> likely that any particular design choice is too expensive to change later by
> the added isolation.

Some runners are batch only, some runners are streaming only, but all
support windowing. Fortunately the portability framework makes it
easier to support this. To not have windowing at all is a too sever
omission in model in my book, and reflects a significant omission in
the API and likely implementation as well.

If it's easy to add, it's not a big hurdle. If it's difficult to
retrofit after the fact, all the more reason to get it in early :).

>> One more point of consideration, we should probably have at least one
>> committer committed to (and in a position to) support it.
>
> Makes sense, although I hadn't really thought about it for Go until now.
> What would you suggest for new committer-less SDKs/runners, where a fair
> chunk of the code would almost by construction be unfamiliar to committers?
> A cool part of Beam portability IMO is that it opens the door for
> non-mainstream languages to participate.

That's a good question. I would hope that by the time an SDK becomes
mature enough to merge, at least some of the participants have become
involved enough with the community to merit being committers.

- Robert

Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

Posted by Henning Rohde <he...@google.com>.

Thanks for the comments!


> > (2) What are the minimal set of IO connectors a new SDK must support?
> Given
> > the upcoming cross-language feature in the portability framework, can we
> > rely on that to meet the requirement
> > without implementing any native IO connectors?
>
> It could be argued that there needs to be enough IO to write
> end-to-end examples such as WordCount and demonstrate what IOs would
> look like. TextIO may satisfy this. Once we have cross-language
> universal local runners, we could eschew even that (and perhaps the
> requirement would be simply that it runs against that runner).
>

Yes -- TextIO is something that naturally is added with a direct runner,
but with cross-language IO it may be a toy version for that purpose alone.
Real use might use a more robust version from another SDK with more
supported filesystems, for example. The Go SDK is de facto sort of leaning
towards that approach.


> > (3) Similarly to new runners, new SDKs should handle at least a useful
> > subset of the model, but not necessarily the whole model (at the time of
> > merge). A global-window-batch-only SDK targeting the portability
> framework,
> > for example, could be as useful a contribution in master as a full model
> SDK
> > that is supported by a direct runner only. Of course, this is not to say
> > that SDKs should not strive to support the full model, but rather -- like
> > Python streaming -- that it's fine to pursue that goal in master beyond a
> > certain point. That said, I'm curious as to why this guideline for SDKs
> was
> > set that specifically originally.
>
> While I don't think full model completeness is a feasible goal for
> merging into master (e.g. metadata driven triggers and retractions
> aren't even fully fleshed out yet), there are certain core pieces of
> the model that must be present, the notion of windowing among them. In
> addition, event-time driven windowing is one of the distinguishing
> features of Beam so it seems a regression to have SDKs without it, and
> may affect design choices like how windows and timestamps are assigned
> or inspected. Also, from a pragmatic point of view, accounting and
> tracking the fact that each element has an associated window and
> timestamp that must flow through the pipeline and taken into account
> during grouping is not something that is easily bolted on later to a
> global-window-batch-only SDK, and should be built in from the start,
> not offered as a vague promise someone will get to
> post-merge-to-master.
>
> I'd be OK supporting only a subset of WindowFns, but more than
> GlobalWindows (equivalent to "we'll just ignore the window
> everywhere") only.
>
> FWIW, streaming vs. batch is not part of the "model" per se, it's an
> operational difference. The full model was present in the SDK before
> any streaming backends were ready.
>

These are good points. I am more thinking about it from a viewpoint of what
makes a useful contribution. For runners, for example, the guidelines allow
for a narrower focus -- traditional batch is called out -- and I think that
makes sense for SDKs as well. In both cases, one focus or another might
make it harder to support the full model later, but that seems beyond the
scope of general guideline. The portability framework hopefully also makes
it less likely that any particular design choice is too expensive to change
later by the added isolation.


> > Finally, while portability support for various features -- such as side
> > input, cross-language I/O and the reference runner -- is still underway,
> > what should the guidelines be? For the Go SDK specifically, if in
> master, it
> > would bring the additional utility of helping test the portability
> framework
> > as it's being developed. On the other hand, it can't support features
> that
> > do not yet exist.
>
> Fortunately I think we have a little bit of time to get the full
> portability story into place before the Go SDK is ready to be merged.
> (On the note of helping development, I don't see anything that the Go
> SDK could offer specifically that the Python SDK can't.)
>

Indeed :). There is nothing specific the Go SDK would offer for helping
development other than better exercising the framework.


> In short, I think the list at
> https://beam.apache.org/contribute/feature-branches/ stands, with the
> additional requirement of Fn API support, and on that note (3) may be
> the (FnApi speaking) Reference Runner against which the IOs for (2)
> could be more easily satisfied.
>
> One more point of consideration, we should probably have at least one
> committer committed to (and in a position to) support it.
>

Makes sense, although I hadn't really thought about it for Go until now. What
would you suggest for new committer-less SDKs/runners, where a fair chunk
of the code would almost by construction be unfamiliar to committers? A
cool part of Beam portability IMO is that it opens the door for
non-mainstream languages to participate.

Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

Posted by Chamikara Jayalath <ch...@google.com>.

On Wed, Dec 20, 2017 at 2:35 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, Dec 19, 2017 at 4:18 PM, Henning Rohde <he...@google.com> wrote:
> > Hi everyone,
> >
> >  As part of the Go SDK development, I was looking at the guidelines for
> > merging new runners and SDKs into master [1] and I think they would
> benefit
> > from being updated to reflect the emerging portability framework.
> Specific
> > suggestions:
> >
> > (1) Both runners and SDKs should support the portability framework (to
> the
> > extent the model is supported by the runner/SDK). It would be
> > counter-productive at this time for the ecosystem to go against that
> effort
> > without a compelling reason. Direct runners not included.
>
> +1
>
> > (2) What are the minimal set of IO connectors a new SDK must support?
> Given
> > the upcoming cross-language feature in the portability framework, can we
> > rely on that to meet the requirement
> > without implementing any native IO connectors?
>
> It could be argued that there needs to be enough IO to write
> end-to-end examples such as WordCount and demonstrate what IOs would
> look like. TextIO may satisfy this. Once we have cross-language
> universal local runners, we could eschew even that (and perhaps the
> requirement would be simply that it runs against that runner).
>

This is probably up to individual SDK authors but I feel like we should at
least strongly encourage making sources framework (SDF if we are
considering future) and other I/O utilities (for example source testing
framework, FileIO) available for all SDKs. This will allow I/O authors to
freely decide the SDK(s) where I/O is natively implemented based on
criteria such as (1) availability of client libraries (2) efficiency (3)
usability (4) other business requirements. Rest of the SDKs could use the
I/O in the form of a cross-language transform.

- Cham


> > (3) Similarly to new runners, new SDKs should handle at least a useful
> > subset of the model, but not necessarily the whole model (at the time of
> > merge). A global-window-batch-only SDK targeting the portability
> framework,
> > for example, could be as useful a contribution in master as a full model
> SDK
> > that is supported by a direct runner only. Of course, this is not to say
> > that SDKs should not strive to support the full model, but rather -- like
> > Python streaming -- that it's fine to pursue that goal in master beyond a
> > certain point. That said, I'm curious as to why this guideline for SDKs
> was
> > set that specifically originally.
>
> While I don't think full model completeness is a feasible goal for
> merging into master (e.g. metadata driven triggers and retractions
> aren't even fully fleshed out yet), there are certain core pieces of
> the model that must be present, the notion of windowing among them. In
> addition, event-time driven windowing is one of the distinguishing
> features of Beam so it seems a regression to have SDKs without it, and
> may affect design choices like how windows and timestamps are assigned
> or inspected. Also, from a pragmatic point of view, accounting and
> tracking the fact that each element has an associated window and
> timestamp that must flow through the pipeline and taken into account
> during grouping is not something that is easily bolted on later to a
> global-window-batch-only SDK, and should be built in from the start,
> not offered as a vague promise someone will get to
> post-merge-to-master.
>
> I'd be OK supporting only a subset of WindowFns, but more than
> GlobalWindows (equivalent to "we'll just ignore the window
> everywhere") only.
>
> FWIW, streaming vs. batch is not part of the "model" per se, it's an
> operational difference. The full model was present in the SDK before
> any streaming backends were ready.
>
> > Finally, while portability support for various features -- such as side
> > input, cross-language I/O and the reference runner -- is still underway,
> > what should the guidelines be? For the Go SDK specifically, if in
> master, it
> > would bring the additional utility of helping test the portability
> framework
> > as it's being developed. On the other hand, it can't support features
> that
> > do not yet exist.
>
> Fortunately I think we have a little bit of time to get the full
> portability story into place before the Go SDK is ready to be merged.
> (On the note of helping development, I don't see anything that the Go
> SDK could offer specifically that the Python SDK can't.)
>
> In short, I think the list at
> https://beam.apache.org/contribute/feature-branches/ stands, with the
> additional requirement of Fn API support, and on that note (3) may be
> the (FnApi speaking) Reference Runner against which the IOs for (2)
> could be more easily satisfied.
>
> One more point of consideration, we should probably have at least one
> committer committed to (and in a position to) support it.
>
> - Robert
>

Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, Dec 19, 2017 at 4:18 PM, Henning Rohde <he...@google.com> wrote:
> Hi everyone,
>
>  As part of the Go SDK development, I was looking at the guidelines for
> merging new runners and SDKs into master [1] and I think they would benefit
> from being updated to reflect the emerging portability framework. Specific
> suggestions:
>
> (1) Both runners and SDKs should support the portability framework (to the
> extent the model is supported by the runner/SDK). It would be
> counter-productive at this time for the ecosystem to go against that effort
> without a compelling reason. Direct runners not included.

+1

> (2) What are the minimal set of IO connectors a new SDK must support? Given
> the upcoming cross-language feature in the portability framework, can we
> rely on that to meet the requirement
> without implementing any native IO connectors?

It could be argued that there needs to be enough IO to write
end-to-end examples such as WordCount and demonstrate what IOs would
look like. TextIO may satisfy this. Once we have cross-language
universal local runners, we could eschew even that (and perhaps the
requirement would be simply that it runs against that runner).

> (3) Similarly to new runners, new SDKs should handle at least a useful
> subset of the model, but not necessarily the whole model (at the time of
> merge). A global-window-batch-only SDK targeting the portability framework,
> for example, could be as useful a contribution in master as a full model SDK
> that is supported by a direct runner only. Of course, this is not to say
> that SDKs should not strive to support the full model, but rather -- like
> Python streaming -- that it's fine to pursue that goal in master beyond a
> certain point. That said, I'm curious as to why this guideline for SDKs was
> set that specifically originally.

While I don't think full model completeness is a feasible goal for
merging into master (e.g. metadata driven triggers and retractions
aren't even fully fleshed out yet), there are certain core pieces of
the model that must be present, the notion of windowing among them. In
addition, event-time driven windowing is one of the distinguishing
features of Beam so it seems a regression to have SDKs without it, and
may affect design choices like how windows and timestamps are assigned
or inspected. Also, from a pragmatic point of view, accounting and
tracking the fact that each element has an associated window and
timestamp that must flow through the pipeline and taken into account
during grouping is not something that is easily bolted on later to a
global-window-batch-only SDK, and should be built in from the start,
not offered as a vague promise someone will get to
post-merge-to-master.

I'd be OK supporting only a subset of WindowFns, but more than
GlobalWindows (equivalent to "we'll just ignore the window
everywhere") only.

FWIW, streaming vs. batch is not part of the "model" per se, it's an
operational difference. The full model was present in the SDK before
any streaming backends were ready.

> Finally, while portability support for various features -- such as side
> input, cross-language I/O and the reference runner -- is still underway,
> what should the guidelines be? For the Go SDK specifically, if in master, it
> would bring the additional utility of helping test the portability framework
> as it's being developed. On the other hand, it can't support features that
> do not yet exist.

Fortunately I think we have a little bit of time to get the full
portability story into place before the Go SDK is ready to be merged.
(On the note of helping development, I don't see anything that the Go
SDK could offer specifically that the Python SDK can't.)

In short, I think the list at
https://beam.apache.org/contribute/feature-branches/ stands, with the
additional requirement of Fn API support, and on that note (3) may be
the (FnApi speaking) Reference Runner against which the IOs for (2)
could be more easily satisfied.

One more point of consideration, we should probably have at least one
committer committed to (and in a position to) support it.

- Robert