Posted to dev@beam.apache.org by Luke Cwik <lc...@google.com> on 2020/06/12 15:49:34 UTC

DISCUSS: FnAPI proto stabilization

A few months back there was a discussion[1] about doing the work to
stabilize the protos used for pipeline execution, with an eye toward
cross-language pipelines and runners that want to use them across SDK
versions (Dataflow).

All the proposed incompatible clean-up tasks were done and made it into
2.21. Some tasks remain: documentation, removing some things that can be
removed in a backwards-compatible way, and a general re-organization within
the files to delineate what is stable and what isn't.

Beyond documenting the versioning story (sketch below) in a more durable
location than this ML, performing these last clean-up tasks, and the general
re-organization within the files, is there anything else that should be
done before we can vote and consider the protos to be stable? (That would
mean that 2.21 contains the first stable version, assuming no other
incompatible changes are suggested.)

The versioning story has 3 parts and effectively applies whenever
there is an incompatible change such as:
* adding a new field that semantically changes what is to be done
* removing a field that was effectively required
* requiring an SDK or runner to behave differently (e.g. support large
iterables, support a new API (such as a future map state for StatefulDoFns))

The three ways of handling versioning for incompatible changes are:
* many protos have URNs; when there is an incompatible change the URN
should be changed. If it is effectively the same thing, then this should
lead to a version bump and an update of the documentation reflecting what
the requirements of the new version are.
* there is a capabilities section on each environment; this should
enumerate everything the SDK can support: protocols (e.g. large iterables,
...), coders, well known transforms, ...
* there is a requirements section on the pipeline proto; this is an
enumeration of everything the SDK needs the runner to know to be able to
interpret the pipeline (e.g. splittable dofn, requires time sorted input,
...).
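
As a rough illustration of the first bullet, here is a minimal sketch of
what a URN version bump could look like. It assumes URNs end in a ":vN"
suffix; the helper names and the "example:..." URNs are hypothetical and are
not taken from the Beam protos.

```python
import re

# Assumption for this sketch: a URN ends in ":v<number>".
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+):v(?P<version>\d+)$")


def split_urn(urn):
    """Split 'some:name:v3' into ('some:name', 3)."""
    match = _VERSION_SUFFIX.match(urn)
    if match is None:
        raise ValueError(f"URN has no version suffix: {urn!r}")
    return match.group("base"), int(match.group("version"))


def bump_urn(urn):
    """Return the URN to use after an incompatible change."""
    base, version = split_urn(urn)
    return f"{base}:v{version + 1}"


# Coders a hypothetical runner understands, keyed by exact URN.
KNOWN_CODERS = {"example:coder:bytes:v1": "bytes coder, version 1"}


def lookup_coder(urn):
    """Fail loudly on an unknown URN instead of guessing at its meaning."""
    try:
        return KNOWN_CODERS[urn]
    except KeyError:
        raise LookupError(f"unsupported coder URN: {urn}") from None
```

Because the lookup is by exact URN, a runner that only knows the v1 entry
rejects the bumped URN outright rather than misinterpreting the new
semantics.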

Updating the URN of the transform/coder is typically the easiest way to
handle incompatible changes, followed by using the capabilities list to
enable new things (used like an allowlist) and the requirements list to
prevent runners from doing things they shouldn't (used like a denylist).
Many features/APIs that are part of the initial version are deliberately
left out of both the capabilities and requirements lists to avoid a huge
definition list. If it is ever necessary, they can be disabled in the future
by adding requirements that disable these currently unnamed features/APIs.
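
The allowlist/denylist distinction can be sketched roughly as follows. This
is only an illustration under assumed names: the function names and the
"example:..." URNs are hypothetical, and plain sets stand in for the
capabilities and requirements fields of the protos.

```python
def may_use(feature_urn, environment_capabilities):
    """Allowlist: a runner enables a new feature only if the SDK
    environment explicitly lists it as a capability."""
    return feature_urn in set(environment_capabilities)


def unsupported_requirements(pipeline_requirements, runner_understood):
    """Return the requirements the runner does not understand, sorted."""
    return sorted(set(pipeline_requirements) - set(runner_understood))


def check_pipeline(pipeline_requirements, runner_understood):
    """Denylist-like: one unknown requirement is enough to reject the
    pipeline, since the runner cannot interpret it correctly."""
    missing = unsupported_requirements(pipeline_requirements, runner_understood)
    if missing:
        raise ValueError("runner cannot execute pipeline, unknown "
                         "requirements: " + ", ".join(missing))


# Hypothetical URNs, for illustration only.
capabilities = {"example:protocol:large_iterables:v1"}
understood = {"example:requirement:splittable_dofn:v1"}

# Not listed as a capability, so the runner must not rely on it.
assert not may_use("example:protocol:map_state:v1", capabilities)

# Every requirement is understood, so this pipeline is accepted.
check_pipeline(["example:requirement:splittable_dofn:v1"], understood)
```

An empty requirements list always passes, which matches the idea that
features of the initial version stay unnamed until a requirement is needed
to change them.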

1:
https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E

Re: DISCUSS: FnAPI proto stabilization

Posted by Kenneth Knowles <ke...@apache.org>.

Getting back to this, I think Luke has outlined a good implementation
strategy. I have not followed progress on getting this documented durably
and voted on. Maybe a gdoc draft to vote on and then the web site, since it
should be *very* stable and also forms the new core of what Beam "is". It
should explain the concepts clearly at a high level, with good PR review of
any changes to the protocol and documentation.

Kenn

On Fri, Jun 12, 2020 at 1:14 PM Udi Meiri <eh...@google.com> wrote:

> I'm not very familiar with this effort.
> Were there ITs / POCs created for these changes? (to surface any obvious
> bugs)
> Are these changes usable in DirectRunner?

Re: DISCUSS: FnAPI proto stabilization

Posted by Udi Meiri <eh...@google.com>.

I'm not very familiar with this effort.
Were there ITs / POCs created for these changes? (to surface any obvious
bugs)
Are these changes usable in DirectRunner?

