You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Saavan Nanavati <sa...@google.com> on 2020/06/29 22:22:01 UTC

[Proposal] Supporting Python Annotation Typehints on PTransform

Hi all!

Currently, in the Python SDK, we don't support annotation-style type hints
for PTransforms.

This email includes a proposal to support PEP 484 annotations on
PTransform's expand() function, and would allow you to write something like
the following:

class MapStrToInt(beam.PTransform):
def expand(pcoll: PCollection[str]) -> PCollection[int]:
return pcoll | beam.Map(lambda elem: int(elem)

You can view the proposal here
<https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit?usp=sharing>,
the WIP PR here <https://github.com/apache/beam/pull/12009>, or the full
text of the proposal below.

Let me know what you think, or if you have any suggestions for
improvements. Thanks!

Best,
Saavan

---------------

Supporting Python Annotations on PTransforms

https://issues.apache.org/jira/browse/BEAM-10258

Author: Saavan Nanavati (saavan@google.com <au...@example.org>)

Reviewer: Udi Meiri (ehudm@google.com)

Overview

For any developer working in a dynamically-typed language such as Python,
type safety is an important consideration that can help reduce runtime
bugs, speed up the development process with IDE-assisted code completion
and type hints using static type-checkers such as MyPy, and aid
investigations into otherwise complex error messages, with regards to Coder
objects, for instance.

Due to the complex nature of pipelines in Beam, an internal type-checking
system is employed by Beam to fill in the gaps that static type-checkers
have (such as failing to match type hints across PTransforms or
type-checking at runtime). These type hints are used to raise errors during
pipeline construction and, optionally, at runtime as well (but at the
expense of pipeline performance).

This design document outlines a proposed improvement that would allow
Beam’s internal typing system to integrate and utilize type hint
annotations (that conform to PEP 484
<https://www.python.org/dev/peps/pep-0484/>) on PTransform’s primary
method, expand(...).

Table of Contents

Background
<https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.drvuif6t4bls>

Goals
<https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.wlzpbos00ni>

Design Details
<https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.qds1titcbwpu>

Estimate of Work
<https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.7bp0u7a4fgs>

Background

There are currently 3 ways to declare type hints in Beam pipelines.


   1.

   Inline
   1.

      with_input_types(...)
      2.

      with_input_types(...)
      2.

   Decorators
   1.

      @beam.typehints.with_input_types(...)
      2.

      @beam.typehints.with_output_types(...)
      3.

   Annotations
   1.

      def func(elem: int) -> str:
      ...


Since inline hints are not associated with PTransforms themselves,
decorators and annotations are the preferred methods for code that is going
to be reused, however annotations are not currently supported for
PTransforms. This proposed feature would fix that, allowing you to write
code akin to the following:

import apache_beam as beam

from apache_beam.pvalue import PCollection

class MapStrToInt(beam.PTransform):

def expand(pcoll: PCollection[str]) -> PCollection[int]:

return pcoll | beam.Map(lambda elem: int(elem)

This is in contrast to the recommended way of type-hinting PTransforms in
the status quo, shown below:

import apache_beam as beam

@beam.typehints.with_input_types(str)

@beam.typehints.with_input_types(int)

class MapStrToInt(beam.PTransform):

def expand(pcoll):

return pcoll | beam.Map(lambda elem: int(elem)

In addition to providing the end-user a number of approaches for how they
can type-hint their pipelines, thereby expanding coverage for Beam’s
internal typing system, there is potentially an additional benefit for
static type checkers like MyPy to be able to catch errors, since they are
only capable of working with PEP 484 annotations.


Goals

   -

   Support PEP 484 annotations on the input/output PCollection of the
   expand(...) method of PTransforms
   -

   Update the documentation on Ensuring Python Type Safety
   <https://beam.apache.org/documentation/sdks/python-type-safety/> to
   reflect these changes


Design Details

First, for Beam’s internal typing system to work with PCollections,
PCollection must be capable of being used as a type. This is currently
already the case since PCollection inherits from Generic[T].

Second, Beam’s typing system must be able to work with annotations like
PCollection[T].  This can be done by modelling the approach that DoFn takes
for analyzing annotations on its’ process(...) method. To elaborate, DoFn
does this by implementing the default_type_hints(...) method from the
inherited WithTypeHints mixin class, which both DoFn and PTransform inherit
from, and using IOTypeHints.from_callable(...) to convert the annotations
on process(...) into an IOTypeHints object that the internal Beam typing
system can then use. A similar method can be used by PTransform on its’
expand(...) method, however the outer PCollection would need to be stripped
from the input and output types before the annotations can be compatible
with the rest of the Beam typing system.

There are some additional considerations to keep in mind however.

First, PBegin[T] and PDone[T] must also be supported as valid outer types
for PTransform annotations. This requires both classes to inherit from
Generic[T], which they do not currently.

Second, PCollection by itself, without any internal type hint T, must be
supported. In this case, T will be presumed to be of type any.

This is currently the only design approach under consideration.
Estimate of Work

The majority of the work is already complete, and available to view in this
PR <https://github.com/apache/beam/pull/12009>.


There is still some work TBD - more tests need to be added in order to
confirm edge-case scenarios work properly. These include:

   -

   Tests for PBegin/PDone annotations
   -

   Tests for PCollection annotations with multiple internal types
   -

   Tests for PCollection annotations with a nested PCollection


Additionally, more error checking is needed throughout the PR.

Lastly, the website documentation needs to be updated.

With the holidays coming up this week (July 4 weekend), this work can be
estimated to be complete around July 8. This is not including the time it
takes to run against internal tests in Google, some of which may or may not
be broken due to this change, therefore the time-to-merge will likely be
longer (possibly by a week or two).

Re: [Proposal] Supporting Python Annotation Typehints on PTransform

Posted by Robert Bradshaw <ro...@google.com>.
Huge +1 to this general direction, the design looks reasonable, and
from what I've seen so far of the PR this looks sound. Thanks for
pushing on this.

On Tue, Jul 7, 2020 at 6:10 PM Udi Meiri <eh...@google.com> wrote:
>
> Thank you Saavan, bumping this in case people missed it.
>
>
> On Mon, Jun 29, 2020 at 3:22 PM Saavan Nanavati <sa...@google.com> wrote:
>>
>> Hi all!
>>
>> Currently, in the Python SDK, we don't support annotation-style type hints for PTransforms.
>>
>> This email includes a proposal to support PEP 484 annotations on PTransform's expand() function, and would allow you to write something like the following:
>>
>> class MapStrToInt(beam.PTransform):
>> def expand(pcoll: PCollection[str]) -> PCollection[int]:
>> return pcoll | beam.Map(lambda elem: int(elem)
>>
>> You can view the proposal here, the WIP PR here, or the full text of the proposal below.
>>
>> Let me know what you think, or if you have any suggestions for improvements. Thanks!
>>
>> Best,
>> Saavan
>>
>> ---------------
>>
>> Supporting Python Annotations on PTransforms
>>
>> https://issues.apache.org/jira/browse/BEAM-10258
>>
>> Author: Saavan Nanavati (saavan@google.com)
>>
>> Reviewer: Udi Meiri (ehudm@google.com)
>>
>>
>> Overview
>>
>>
>> For any developer working in a dynamically-typed language such as Python, type safety is an important consideration that can help reduce runtime bugs, speed up the development process with IDE-assisted code completion and type hints using static type-checkers such as MyPy, and aid investigations into otherwise complex error messages, with regards to Coder objects, for instance.
>>
>>
>> Due to the complex nature of pipelines in Beam, an internal type-checking system is employed by Beam to fill in the gaps that static type-checkers have (such as failing to match type hints across PTransforms or type-checking at runtime). These type hints are used to raise errors during pipeline construction and, optionally, at runtime as well (but at the expense of pipeline performance).
>>
>>
>> This design document outlines a proposed improvement that would allow Beam’s internal typing system to integrate and utilize type hint annotations (that conform to PEP 484) on PTransform’s primary method, expand(...).
>>
>>
>> Table of Contents
>>
>> Background
>>
>> Goals
>>
>> Design Details
>>
>> Estimate of Work
>>
>>
>> Background
>>
>>
>> There are currently 3 ways to declare type hints in Beam pipelines.
>>
>>
>> Inline
>>
>> with_input_types(...)
>>
>> with_input_types(...)
>>
>> Decorators
>>
>> @beam.typehints.with_input_types(...)
>>
>> @beam.typehints.with_output_types(...)
>>
>> Annotations
>>
>> def func(elem: int) -> str:
>> ...
>>
>>
>> Since inline hints are not associated with PTransforms themselves, decorators and annotations are the preferred methods for code that is going to be reused, however annotations are not currently supported for PTransforms. This proposed feature would fix that, allowing you to write code akin to the following:
>>
>>
>> import apache_beam as beam
>>
>> from apache_beam.pvalue import PCollection
>>
>>
>> class MapStrToInt(beam.PTransform):
>>
>> def expand(pcoll: PCollection[str]) -> PCollection[int]:
>>
>> return pcoll | beam.Map(lambda elem: int(elem)
>>
>>
>> This is in contrast to the recommended way of type-hinting PTransforms in the status quo, shown below:
>>
>>
>> import apache_beam as beam
>>
>>
>> @beam.typehints.with_input_types(str)
>>
>> @beam.typehints.with_input_types(int)
>>
>> class MapStrToInt(beam.PTransform):
>>
>> def expand(pcoll):
>>
>> return pcoll | beam.Map(lambda elem: int(elem)
>>
>>
>> In addition to providing the end-user a number of approaches for how they can type-hint their pipelines, thereby expanding coverage for Beam’s internal typing system, there is potentially an additional benefit for static type checkers like MyPy to be able to catch errors, since they are only capable of working with PEP 484 annotations.
>>
>>
>>
>> Goals
>>
>>
>> Support PEP 484 annotations on the input/output PCollection of the expand(...) method of PTransforms
>>
>> Update the documentation on Ensuring Python Type Safety to reflect these changes
>>
>>
>> Design Details
>>
>>
>> First, for Beam’s internal typing system to work with PCollections, PCollection must be capable of being used as a type. This is currently already the case since PCollection inherits from Generic[T].
>>
>>
>> Second, Beam’s typing system must be able to work with annotations like PCollection[T].  This can be done by modelling the approach that DoFn takes for analyzing annotations on its’ process(...) method. To elaborate, DoFn does this by implementing the default_type_hints(...) method from the inherited WithTypeHints mixin class, which both DoFn and PTransform inherit from, and using IOTypeHints.from_callable(...) to convert the annotations on process(...) into an IOTypeHints object that the internal Beam typing system can then use. A similar method can be used by PTransform on its’ expand(...) method, however the outer PCollection would need to be stripped from the input and output types before the annotations can be compatible with the rest of the Beam typing system.
>>
>>
>> There are some additional considerations to keep in mind however.
>>
>>
>> First, PBegin[T] and PDone[T] must also be supported as valid outer types for PTransform annotations. This requires both classes to inherit from Generic[T], which they do not currently.
>>
>>
>> Second, PCollection by itself, without any internal type hint T, must be supported. In this case, T will be presumed to be of type any.
>>
>>
>> This is currently the only design approach under consideration.
>>
>> Estimate of Work
>>
>>
>> The majority of the work is already complete, and available to view in this PR.
>>
>>
>> There is still some work TBD - more tests need to be added in order to confirm edge-case scenarios work properly. These include:
>>
>> Tests for PBegin/PDone annotations
>>
>> Tests for PCollection annotations with multiple internal types
>>
>> Tests for PCollection annotations with a nested PCollection
>>
>>
>> Additionally, more error checking is needed throughout the PR.
>>
>>
>> Lastly, the website documentation needs to be updated.
>>
>>
>> With the holidays coming up this week (July 4 weekend), this work can be estimated to be complete around July 8. This is not including the time it takes to run against internal tests in Google, some of which may or may not be broken due to this change, therefore the time-to-merge will likely be longer (possibly by a week or two).
>>
>>

Re: [Proposal] Supporting Python Annotation Typehints on PTransform

Posted by Udi Meiri <eh...@google.com>.
Thank you Saavan, bumping this in case people missed it.


On Mon, Jun 29, 2020 at 3:22 PM Saavan Nanavati <sa...@google.com> wrote:

> Hi all!
>
> Currently, in the Python SDK, we don't support annotation-style type hints
> for PTransforms.
>
> This email includes a proposal to support PEP 484 annotations on
> PTransform's expand() function, and would allow you to write something like
> the following:
>
> class MapStrToInt(beam.PTransform):
> def expand(pcoll: PCollection[str]) -> PCollection[int]:
> return pcoll | beam.Map(lambda elem: int(elem)
>
> You can view the proposal here
> <https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit?usp=sharing>,
> the WIP PR here <https://github.com/apache/beam/pull/12009>, or the full
> text of the proposal below.
>
> Let me know what you think, or if you have any suggestions for
> improvements. Thanks!
>
> Best,
> Saavan
>
> ---------------
>
> Supporting Python Annotations on PTransforms
>
> https://issues.apache.org/jira/browse/BEAM-10258
>
> Author: Saavan Nanavati (saavan@google.com <au...@example.org>)
>
> Reviewer: Udi Meiri (ehudm@google.com)
>
> Overview
>
> For any developer working in a dynamically-typed language such as Python,
> type safety is an important consideration that can help reduce runtime
> bugs, speed up the development process with IDE-assisted code completion
> and type hints using static type-checkers such as MyPy, and aid
> investigations into otherwise complex error messages, with regards to Coder
> objects, for instance.
>
> Due to the complex nature of pipelines in Beam, an internal type-checking
> system is employed by Beam to fill in the gaps that static type-checkers
> have (such as failing to match type hints across PTransforms or
> type-checking at runtime). These type hints are used to raise errors during
> pipeline construction and, optionally, at runtime as well (but at the
> expense of pipeline performance).
>
> This design document outlines a proposed improvement that would allow
> Beam’s internal typing system to integrate and utilize type hint
> annotations (that conform to PEP 484
> <https://www.python.org/dev/peps/pep-0484/>) on PTransform’s primary
> method, expand(...).
>
> Table of Contents
>
> Background
> <https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.drvuif6t4bls>
>
> Goals
> <https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.wlzpbos00ni>
>
> Design Details
> <https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.qds1titcbwpu>
>
> Estimate of Work
> <https://docs.google.com/document/d/1B91mV_KSJvkQopvIqa1LhV0pkPfMEeavKHb6Ql9GFaM/edit#heading=h.7bp0u7a4fgs>
>
> Background
>
> There are currently 3 ways to declare type hints in Beam pipelines.
>
>
>    1.
>
>    Inline
>    1.
>
>       with_input_types(...)
>       2.
>
>       with_input_types(...)
>       2.
>
>    Decorators
>    1.
>
>       @beam.typehints.with_input_types(...)
>       2.
>
>       @beam.typehints.with_output_types(...)
>       3.
>
>    Annotations
>    1.
>
>       def func(elem: int) -> str:
>       ...
>
>
> Since inline hints are not associated with PTransforms themselves,
> decorators and annotations are the preferred methods for code that is going
> to be reused, however annotations are not currently supported for
> PTransforms. This proposed feature would fix that, allowing you to write
> code akin to the following:
>
> import apache_beam as beam
>
> from apache_beam.pvalue import PCollection
>
> class MapStrToInt(beam.PTransform):
>
> def expand(pcoll: PCollection[str]) -> PCollection[int]:
>
> return pcoll | beam.Map(lambda elem: int(elem)
>
> This is in contrast to the recommended way of type-hinting PTransforms in
> the status quo, shown below:
>
> import apache_beam as beam
>
> @beam.typehints.with_input_types(str)
>
> @beam.typehints.with_input_types(int)
>
> class MapStrToInt(beam.PTransform):
>
> def expand(pcoll):
>
> return pcoll | beam.Map(lambda elem: int(elem)
>
> In addition to providing the end-user a number of approaches for how they
> can type-hint their pipelines, thereby expanding coverage for Beam’s
> internal typing system, there is potentially an additional benefit for
> static type checkers like MyPy to be able to catch errors, since they are
> only capable of working with PEP 484 annotations.
>
>
> Goals
>
>    -
>
>    Support PEP 484 annotations on the input/output PCollection of the
>    expand(...) method of PTransforms
>    -
>
>    Update the documentation on Ensuring Python Type Safety
>    <https://beam.apache.org/documentation/sdks/python-type-safety/> to
>    reflect these changes
>
>
> Design Details
>
> First, for Beam’s internal typing system to work with PCollections,
> PCollection must be capable of being used as a type. This is currently
> already the case since PCollection inherits from Generic[T].
>
> Second, Beam’s typing system must be able to work with annotations like
> PCollection[T].  This can be done by modelling the approach that DoFn
> takes for analyzing annotations on its’ process(...) method. To
> elaborate, DoFn does this by implementing the default_type_hints(...)
> method from the inherited WithTypeHints mixin class, which both DoFn and
> PTransform inherit from, and using IOTypeHints.from_callable(...) to
> convert the annotations on process(...) into an IOTypeHints object that
> the internal Beam typing system can then use. A similar method can be used
> by PTransform on its’ expand(...) method, however the outer PCollection
> would need to be stripped from the input and output types before the
> annotations can be compatible with the rest of the Beam typing system.
>
> There are some additional considerations to keep in mind however.
>
> First, PBegin[T] and PDone[T] must also be supported as valid outer types
> for PTransform annotations. This requires both classes to inherit from
> Generic[T], which they do not currently.
>
> Second, PCollection by itself, without any internal type hint T, must be
> supported. In this case, T will be presumed to be of type any.
>
> This is currently the only design approach under consideration.
> Estimate of Work
>
> The majority of the work is already complete, and available to view in
> this PR <https://github.com/apache/beam/pull/12009>.
>
>
> There is still some work TBD - more tests need to be added in order to
> confirm edge-case scenarios work properly. These include:
>
>    -
>
>    Tests for PBegin/PDone annotations
>    -
>
>    Tests for PCollection annotations with multiple internal types
>    -
>
>    Tests for PCollection annotations with a nested PCollection
>
>
> Additionally, more error checking is needed throughout the PR.
>
> Lastly, the website documentation needs to be updated.
>
> With the holidays coming up this week (July 4 weekend), this work can be
> estimated to be complete around July 8. This is not including the time it
> takes to run against internal tests in Google, some of which may or may not
> be broken due to this change, therefore the time-to-merge will likely be
> longer (possibly by a week or two).
>
>