Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2019/07/31 21:50:44 UTC

[PROPOSAL] An initial Schema API in Python

tl;dr: I have a PR at [1] that defines an initial Schema API in python
based on the typing module, and uses typing.NamedTuple to represent a
Schema. There are some risks with that approach but I propose we move
forward with it as a first draft and iterate.


I've opened up a PR [1] that implements RowCoder in the Python SDK and
verifies its compatibility with the Java implementation via tests in
standard_coders.yaml. A lot of miscellaneous changes are required to get to
that point, including a pretty significant one: providing some native
python representation for schemas.

As discussed in the PR description I opted to fully embrace the typing
module for the native representation of schema types:
- Primitive types all map to numpy types (e.g. np.int16, np.unicode).
- Arrays map to typing.List. In https://s.apache.org/beam-schemas we
settled on typing.Collection, but unfortunately this doesn't seem to be
supported in python 2. I'm open to other suggestions here.
- Map maps to typing.Mapping.
- Rows map to typing.NamedTuple.
- nullability is indicated with typing.Optional. Note there's no
distinction between Optional[Optional[T]] and Optional[T] in typing, both
map to Union[T, None] - so this is actually a good analog for the nullable
flag on FieldType in schema.proto.

With this approach a schema in Python might look like:
```
class Movie(NamedTuple):
  name: np.unicode
  year: Optional[np.int16]

# The class/type annotation syntax doesn't work in Python 2. Instead you
# can use:
# Movie = NamedTuple('Movie', [('name', np.unicode),
#                              ('year', Optional[np.int16])])

# DoFns annotated with_output_types(Movie) will use RowCoder
coders.registry.register_coder(Movie, coders.RowCoder)
```
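
For context, here is a minimal usage sketch (hypothetical pipeline and data,
building on the Movie class and registration above) of how a transform would
opt into RowCoder via its output type hint:
```
import apache_beam as beam

# Assumes the Movie NamedTuple and RowCoder registration shown above.
with beam.Pipeline() as p:
  movies = (
      p
      | beam.Create([(u'Citizen Kane', 1941), (u'Vertigo', 1958)])
      # The output type hint is what tells Beam to use RowCoder here.
      | beam.Map(lambda kv: Movie(name=kv[0], year=kv[1]))
          .with_output_types(Movie))
```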

I think the choice to use typing.NamedTuple as a row type is potentially
controversial - Udi, Robert Bradshaw and I were already discussing it a bit
in a comment on the portable schemas doc [2], but I wanted to bring that
discussion to the ML.

On the pro side:
+ NamedTuple is a pretty great analog for Java's Row type [3]. Both store
attributes internally as an ordered collection (List<Object> in Row, a
tuple in NamedTuple) and provide shortcuts for accessing those attributes
by field name based on the schema.
+  NamedTuple is a native type, and we're trying to get out of the business
of defining our own type hints (I think).

On the con side:
- When using the class-based version of NamedTuple in python 3 a user might
be tempted to add more functionality to their class (for example, define a
method) rather than just defining a schema - but I'm not sure we're
prepared to guarantee that we will always produce an instance of their
class, just something that has the defined attributes. This concern can
potentially be alleviated once we have support for logical types.

Unless there are any objections I think it would make sense to start with
this implementation (documenting the limitations), and then iterate on it.
Please take a look at the PR [1] and let me know what you think about this
proposal.

Thanks,
Brian

[1] https://github.com/apache/beam/pull/9188
[2]
https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
[3]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java

Re: [PROPOSAL] An initial Schema API in Python

Posted by Ahmet Altay <al...@google.com>.
On Tue, Aug 20, 2019 at 3:48 PM Brian Hulette <bh...@google.com> wrote:

>
>
> On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
>> >
>> >
>> >
>> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bh...@google.com>
>> wrote:
>> >>
>> >>
>> >>
>> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <ch...@gmail.com>
>> wrote:
>> >>>>
>> >>>> >> Agreed on float since it seems to trivially map to a double, but
>> I’m torn on int still. While I do want int type hints to work, it doesn’t
>> seem appropriate to map it to AtomicType.INT64, since it has a completely
>> different range of values.
>> >>>> >>
>> >>>> >> Let’s say we used native int for the runtime field type, not just
>> as a schema declaration for numpy.int64. What is the real world fallout
>> from this? Would there be data loss?
>> >>>> >
>> >>>> > I'm not sure I follow the question exactly, what is the interplay
>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>> is used in the schema declaration, but we just use native int at runtime,
>> and check the bit width when encoding?
>> >>>> >
>> >>>> > In any case, I don't think the real world fallout of using int is
>> nearly that dire. I suppose data loss is possible if a poorly designed
>> pipeline overflows an int64 and crashes,
>> >>>>
>> >>>> The primary risk is that it *won't* crash when overflowing an int64,
>> >>>> it'll just silently give the wrong answer. That's much less safe than
>> >>>> using a native int and then actually crashing in the case it's too
>> >>>> large at the point one tries to encode it.
>> >>>
>> >>>
>> >>> If the behavior of numpy.int64 is less safe than int, and both
>> support 64-bit integers, and int is the more intuitive type to use, then
>> that seems to make a strong case for using int rather than numpy.int64.
>> >>>
>> >>
>> >> I'm not sure we established numpy.int64 is less safe, just that a
>> silent overflow is a risk.
>>
>> Silent overflows are inherently less safe, especially for a language
>> where users in general never have to deal with this.
>>
>
> Absolutely agree that silent overflows are unsafe! I was just trying to
> point out that numpy isn't strictly silent. But as you point out below it's
> irrelevant because the runtime type is still int.
>
>
>> >> By default numpy will just log a warning when an overflow occurs, so
>> it's not totally silent, but definitely risky. numpy can however be made to
>> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>>
>> Warning logs on remote machines are unlikely to ever be seen. Even if
>> one knew about the numpy setting (keep in mind the user may not ever
>> directly use or import numpy), it doesn't seem to work (and one would
>> have to set it on the remote workers, or propagate this setting if set
>> in the main program).
>>
>> In [1]: import numpy as np
>> In [2]: np.seterr(over='raise')  # returns previous value
>> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
>> 'ignore'}
>> In [3]: np.int64(2**36) * np.int64(2**36)
>> Out[3]: 0
>>
>>
> That's odd.. I ran the same test (Python 2.7, numpy 1.16) and it worked
> for me:
>
> In [4]: import numpy as np
>
> In [5]: np.int64(2**36) * np.int64(2**36)
> /usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
> RuntimeWarning: overflow encountered in long_scalars
>
>
>
> #!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
> Out[5]: 0
>
> In [6]: np.seterr(over='raise')
> Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
>
> In [7]: np.int64(2**36) * np.int64(2**36)
> ---------------------------------------------------------------------------
> FloatingPointError                        Traceback (most recent call last)
> <ipython-input-7-962da6705127> in <module>()
> ----> 1 np.int64(2**36) * np.int64(2**36)
>
> FloatingPointError: overflow encountered in long_scalars
>
>
>
>> >> Regardless of what type is used in the typing representation of a
>> schema, we've established that RowCoder.encode should accept anything
>> convertible to an int for integer fields. So it will need to check its
>> width and raise an error if it's too large.
>> >> I added some tests last week to ensure that RowCoder does this [1].
>> However they're currently skipped because I'm unsure of the proper place to
>> raise the error. I wrote up the details in a comment [2] (sorry I did a
>> force push so the comment doesn't show up in the appropriate place).
>> >>
>> >> Note that when decoding an INT32/64 field RowCoder still produces
>> plain old ints (since it relies on VarIntCoder), so int really is the
>> runtime type, and the numpy types are just for the typing representation of
>> a schema.
>> >>
>> >> I also updated my PR to accept int, float, and str in the typing
>> representation of a schema, and added the following summary of type
>> mappings to typehints.schema [1], since it's not readily apparent from the
>> code itself:
>> >
>> >
>> > Cool!
>> >
>> >>
>> >>
>> >> Python              Schema
>> >> np.int8     <-----> BYTE
>> >> np.int16    <-----> INT16
>> >> np.int32    <-----> INT32
>> >> np.int64    <-----> INT64
>> >> int         ---/
>> >> np.float32  <-----> FLOAT
>> >> np.float64  <-----> DOUBLE
>> >> float       ---/
>> >> bool        <-----> BOOLEAN
>> >> The mappings for STRING and BYTES are different between python 2 and
>> python 3,
>> >> because of the changes to str:
>> >> py3:
>> >> str/unicode <-----> STRING
>> >> bytes       <-----> BYTES
>> >> ByteString  ---/
>> >> py2:
>> >> unicode     <-----> STRING
>> >> str/bytes   ---/
>> >> ByteString  <-----> BYTES
>> >>
>> >> As you can see, int and float typings can now be used to create a
>> schema with an INT64 or DOUBLE attribute, but when creating an anonymous
>> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
>> that approach, if only for symmetry with the other integer and floating
>> point types, but I can change it to prefer int/float if I'm the only one
>> that feels that way.
>>
>> Just to be clear, this is just talking about the schema itself (as at
>> that level, due to the many-to-one mapping above, no distinction is
>> made between int vs. int64). The runtime types are still int/float,
>> right?
>>
>
> Correct, the runtime type is still int. I'm only using the numpy types in
> the typing representation of a schema, so that we have some way to
> distinguish between the different integer/float bit widths. I chose numpy
> types because they are actual types (i.e. type(np.int64) == type) and thus
> are compatible with typing, unlike pyarrow types.
>

This make sense. Thank you for explaining.


>
>
>>
>> > Just an opinion: As a user I would expect anonymous types created for
>> me to have native python types. I do not have data on what would be the
>> expectations of users in general.
>>
>> I think if the schema declares a field to be int64, but the runtime
>> type of the values is actually int, that's probably OK.
>>
>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
On Tue, Aug 20, 2019 at 1:41 PM Robert Bradshaw <ro...@google.com> wrote:

> On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
> >
> >
> >
> > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bh...@google.com>
> wrote:
> >>
> >>
> >>
> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <ch...@gmail.com>
> wrote:
> >>>>
> >>>> >> Agreed on float since it seems to trivially map to a double, but
> I’m torn on int still. While I do want int type hints to work, it doesn’t
> seem appropriate to map it to AtomicType.INT64, since it has a completely
> different range of values.
> >>>> >>
> >>>> >> Let’s say we used native int for the runtime field type, not just
> as a schema declaration for numpy.int64. What is the real world fallout
> from this? Would there be data loss?
> >>>> >
> >>>> > I'm not sure I follow the question exactly, what is the interplay
> between int and numpy.int64 in this scenario? Are you saying that np.int64
> is used in the schema declaration, but we just use native int at runtime,
> and check the bit width when encoding?
> >>>> >
> >>>> > In any case, I don't think the real world fallout of using int is
> nearly that dire. I suppose data loss is possible if a poorly designed
> pipeline overflows an int64 and crashes,
> >>>>
> >>>> The primary risk is that it *won't* crash when overflowing an int64,
> >>>> it'll just silently give the wrong answer. That's much less safe than
> >>>> using a native int and then actually crashing in the case it's too
> >>>> large at the point one tries to encode it.
> >>>
> >>>
> >>> If the behavior of numpy.int64 is less safe than int, and both support
> 64-bit integers, and int is the more intuitive type to use, then that seems
> to make a strong case for using int rather than numpy.int64.
> >>>
> >>
> >> I'm not sure we established numpy.int64 is less safe, just that a
> silent overflow is a risk.
>
> Silent overflows are inherently less safe, especially for a language
> where users in general never have to deal with this.
>

Absolutely agree that silent overflows are unsafe! I was just trying to
point out that numpy isn't strictly silent. But as you point out below it's
irrelevant because the runtime type is still int.


> >> By default numpy will just log a warning when an overflow occurs, so
> it's not totally silent, but definitely risky. numpy can however be made to
> throw an exception when an overflow occurs with `np.seterr(over='raise')`.
>
> Warning logs on remote machines are unlikely to ever be seen. Even if
> one knew about the numpy setting (keep in mind the user may not ever
> directly use or import numpy), it doesn't seem to work (and one would
> have to set it on the remote workers, or propagate this setting if set
> in the main program).
>
> In [1]: import numpy as np
> In [2]: np.seterr(over='raise')  # returns previous value
> Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
> 'ignore'}
> In [3]: np.int64(2**36) * np.int64(2**36)
> Out[3]: 0
>
>
That's odd.. I ran the same test (Python 2.7, numpy 1.16) and it worked for
me:

In [4]: import numpy as np

In [5]: np.int64(2**36) * np.int64(2**36)
/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/ipython:1:
RuntimeWarning: overflow encountered in long_scalars



#!/usr/local/google/home/bhulette/working_dir/beam/sdks/python/venv/bin/python
Out[5]: 0

In [6]: np.seterr(over='raise')
Out[6]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under':
'ignore'}

In [7]: np.int64(2**36) * np.int64(2**36)
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<ipython-input-7-962da6705127> in <module>()
----> 1 np.int64(2**36) * np.int64(2**36)

FloatingPointError: overflow encountered in long_scalars



> >> Regardless of what type is used in the typing representation of a
> schema, we've established that RowCoder.encode should accept anything
> convertible to an int for integer fields. So it will need to check its
> width and raise an error if it's too large.
> >> I added some tests last week to ensure that RowCoder does this [1].
> However they're currently skipped because I'm unsure of the proper place to
> raise the error. I wrote up the details in a comment [2] (sorry I did a
> force push so the comment doesn't show up in the appropriate place).
> >>
> >> Note that when decoding an INT32/64 field RowCoder still produces plain
> old ints (since it relies on VarIntCoder), so int really is the runtime
> type, and the numpy types are just for the typing representation of a
> schema.
> >>
> >> I also updated my PR to accept int, float, and str in the typing
> representation of a schema, and added the following summary of type
> mappings to typehints.schema [1], since it's not readily apparent from the
> code itself:
> >
> >
> > Cool!
> >
> >>
> >>
> >> Python              Schema
> >> np.int8     <-----> BYTE
> >> np.int16    <-----> INT16
> >> np.int32    <-----> INT32
> >> np.int64    <-----> INT64
> >> int         ---/
> >> np.float32  <-----> FLOAT
> >> np.float64  <-----> DOUBLE
> >> float       ---/
> >> bool        <-----> BOOLEAN
> >> The mappings for STRING and BYTES are different between python 2 and
> python 3,
> >> because of the changes to str:
> >> py3:
> >> str/unicode <-----> STRING
> >> bytes       <-----> BYTES
> >> ByteString  ---/
> >> py2:
> >> unicode     <-----> STRING
> >> str/bytes   ---/
> >> ByteString  <-----> BYTES
> >>
> >> As you can see, int and float typings can now be used to create a
> schema with an INT64 or DOUBLE attribute, but when creating an anonymous
> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
> that approach, if only for symmetry with the other integer and floating
> point types, but I can change it to prefer int/float if I'm the only one
> that feels that way.
>
> Just to be clear, this is just talking about the schema itself (as at
> that level, due to the many-to-one mapping above, no distinction is
> made between int vs. int64). The runtime types are still int/float,
> right?
>

Correct, the runtime type is still int. I'm only using the numpy types in
the typing representation of a schema, so that we have some way to
distinguish between the different integer/float bit widths. I chose numpy
types because they are actual types (i.e. type(np.int64) == type) and thus
are compatible with typing, unlike pyarrow types.


>
> > Just an opinion: As a user I would expect anonymous types created for me
> to have native python types. I do not have data on what would be the
> expectations of users in general.
>
> I think if the schema declares a field to be int64, but the runtime
> type of the values is actually int, that's probably OK.
>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Robert Bradshaw <ro...@google.com>.
On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay <al...@google.com> wrote:
>
>
>
> On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bh...@google.com> wrote:
>>
>>
>>
>> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <ch...@gmail.com> wrote:
>>>>
>>>> >> Agreed on float since it seems to trivially map to a double, but I’m torn on int still. While I do want int type hints to work, it doesn’t seem appropriate to map it to AtomicType.INT64, since it has a completely different range of values.
>>>> >>
>>>> >> Let’s say we used native int for the runtime field type, not just as a schema declaration for numpy.int64. What is the real world fallout from this? Would there be data loss?
>>>> >
>>>> > I'm not sure I follow the question exactly, what is the interplay between int and numpy.int64 in this scenario? Are you saying that np.int64 is used in the schema declaration, but we just use native int at runtime, and check the bit width when encoding?
>>>> >
>>>> > In any case, I don't think the real world fallout of using int is nearly that dire. I suppose data loss is possible if a poorly designed pipeline overflows an int64 and crashes,
>>>>
>>>> The primary risk is that it *won't* crash when overflowing an int64,
>>>> it'll just silently give the wrong answer. That's much less safe than
>>>> using a native int and then actually crashing in the case it's too
>>>> large at the point one tries to encode it.
>>>
>>>
>>> If the behavior of numpy.int64 is less safe than int, and both support 64-bit integers, and int is the more intuitive type to use, then that seems to make a strong case for using int rather than numpy.int64.
>>>
>>
>> I'm not sure we established numpy.int64 is less safe, just that a silent overflow is a risk.

Silent overflows are inherently less safe, especially for a language
where users in general never have to deal with this.

>> By default numpy will just log a warning when an overflow occurs, so it's not totally silent, but definitely risky. numpy can however be made to throw an exception when an overflow occurs with `np.seterr(over='raise')`.

Warning logs on remote machines are unlikely to ever be seen. Even if
one knew about the numpy setting (keep in mind the user may not ever
directly use or import numpy), it doesn't seem to work (and one would
have to set it on the remote workers, or propagate this setting if set
in the main program).

In [1]: import numpy as np
In [2]: np.seterr(over='raise')  # returns previous value
Out[2]: {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
In [3]: np.int64(2**36) * np.int64(2**36)
Out[3]: 0

>> Regardless of what type is used in the typing representation of a schema, we've established that RowCoder.encode should accept anything convertible to an int for integer fields. So it will need to check its width and raise an error if it's too large.
>> I added some tests last week to ensure that RowCoder does this [1]. However they're currently skipped because I'm unsure of the proper place to raise the error. I wrote up the details in a comment [2] (sorry I did a force push so the comment doesn't show up in the appropriate place).
>>
>> Note that when decoding an INT32/64 field RowCoder still produces plain old ints (since it relies on VarIntCoder), so int really is the runtime type, and the numpy types are just for the typing representation of a schema.
>>
>> I also updated my PR to accept int, float, and str in the typing representation of a schema, and added the following summary of type mappings to typehints.schema [1], since it's not readily apparent from the code itself:
>
>
> Cool!
>
>>
>>
>> Python              Schema
>> np.int8     <-----> BYTE
>> np.int16    <-----> INT16
>> np.int32    <-----> INT32
>> np.int64    <-----> INT64
>> int         ---/
>> np.float32  <-----> FLOAT
>> np.float64  <-----> DOUBLE
>> float       ---/
>> bool        <-----> BOOLEAN
>> The mappings for STRING and BYTES are different between python 2 and python 3,
>> because of the changes to str:
>> py3:
>> str/unicode <-----> STRING
>> bytes       <-----> BYTES
>> ByteString  ---/
>> py2:
>> unicode     <-----> STRING
>> str/bytes   ---/
>> ByteString  <-----> BYTES
>>
>> As you can see, int and float typings can now be used to create a schema with an INT64 or DOUBLE attribute, but when creating an anonymous NamedTuple sub-class from a schema, the numpy types are preferred. I prefer that approach, if only for symmetry with the other integer and floating point types, but I can change it to prefer int/float if I'm the only one that feels that way.

Just to be clear, this is just talking about the schema itself (as at
that level, due to the many-to-one mapping above, no distinction is
made between int vs. int64). The runtime types are still int/float,
right?

> Just an opinion: As a user I would expect anonymous types created for me to have native python types. I do not have data on what would be the expectations of users in general.

I think if the schema declares a field to be int64, but the runtime
type of the values is actually int, that's probably OK.

Re: [PROPOSAL] An initial Schema API in Python

Posted by Ahmet Altay <al...@google.com>.
On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette <bh...@google.com> wrote:

>
>
> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <ch...@gmail.com> wrote:
>
>> >> Agreed on float since it seems to trivially map to a double, but I’m
>>> torn on int still. While I do want int type hints to work, it doesn’t seem
>>> appropriate to map it to AtomicType.INT64, since it has a completely
>>> different range of values.
>>> >>
>>> >> Let’s say we used native int for the runtime field type, not just as
>>> a schema declaration for numpy.int64. What is the real world fallout from
>>> this? Would there be data loss?
>>> >
>>> > I'm not sure I follow the question exactly, what is the interplay
>>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>>> is used in the schema declaration, but we just use native int at runtime,
>>> and check the bit width when encoding?
>>> >
>>> > In any case, I don't think the real world fallout of using int is
>>> nearly that dire. I suppose data loss is possible if a poorly designed
>>> pipeline overflows an int64 and crashes,
>>>
>>> The primary risk is that it *won't* crash when overflowing an int64,
>>> it'll just silently give the wrong answer. That's much less safe than
>>> using a native int and then actually crashing in the case it's too
>>> large at the point one tries to encode it.
>>>
>>
>> If the behavior of numpy.int64 is less safe than int, and both support
>> 64-bit integers, and int is the more intuitive type to use, then that seems
>> to make a strong case for using int rather than numpy.int64.
>>
>>
> I'm not sure we established numpy.int64 is less safe, just that a silent
> overflow is a risk. By default numpy will just log a warning when an
> overflow occurs, so it's not totally silent, but definitely risky. numpy
> can however be made to throw an exception when an overflow occurs with
> `np.seterr(over='raise')`.
>
> Regardless of what type is used in the typing representation of a schema,
> we've established that RowCoder.encode should accept anything convertible
> to an int for integer fields. So it will need to check its width and raise
> an error if it's too large.
> I added some tests last week to ensure that RowCoder does this [1].
> However they're currently skipped because I'm unsure of the proper place to
> raise the error. I wrote up the details in a comment [2] (sorry I did a
> force push so the comment doesn't show up in the appropriate place).
>
> Note that when decoding an INT32/64 field RowCoder still produces plain
> old ints (since it relies on VarIntCoder), so int really is the runtime
> type, and the numpy types are just for the typing representation of a
> schema.
>
> I also updated my PR to accept int, float, and str in the typing
> representation of a schema, and added the following summary of type
> mappings to typehints.schema [1], since it's not readily apparent from the
> code itself:
>

Cool!


>
> Python              Schema
> np.int8     <-----> BYTE
> np.int16    <-----> INT16
> np.int32    <-----> INT32
> np.int64    <-----> INT64
> int         ---/
> np.float32  <-----> FLOAT
> np.float64  <-----> DOUBLE
> float       ---/
> bool        <-----> BOOLEAN
> The mappings for STRING and BYTES are different between python 2 and
> python 3,
> because of the changes to str:
> py3:
> str/unicode <-----> STRING
> bytes       <-----> BYTES
> ByteString  ---/
> py2:
> unicode     <-----> STRING
> str/bytes   ---/
> ByteString  <-----> BYTES
>
> As you can see, int and float typings can now be used to create a schema
> with an INT64 or DOUBLE attribute, but when creating an anonymous
> NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
> that approach, if only for symmetry with the other integer and floating
> point types, but I can change it to prefer int/float if I'm the only one
> that feels that way.
>

Just an opinion: As a user I would expect anonymous types created for me to
have native python types. I do not have data on what would be the
expectations of users in general.


>
> Brian
>
> [1]
> https://github.com/apache/beam/pull/9188/files#diff-94d5ea3d2d121722c91b220a353490e2R88
> [2] https://github.com/apache/beam/pull/9188#discussion_r312682478
> [3]
> https://github.com/apache/beam/blob/25dcc50a8de9c607069a8efc80a6da67a6e8b0ca/sdks/python/apache_beam/typehints/schemas.py#L20
>
>
>> -chad
>>
>>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova <ch...@gmail.com> wrote:

> >> Agreed on float since it seems to trivially map to a double, but I’m
>> torn on int still. While I do want int type hints to work, it doesn’t seem
>> appropriate to map it to AtomicType.INT64, since it has a completely
>> different range of values.
>> >>
>> >> Let’s say we used native int for the runtime field type, not just as a
>> schema declaration for numpy.int64. What is the real world fallout from
>> this? Would there be data loss?
>> >
>> > I'm not sure I follow the question exactly, what is the interplay
>> between int and numpy.int64 in this scenario? Are you saying that np.int64
>> is used in the schema declaration, but we just use native int at runtime,
>> and check the bit width when encoding?
>> >
>> > In any case, I don't think the real world fallout of using int is
>> nearly that dire. I suppose data loss is possible if a poorly designed
>> pipeline overflows an int64 and crashes,
>>
>> The primary risk is that it *won't* crash when overflowing an int64,
>> it'll just silently give the wrong answer. That's much less safe than
>> using a native int and then actually crashing in the case it's too
>> large at the point one tries to encode it.
>>
>
> If the behavior of numpy.int64 is less safe than int, and both support
> 64-bit integers, and int is the more intuitive type to use, then that seems
> to make a strong case for using int rather than numpy.int64.
>
>
I'm not sure we established numpy.int64 is less safe, just that a silent
overflow is a risk. By default numpy will just log a warning when an
overflow occurs, so it's not totally silent, but definitely risky. numpy
can however be made to throw an exception when an overflow occurs with
`np.seterr(over='raise')`.

Regardless of what type is used in the typing representation of a schema,
we've established that RowCoder.encode should accept anything convertible
to an int for integer fields. So it will need to check its width and raise
an error if it's too large.
I added some tests last week to ensure that RowCoder does this [1]. However
they're currently skipped because I'm unsure of the proper place to raise
the error. I wrote up the details in a comment [2] (sorry I did a force
push so the comment doesn't show up in the appropriate place).
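
To make that concrete, here is a rough sketch of the kind of width check
described above (hypothetical helper, not the actual RowCoder code; where
exactly it should live is the open question in [2]):

def _check_int_field(value, bits):
    # Accept anything convertible to int (int, np.int64, ...), then reject
    # values that don't fit in a signed `bits`-wide schema field.
    as_int = int(value)
    if not -(2 ** (bits - 1)) <= as_int < 2 ** (bits - 1):
        raise OverflowError(
            '%r does not fit in a signed %d-bit field' % (value, bits))
    return as_int

_check_int_field(107, 16)      # -> 107
_check_int_field(2 ** 36, 32)  # raises OverflowError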

Note that when decoding an INT32/64 field RowCoder still produces plain old
ints (since it relies on VarIntCoder), so int really is the runtime type,
and the numpy types are just for the typing representation of a schema.

I also updated my PR to accept int, float, and str in the typing
representation of a schema, and added the following summary of type
mappings to typehints.schema [1], since it's not readily apparent from the
code itself:

Python              Schema
np.int8     <-----> BYTE
np.int16    <-----> INT16
np.int32    <-----> INT32
np.int64    <-----> INT64
int         ---/
np.float32  <-----> FLOAT
np.float64  <-----> DOUBLE
float       ---/
bool        <-----> BOOLEAN
The mappings for STRING and BYTES are different between python 2 and python
3,
because of the changes to str:
py3:
str/unicode <-----> STRING
bytes       <-----> BYTES
ByteString  ---/
py2:
unicode     <-----> STRING
str/bytes   ---/
ByteString  <-----> BYTES

As you can see, int and float typings can now be used to create a schema
with an INT64 or DOUBLE attribute, but when creating an anonymous
NamedTuple sub-class from a schema, the numpy types are preferred. I prefer
that approach, if only for symmetry with the other integer and floating
point types, but I can change it to prefer int/float if I'm the only one
that feels that way.
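
As a concrete illustration of that many-to-one mapping (schema declarations
only, not the PR's actual conversion code):

import typing
import numpy as np

# Both declarations describe the same schema: a single INT64 field.
MovieA = typing.NamedTuple('MovieA', [('year', int)])       # int      ---> INT64
MovieB = typing.NamedTuple('MovieB', [('year', np.int64)])  # np.int64 <--> INT64
# Converting that schema back into an anonymous NamedTuple would annotate
# 'year' as np.int64, since the reverse mapping prefers the numpy types.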

Brian

[1]
https://github.com/apache/beam/pull/9188/files#diff-94d5ea3d2d121722c91b220a353490e2R88
[2] https://github.com/apache/beam/pull/9188#discussion_r312682478
[3]
https://github.com/apache/beam/blob/25dcc50a8de9c607069a8efc80a6da67a6e8b0ca/sdks/python/apache_beam/typehints/schemas.py#L20


> -chad
>
>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Chad Dombrova <ch...@gmail.com>.
>
> >> Agreed on float since it seems to trivially map to a double, but I’m
> torn on int still. While I do want int type hints to work, it doesn’t seem
> appropriate to map it to AtomicType.INT64, since it has a completely
> different range of values.
> >>
> >> Let’s say we used native int for the runtime field type, not just as a
> schema declaration for numpy.int64. What is the real world fallout from
> this? Would there be data loss?
> >
> > I'm not sure I follow the question exactly, what is the interplay
> between int and numpy.int64 in this scenario? Are you saying that np.int64
> is used in the schema declaration, but we just use native int at runtime,
> and check the bit width when encoding?
> >
> > In any case, I don't think the real world fallout of using int is nearly
> that dire. I suppose data loss is possible if a poorly designed pipeline
> overflows an int64 and crashes,
>
> The primary risk is that it *won't* crash when overflowing an int64,
> it'll just silently give the wrong answer. That's much less safe than
> using a native int and then actually crashing in the case it's too
> large at the point one tries to encode it.
>

If the behavior of numpy.int64 is less safe than int, and both support
64-bit integers, and int is the more intuitive type to use, then that seems
to make a strong case for using int rather than numpy.int64.

-chad

Re: [PROPOSAL] An initial Schema API in Python

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Aug 7, 2019 at 11:12 PM Brian Hulette <bh...@google.com> wrote:
>
> Thanks for all the suggestions, I've added responses inline.
>
> On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova <ch...@gmail.com> wrote:
>>
>> There’s a lot of ground to cover here, so I’m going to pull from a few different responses.
>>
>> ________________________________
>>
>> numpy ints
>>
>> A properly written library should accept any type implementing the __int__ (or __index__) methods in place of an int, rather than doing explicit type checks
>>
>> Yes, but the reality is that very very few actually do this, including Beam itself (check the code for Timestamp and Duration, to name a few).
>>
>> Which brings me to my next topic:
>>
>> I tested this out with mypy and it would not be compatible:
>>
>> def square(x: int):
>>     return x*x
>>
>> square(np.int16(32)) # mypy error
>>
>> The proper way to check this scenario is using typing.SupportsInt. Note that this only guarantees that __int__ exists, so you still need to cast to int if you want to do anything with the object:
>>
>> def square(x: typing.SupportsInt) -> int:
>>     if not isinstance(x, int):
>>         x = int(x)
>>     return x*x
>>
>> square('foo')  # error!
>> square(1.2)  # ok
>
>  Yep I came across this while writing my last reply. I agree though it seems unlikely that many libraries actually do this.
>
>> ________________________________
>>
>> Native python ints
>>
>> Agreed on float since it seems to trivially map to a double, but I’m torn on int still. While I do want int type hints to work, it doesn’t seem appropriate to map it to AtomicType.INT64, since it has a completely different range of values.
>>
>> Let’s say we used native int for the runtime field type, not just as a schema declaration for numpy.int64. What is the real world fallout from this? Would there be data loss?
>
> I'm not sure I follow the question exactly, what is the interplay between int and numpy.int64 in this scenario? Are you saying that np.int64 is used in the schema declaration, but we just use native int at runtime, and check the bit width when encoding?
>
> In any case, I don't think the real world fallout of using int is nearly that dire. I suppose data loss is possible if a poorly designed pipeline overflows an int64 and crashes,

The primary risk is that it *won't* crash when overflowing an int64,
it'll just silently give the wrong answer. That's much less safe than
using a native int and then actually crashing in the case it's too
large at the point one tries to encode it.

> but that's possible whether we use int or np.int64 at runtime. I'm just saying that a user could be forgiven for thinking that they're safe from overflows if they declare a schema as NamedTuple('Foo', [('to_infinity_and_beyond', int)]), but they shouldn't make the same mistake when they explicitly call it an int64.

Yes. But for schemas to be maximally useful, we'll want to be able to
infer them from all sorts of things that aren't written with Beam in
mind (e.g. external data classes, function annotations) and rejecting
the builtin int type will be a poor user experience here.

>> ________________________________
>>
>> Python3-only
>>
>> No need to worry about 2/3 compatibility for strings, we could just use str
>>
>> This is already widely handled throughout the Beam python SDK using the future/past library, so it seems silly to give up on this solution for schemas.
>>
>> On this topic, I added some comments to the PR about using past.builtins.unicode instead of numpy.unicode. They’re the same type, but there’s no reason to get this via numpy, when everywhere else in the code gets it from past.
>>
>> We could just use bytes for byte arrays (as a shorthand for typing.ByteString [1])
>>
>> Neat, but in my obviously very biased opinion it is not worth cutting off python2 users over this.
>
> Ok I won't do this :) I wasn't aware of typing.Sequence, that does seem like a good fit. The other two items are just nice-to-haves, I'm happy to work around those and use Sequence for arrays instead.

I would imagine that we could accept bytes or typing.ByteString for
BYTES, with only Python 2 users having to do the latter. (In both
Python 2 and Python 3 one would use str for STRING; it would decode to
past.builtins.unicode. This seems to capture the intent better than
mapping str to BYTES in Python 2 only.)

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
Thanks for all the suggestions, I've added responses inline.

On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova <ch...@gmail.com> wrote:

> There’s a lot of ground to cover here, so I’m going to pull from a few
> different responses.
> ------------------------------
>
> numpy ints
>
> A properly written library should accept any type implementing the __int__
> (or __index__) methods in place of an int, rather than doing explicit type
> checks
>
> Yes, but the reality is that very very few actually do this, including
> Beam itself (check the code for Timestamp and Duration, to name a few).
>
> Which brings me to my next topic:
>
> I tested this out with mypy and it would not be compatible:
>
> def square(x: int):
>     return x*x
>
> square(np.int16(32)) # mypy error
>
> The proper way to check this scenario is using typing.SupportsInt. Note
> that this *only* guarantees that __int__ exists, so you still need to
> cast to int if you want to do anything with the object:
>
> def square(x: typing.SupportsInt) -> int:
>     if not isinstance(x, int):
>         x = int(x)
>     return x*x
>
> square('foo')  # error!
> square(1.2)  # ok
>
>  Yep I came across this while writing my last reply. I agree though it
seems unlikely that many libraries actually do this.

------------------------------
>
> Native python ints
>
> Agreed on float since it seems to trivially map to a double, but I’m torn
> on int still. While I do want int type hints to work, it doesn’t seem
> appropriate to map it to AtomicType.INT64, since it has a completely
> different range of values.
>
> Let’s say we used native int for the runtime field type, not just as a
> schema declaration for numpy.int64. What is the real world fallout from
> this? Would there be data loss?
>
I'm not sure I follow the question exactly, what is the interplay between
int and numpy.int64 in this scenario? Are you saying that np.int64 is used
in the schema declaration, but we just use native int at runtime, and check
the bit width when encoding?

In any case, I don't think the real world fallout of using int is nearly
that dire. I suppose data loss is possible if a poorly designed pipeline
overflows an int64 and crashes, but that's possible whether we use int or
np.int64 at runtime. I'm just saying that a user could be forgiven for
thinking that they're safe from overflows if they declare a schema as
NamedTuple('Foo', [('to_infinity_and_beyond', int)]), but they shouldn't
make the same mistake when they explicitly call it an int64.

> ------------------------------
>
> NamedTuples
>
> (2) The mapping of concrete values to Python types. Rows mapping to
> NamedTuples may give expectations beyond the attributes they offer (and I’d
> imagine we’ll want to be flexible with the possible representations here,
> e.g. offering a slice of an arrow record batch). Or do we need to pay the
> cost of re-creating the users NamedTuple subclass.
>
> I was a little surprised to see that the current implementation is
> creating a new class on the fly each time. I would suspect that this would
> be slower than recording the class on registration and looking it up. Why
> not do this? Using the same type will be less confusing and more powerful,
> as custom methods can be added.
>
"each time" just means once per schema when a worker is being initialized.
The classes are registered by UUID when they're created so we can look them
up if the same schema is used again. The issue is that the registry that's
built up at pipeline construction time isn't carried over to the workers,
so we need to re-build it when a worker starts, and the only information we
have at that point is the schema stored as the payload of RowCoder, which
currently has nothing SDK-specific.
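
To illustrate the caching described above, here is a purely hypothetical
sketch (the names are made up; the real implementation is in the PR). It
assumes the field types have already been converted to typing types:

from typing import NamedTuple

_cache = {}  # schema id (a UUID string) -> generated NamedTuple class

def named_tuple_type_from_schema(schema_id, typed_fields):
    # typed_fields: already-converted [(name, typing type), ...] pairs.
    # Build the class once per schema id, then reuse it on later lookups.
    if schema_id not in _cache:
        _cache[schema_id] = NamedTuple(
            'BeamSchema_%s' % schema_id.replace('-', '_'), typed_fields)
    return _cache[schema_id]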

In https://s.apache.org/beam-schemas and in the portable schemas thread
<https://lists.apache.org/thread.html/b9901224b13054c93743cf473262c1ac03411404a69adc4336a59c16@%3Cdev.beam.apache.org%3E>
we
discussed the possibility of including SDK-specific logical types, where
some sort of serialized class to represent the type can be included
per-SDK. That could allow us to actually send the user's type over to the
workers to solve this problem, but that's not in the spec I'm working off
of now, and would take some more design work.

> ------------------------------
>
> Python3-only
>
> Now that I’m actually thinking it through, you’re right there are several
> things that could be cleaner if we make it a python 3 only feature:
>
> Please don’t do this.
>
>
>    - We could use typing.Collection
>
> If you want something a little more relaxed than typing.List why not
> use typing.Sequence? Sequences are expected to be ordered (they
> therefore support __getitem__ and __reversed__) and Collections are not,
> so I think Sequence is a better fit for an array.
>
>
>    - No need to worry about 2/3 compatibility for strings, we could just
>    use str
>
> This is already widely handled throughout the Beam python SDK using the
> future/past library, so it seems silly to give up on this solution for
> schemas.
>
> On this topic, I added some comments to the PR about using
> past.builtins.unicode instead of numpy.unicode. They’re the same type,
> but there’s no reason to get this via numpy, when everywhere else in the
> code gets it from past.
>
>
>    - We could just use bytes for byte arrays (as a shorthand for
>    typing.ByteString [1])
>
> Neat, but in my obviously very biased opinion it is not worth cutting off
> python2 users over this.
>
Ok I won't do this :) I wasn't aware of typing.Sequence, that does seem
like a good fit. The other two items are just nice-to-haves, I'm happy to
work around those and use Sequence for arrays instead.

> That's all for now.
> -chad
>
>
>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Chad Dombrova <ch...@gmail.com>.
There’s a lot of ground to cover here, so I’m going to pull from a few
different responses.
------------------------------

numpy ints

A properly written library should accept any type implementing the __int__
(or __index__) methods in place of an int, rather than doing explicit type
checks

Yes, but the reality is that very very few actually do this, including Beam
itself (check the code for Timestamp and Duration, to name a few).

Which brings me to my next topic:

I tested this out with mypy and it would not be compatible:

def square(x: int):
    return x*x

square(np.int16(32)) # mypy error

The proper way to check this scenario is using typing.SupportsInt. Note
that this *only* guarantees that __int__ exists, so you still need to cast
to int if you want to do anything with the object:

def square(x: typing.SupportsInt) -> int:
    if not isinstance(x, int):
        x = int(x)
    return x*x

square('foo')  # error!
square(1.2)  # ok

------------------------------

Native python ints

Agreed on float since it seems to trivially map to a double, but I’m torn
on int still. While I do want int type hints to work, it doesn’t seem
appropriate to map it to AtomicType.INT64, since it has a completely
different range of values.

Let’s say we used native int for the runtime field type, not just as a
schema declaration for numpy.int64. What is the real world fallout from
this? Would there be data loss?
------------------------------

NamedTuples

(2) The mapping of concrete values to Python types. Rows mapping to
NamedTuples may give expectations beyond the attributes they offer (and I’d
imagine we’ll want to be flexible with the possible representations here,
e.g. offering a slice of an arrow record batch). Or do we need to pay the
cost of re-creating the users NamedTuple subclass.

I was a little surprised to see that the current implementation is creating
a new class on the fly each time. I would suspect that this would be slower
than recording the class on registration and looking it up. Why not do
this? Using the same type will be less confusing and more powerful, as
custom methods can be added.
------------------------------

Python3-only

Now that I’m actually thinking it through, you’re right there are several
things that could be cleaner if we make it a python 3 only feature:

Please don’t do this.


   - We could use typing.Collection

If you want something a little more relaxed than typing.List why not use
typing.Sequence? Sequences are expected to be ordered (they therefore
support __getitem__ and __reversed__) and Collections are not, so I think
Sequence is a better fit for an array.


   - No need to worry about 2/3 compatibility for strings, we could just
   use str

This is already widely handled throughout the Beam python SDK using the
future/past library, so it seems silly to give up on this solution for
schemas.

On this topic, I added some comments to the PR about using
past.builtins.unicode instead of numpy.unicode. They’re the same type, but
there’s no reason to get this via numpy, when everywhere else in the code
gets it from past.


   - We could just use bytes for byte arrays (as a shorthand for
   typing.ByteString [1])

Neat, but in my obviously very biased opinion it is not worth cutting off
python2 users over this.

That's all for now.
-chad

Re: [PROPOSAL] An initial Schema API in Python

Posted by Ahmet Altay <al...@google.com>.
On Wed, Aug 7, 2019 at 10:51 AM Brian Hulette <bh...@google.com> wrote:

> If it is not a big deal supporting both sounds good. I was actually
>> referring to your comment about  typing.Collection not being available on
>> python 2.
>>
>
> Oh, of course, sorry somehow that completely slipped my mind. Now that I'm
> actually thinking it through, you're right there are several things that
> could be cleaner if we make it a python 3 only feature:
> - We could use `typing.Collection`
> - No need to worry about 2/3 compatibility for strings, we could just use
> `str`
> - We could just use `bytes` for byte arrays (as a shorthand for
> `typing.ByteString` [1])
>
> If we do make it a python 3 only feature how would it be enforced? Just by
> documentation, or could I add a version check? I can't find any precedents
> for entire features that reject py2 currently.
>

+Valentyn Tymofieiev <va...@google.com> might have a suggestion. We do
not have a precedent for this. Documentation might be enough. Our intention
was to drop py2 support very soon anyway. (as discussed on some other
thread in the mailing list.)


>
> What happens when one does np.int16(107) + 1, is the numpy type preserved?
>
>
> Yes it looks like once you're in numpy land you stay there, unless you
> explicitly leave with a call to int(). ints in an arithmetic expression
> like np.int16(107) + 1 are implicitly converted to a reasonable type
> (np.int64 if bit_length() < 64, else np.float64).
>
> Another concern I have is, if there is a user function or a library that
>> the user does not control, that uses typing to indicate that a function accepts
>> a type of int, would it be compatible with numpy types?
>
>
> That's a good point. I tested this out with mypy and it would not be
> compatible:
>

Maybe we can find a way to do the conversions automatically for the users.


>
> def square(x: int):
>     return x*x
>
> square(np.int16(32))  # mypy error
>
> Users would have to cast to int to make this work as Chad pointed out. I
> was curious about the actual performance cost of these conversions, so I
> wrote up a script to benchmark them [2]. The results from running on python
> 2.7.16 on my desktop:
>
> pass:   6.117 ns/op
> int to int:     71.524 ns/op
> np.int8 to int: 89.784 ns/op
> int to np.int8: 89.784 ns/op
> np.int8 to np.int8:     89.784 ns/op
> np.int16 to int:        86.715 ns/op
> int to np.int16:        86.715 ns/op
> np.int16 to np.int16:   86.715 ns/op
> np.int32 to int:        89.172 ns/op
> int to np.int32:        89.172 ns/op
> np.int32 to np.int32:   89.172 ns/op
> np.int64 to int:        88.072 ns/op
> int to np.int64:        88.072 ns/op
> np.int64 to np.int64:   88.072 ns/op
>
> It's interesting to note you don't pay much of a penalty for converting
> to/from numpy types over just casting something that's already an int.
>
> [1] https://docs.python.org/3/library/typing.html#typing.ByteString
> [2] https://gist.github.com/TheNeuralBit/158dff3fa90dc46a369bb014e913650d
>
> On Fri, Aug 2, 2019 at 6:43 PM Ahmet Altay <al...@google.com> wrote:
>
>> To clarify, I am happy to start with implementation and iterating on it.
>> I do not want to block this late into the discussion.
>>
>> On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <bh...@google.com> wrote:
>>
>>> I meant "or sub-class it and define fields with type annotations" not
>>> "with attributes". I believe that version doesn't work in python 2 since it
>>> doesn't support the `name: type` syntax.
>>>
>>
>> If it is not a big deal supporting both sounds good. I was actually
>> referring to your comment about  typing.Collection not being available on
>> python 2.
>>
>>
>>>
>>> On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bh...@google.com>
>>> wrote:
>>>
>>>> > Do we need to support python 2? If supporting python 2 will
>>>> complicate things, we could make this a python3 only feature.
>>>> I don't think supporting python 2 complicates things. It's just that
>>>> there are two different ways to use typing.NamedTuple in python 3 - you can
>>>> either instantiate it and provide a list of (name, type) pairs, or
>>>> sub-class it and define fields with attributes. But in python 2 only the
>>>> former works.
>>>>
>>>> > Why are we mapping to numpy types? Design document suggests mapping
>>>> to python native types as the plan.
>>>> We did discuss using numpy types in a comment [1], but you're right we
>>>> never resolved it and the doc still lists native types. My biggest concern
>>>> with just using native int/float types is I think we definitely need *some*
>>>> way to distinguish between the schema proto's various int/float sizes in
>>>> the python representation. If we don't we would need to either a) reject
>>>> schemas that contain any size other than the one that we support, or b) no
>>>> longer have a bijective mapping between proto and python (i.e. any integer
>>>> type that passes through the Python SDK would get converted to an int64).
>>>> And if we do need some way to distinguish between the integer types, I
>>>> thought a de facto standard was better than creating our own - as Robert
>>>> noted in that comment thread "The only strong opinion I have is that we
>>>> shouldn't invent our own."
>>>>
>>>> As I was experimenting with different approaches I also discovered the
>>>> numpy numeric types are very nice because you can instantiate them and they
>>>> look just like ints, for example `np.int16(107) == 107` evaluates to true
>>>> even though `type(np.int16(107)) == type(107)` does not.
>>>>
>>>> Another concern with python's int type is that it supports unlimited
>>>> precision [2], so it's really not a good type to use for any of the schema
>>>> ints. My PR as it is right now actually doesn't even support int. I
>>>> probably should at least make a change to accept int as a type
>>>> specification for int64 but throw an error when encoding if an int is too
>>>> big.
>>>>
>>>
>> I agree with Robert's position of not inventing our own. I assume we
>> could make a decision between python native types, arrow types, and numpy
>> types.
>>
>> What happens when one does np.int16(107) + 1, is the numpy type preserved?
>>
>>  Another concern I have is, if there is a user function or a library that
>> the user does not control, that uses typing to indicate that a function accepts
>> a type of int, would it be compatible with numpy types?
>>
>>
>>>
>>>> [1]
>>>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
>>>> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>>>>
>>>> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote:
>>>> >
>>>> > Thank you Brian.
>>>> >
>>>> > I did not spend enough time yet to review. Some early questions, I
>>>> apologize if I missed an earlier discussion.
>>>> > - Do we need to support python 2? If supporting python 2 will
>>>> complicate things, we could make this a python3 only feature.
>>>> > - Why are we mapping to numpy types? Design document suggests mapping
>>>> to python native types as the plan.
>>>> >
>>>> > On Wed, Jul 31, 2019 at 2:51 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>> >>
>>>> >> tl;dr: I have a PR at [1] that defines an initial Schema API in
>>>> python based on the typing module, and uses typing.NamedTuple to represent
>>>> a Schema. There are some risks with that approach but I propose we move
>>>> forward with it as a first draft and iterate.
>>>> >>
>>>> >>
>>>> >> I've opened up a PR [1] that implements RowCoder in the Python SDK
>>>> and verifies its compatibility with the Java implementation via tests in
>>>> standard_coders.yaml. A lot of miscellaneous changes are required to get
>>>> to that point, including a pretty significant one: providing some native
>>>> python representation for schemas.
>>>> >>
>>>> >> As discussed in the PR description I opted to fully embrace the
>>>> typing module for the native representation of schema types:
>>>> >> - Primitive types all map to numpy types (e.g. np.int16, np.unicode).
>>>> >> - Arrays map to typing.List. In https://s.apache.org/beam-schemas
>>>> we settled on typing.Collection, but unfortunately this doesn't seem to be
>>>> supported in python 2, I'm open to other suggestions here.
>>>> >> - Map maps to typing.Mapping.
>>>> >> - Rows map to typing.NamedTuple.
>>>> >> - nullability is indicated with typing.Optional. Note there's no
>>>> distinction between Optional[Optional[T]] and Optional[T] in typing, both
>>>> map to Union[T, None] - so this is actually a good analog for the nullable
>>>> flag on FieldType in schema.proto.
>>>> >>
>>>> >> With this approach a schema in Python might look like:
>>>> >> ```
>>>> >> class Movie(NamedTuple):
>>>> >>   name: np.unicode
>>>> >>   year: Optional[np.int16]
>>>> >>
>>>> >> # The class/type annotation syntax doesn't work in Python 2. Instead
>>>> you can use:
>>>> >> # Movie = NamedTuple('Movie', [('name', np.unicode), ('year',
>>>> Optional[np.int16])])
>>>> >>
>>>> >> # DoFns annotated with_output_types(Movie) will use RowCoder
>>>> >> coders.registry.register_coder(Movie, coders.RowCoder)
>>>> >> ```
>>>> >>
>>>> >> I think the choice to use typing.NamedTuple as a row type is
>>>> potentially controversial - Udi, Robert Bradshaw and I were already
>>>> discussing it a bit in a comment on the portable schemas doc [2], but I
>>>> wanted to bring that discussion to the ML.
>>>> >>
>>>> >> On the pro side:
>>>> >> + NamedTuple is a pretty great analog for Java's Row type [3]. Both
>>>> store attributes internally as an ordered collection (List<Object> in Row,
>>>> a tuple in NamedTuple) and provide shortcuts for accessing those attributes
>>>> by field name based on the schema.
>>>> >> +  NamedTuple is a native type, and we're trying to get out of the
>>>> business of defining our own type hints (I think).
>>>> >>
>>>> >> On the con side:
>>>> >> - When using the class-based version of NamedTuple in python 3 a
>>>> user might be tempted to add more functionality to their class (for
>>>> example, define a method) rather than just defining a schema - but I'm not
>>>> sure we're prepared to guarantee that we will always produce an instance of
>>>> their class, just something that has the defined attributes. This concern
>>>> can potentially be alleviated once we have support for logical types.
>>>> >>
>>>> >> Unless there are any objections I think it would make sense to start
>>>> with this implementation (documenting the limitations), and then iterate on
>>>> it. Please take a look at the PR [1] and let me know what you think about
>>>> this proposal.
>>>> >>
>>>> >> Thanks,
>>>> >> Brian
>>>> >>
>>>> >> [1] https://github.com/apache/beam/pull/9188
>>>> >> [2]
>>>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAADSP8gx8
>>>> >> [3]
>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java
>>>>
>>>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
>
> If it is not a big deal supporting both sounds good. I was actually
> referring to your comment about  typing.Collection not being available on
> python 2.
>

Oh, of course - sorry, somehow that completely slipped my mind. Now that I'm
actually thinking it through, you're right that there are several things
that could be cleaner if we make it a python 3 only feature:
- We could use `typing.Collection`
- No need to worry about 2/3 compatibility for strings, we could just use
`str`
- We could just use `bytes` for byte arrays (as a shorthand for
`typing.ByteString` [1])
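
For illustration, the running Movie example could then look something like
this (just a sketch - the extra fields are hypothetical, not from the PR):

```
from typing import Collection, NamedTuple, Optional

import numpy as np

class Movie(NamedTuple):
  name: str                 # plain str instead of np.unicode
  year: Optional[np.int16]  # nullability still expressed with Optional
  tags: Collection[str]     # arrays could use typing.Collection on py3
  poster: Optional[bytes]   # bytes as a shorthand for typing.ByteString
```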

If we do make it a python 3 only feature, how would it be enforced? Just by
documentation, or could I add a version check? I can't find any precedent
for an entire feature that rejects py2 currently.
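
If we did want more than documentation, I imagine the guard could be as
simple as something like this (purely a sketch, not an existing check in
the SDK):

```
import sys

def _require_python3(feature='schemas'):
  # Hypothetical guard: fail fast on py2, e.g. at coder registration time.
  if sys.version_info[0] < 3:
    raise RuntimeError('%s are only supported on Python 3' % feature)
```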

> What happens when one does np.int16(107) + 1, is the numpy type preserved?


Yes, it looks like once you're in numpy land you stay there unless you
explicitly leave with a call to int(). Python ints in an arithmetic
expression like np.int16(107) + 1 are implicitly converted to a reasonable
numpy type (np.int64 if bit_length() < 64, else np.float64).
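
A quick way to see that for yourself (small illustrative snippet - the
exact result width can differ between numpy versions):

```
import numpy as np

x = np.int16(107) + 1
print(type(x))               # still a numpy scalar type, not a plain int
print(int(x), type(int(x)))  # an explicit int() drops back to a Python int
```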

> Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?


That's a good point. I tested this out with mypy and it would not be
compatible:

import numpy as np

def square(x: int) -> int:
    return x * x

square(np.int16(32))  # mypy error - np.int16 is not compatible with int

Users would have to cast to int to make this work, as Chad pointed out. I
was curious about the actual performance cost of these conversions, so I
wrote up a script to benchmark them [2]. Here are the results from running
it on python 2.7.16 on my desktop:

pass:   6.117 ns/op
int to int:     71.524 ns/op
np.int8 to int: 89.784 ns/op
int to np.int8: 89.784 ns/op
np.int8 to np.int8:     89.784 ns/op
np.int16 to int:        86.715 ns/op
int to np.int16:        86.715 ns/op
np.int16 to np.int16:   86.715 ns/op
np.int32 to int:        89.172 ns/op
int to np.int32:        89.172 ns/op
np.int32 to np.int32:   89.172 ns/op
np.int64 to int:        88.072 ns/op
int to np.int64:        88.072 ns/op
np.int64 to np.int64:   88.072 ns/op

It's interesting to note that you don't pay much of a penalty for
converting to/from numpy types compared to just casting something that's
already an int.
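
For a rough idea of the shape of such a micro-benchmark (this is a sketch,
not the exact script in [2]):

```
import timeit

def bench(label, stmt, setup, n=1000000):
  # Report the average per-operation cost of a single conversion expression.
  per_op = timeit.timeit(stmt, setup=setup, number=n) / n
  print('%s:\t%.3f ns/op' % (label, per_op * 1e9))

bench('pass', 'x', 'x = 107')
bench('int to int', 'int(x)', 'x = 107')
bench('np.int16 to int', 'int(x)', 'import numpy as np; x = np.int16(107)')
bench('int to np.int16', 'np.int16(x)', 'import numpy as np; x = 107')
```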

[1] https://docs.python.org/3/library/typing.html#typing.ByteString
[2] https://gist.github.com/TheNeuralBit/158dff3fa90dc46a369bb014e913650d

On Fri, Aug 2, 2019 at 6:43 PM Ahmet Altay <al...@google.com> wrote:

> To clarify, I am happy to start with implementation and iterating on it. I
> do not want to block this late into the discussion.
>
> On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <bh...@google.com> wrote:
>
>> I meant "or sub-class it and define fields with type annotations" not
>> "with attributes". I believe that version doesn't work in python 2 since it
>> doesn't support the `name: type` syntax.
>>
>
> If it is not a big deal supporting both sounds good. I was actually
> referring to your comment about  typing.Collection not being available on
> python 2.
>
>
>>
>> On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bh...@google.com> wrote:
>>
>>> > Do we need to support python 2? If supporting python 2 will complicate
>>> things, we could make this a python3 only feature.
>>> I don't think supporting python 2 complicates things. It's just that
>>> there are two different ways to use typing.NamedTuple in python 3 - you can
>>> either instantiate it and provide a list of (name, type) pairs, or
>>> sub-class it and define fields with attributes. But in python 2 only the
>>> former works.
>>>
>>> > Why are we mapping to numpy types? Design document suggests mapping to
>>> python native types as the plan.
>>> We did discuss using numpy types in a comment [1], but you're right we
>>> never resolved it and the doc still lists native types. My biggest concern
>>> with just using native int/float types is I think we definitely need *some*
>>> way to distinguish between the schema proto's various int/float sizes in
>>> the python representation. If we don't we would need to either a) reject
>>> schemas that contain any size other than the one that we support, or b) no
>>> longer have a bijective mapping between proto and python (i.e. any integer
>>> type that passes through the Python SDK would get converted to an int64).
>>> And if we do need some way to distinguish between the integer types, I
>>> thought a de facto standard was better than creating our own - as Robert
>>> noted in that comment thread "The only strong opinion I have is that we
>>> shouldn't invent our own."
>>>
>>> As I was experimenting with different approaches I also discovered the
>>> numpy numeric types are very nice because you can instantiate them and they
>>> look just like ints, for example `np.int16(107) == 107` evaluates to true
>>> even though `type(np.int16(107)) == type(107)` does not.
>>>
>>> Another concern with python's int type is that it supports unlimited
>>> precision [2], so it's really not a good type to use for any of the schema
>>> ints. My PR as it is right now actually doesn't even support int. I
>>> probably should at least make a change to accept int as a type
>>> specification for iint64 but throw an error when encoding if an int is too
>>> big.
>>>
>>
> I agree with Robert's position of not inventing our own. I assume we could
> make a decision between python native types, arrow types, and numpy types.
>
> What happens when one does np.int16(107) + 1, is the numpy type preserved?
>
>  Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?
>
>
>>
>>> [1]
>>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
>>> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>>>
>>> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote:
>>> >
>>> > Thank you Brian.
>>> >
>>> > I did not spend enough time yet to review. Some early questions, I
>>> apologize if I missed an earlier discussion.
>>> > - Do we need to support python 2? If supporting python 2 will
>>> complicate things, we could make this a python3 only feature.
>>> > - Why are we mapping to numpy types? Design document suggests mapping
>>> to python native types as the plan.

Re: [PROPOSAL] An initial Schema API in Python

Posted by Ahmet Altay <al...@google.com>.
To clarify, I am happy to start with the implementation and iterate on it.
I do not want to block this late into the discussion.

On Fri, Aug 2, 2019 at 6:03 PM Brian Hulette <bh...@google.com> wrote:

> I meant "or sub-class it and define fields with type annotations" not
> "with attributes". I believe that version doesn't work in python 2 since it
> doesn't support the `name: type` syntax.
>

If it is not a big deal supporting both sounds good. I was actually
referring to your comment about  typing.Collection not being available on
python 2.


>
> On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bh...@google.com> wrote:
>
>> > Do we need to support python 2? If supporting python 2 will complicate
>> things, we could make this a python3 only feature.
>> I don't think supporting python 2 complicates things. It's just that
>> there are two different ways to use typing.NamedTuple in python 3 - you can
>> either instantiate it and provide a list of (name, type) pairs, or
>> sub-class it and define fields with attributes. But in python 2 only the
>> former works.
>>
>> > Why are we mapping to numpy types? Design document suggests mapping to
>> python native types as the plan.
>> We did discuss using numpy types in a comment [1], but you're right we
>> never resolved it and the doc still lists native types. My biggest concern
>> with just using native int/float types is I think we definitely need *some*
>> way to distinguish between the schema proto's various int/float sizes in
>> the python representation. If we don't we would need to either a) reject
>> schemas that contain any size other than the one that we support, or b) no
>> longer have a bijective mapping between proto and python (i.e. any integer
>> type that passes through the Python SDK would get converted to an int64).
>> And if we do need some way to distinguish between the integer types, I
>> thought a de facto standard was better than creating our own - as Robert
>> noted in that comment thread "The only strong opinion I have is that we
>> shouldn't invent our own."
>>
>> As I was experimenting with different approaches I also discovered the
>> numpy numeric types are very nice because you can instantiate them and they
>> look just like ints, for example `np.int16(107) == 107` evaluates to true
>> even though `type(np.int16(107)) == type(107)` does not.
>>
>> Another concern with python's int type is that it supports unlimited
>> precision [2], so it's really not a good type to use for any of the schema
>> ints. My PR as it is right now actually doesn't even support int. I
>> probably should at least make a change to accept int as a type
>> specification for iint64 but throw an error when encoding if an int is too
>> big.
>>
>
I agree with Robert's position of not inventing our own. I assume we could
make a decision between python native types, arrow types, and numpy types.

What happens when one does np.int16(107) + 1, is the numpy type preserved?

Another concern I have is, if there is a user function, or a library that
the user does not control, that uses typing to indicate that a function
accepts an int, would it be compatible with numpy types?


>
>> [1]
>> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
>> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>>
>> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote:
>> >
>> > Thank you Brian.
>> >
>> > I did not spend enough time yet to review. Some early questions, I
>> apologize if I missed an earlier discussion.
>> > - Do we need to support python 2? If supporting python 2 will
>> complicate things, we could make this a python3 only feature.
>> > - Why are we mapping to numpy types? Design document suggests mapping
>> to python native types as the plan.
>> >

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
I meant "or sub-class it and define fields with type annotations" not "with
attributes". I believe that version doesn't work in python 2 since it
doesn't support the `name: type` syntax.

On Fri, Aug 2, 2019 at 5:55 PM Brian Hulette <bh...@google.com> wrote:

> > Do we need to support python 2? If supporting python 2 will complicate
> things, we could make this a python3 only feature.
> I don't think supporting python 2 complicates things. It's just that there
> are two different ways to use typing.NamedTuple in python 3 - you can
> either instantiate it and provide a list of (name, type) pairs, or
> sub-class it and define fields with attributes. But in python 2 only the
> former works.
>
> > Why are we mapping to numpy types? Design document suggests mapping to
> python native types as the plan.
> We did discuss using numpy types in a comment [1], but you're right we
> never resolved it and the doc still lists native types. My biggest concern
> with just using native int/float types is I think we definitely need *some*
> way to distinguish between the schema proto's various int/float sizes in
> the python representation. If we don't we would need to either a) reject
> schemas that contain any size other than the one that we support, or b) no
> longer have a bijective mapping between proto and python (i.e. any integer
> type that passes through the Python SDK would get converted to an int64).
> And if we do need some way to distinguish between the integer types, I
> thought a de facto standard was better than creating our own - as Robert
> noted in that comment thread "The only strong opinion I have is that we
> shouldn't invent our own."
>
> As I was experimenting with different approaches I also discovered the
> numpy numeric types are very nice because you can instantiate them and they
> look just like ints, for example `np.int16(107) == 107` evaluates to true
> even though `type(np.int16(107)) == type(107)` does not.
>
> Another concern with python's int type is that it supports unlimited
> precision [2], so it's really not a good type to use for any of the schema
> ints. My PR as it is right now actually doesn't even support int. I
> probably should at least make a change to accept int as a type
> specification for iint64 but throw an error when encoding if an int is too
> big.
>
> [1]
> https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
> [2] https://docs.python.org/3/library/stdtypes.html#typesnumeric
>
> On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote:
> >
> > Thank you Brian.
> >
> > I did not spend enough time yet to review. Some early questions, I
> apologize if I missed an earlier discussion.
> > - Do we need to support python 2? If supporting python 2 will complicate
> things, we could make this a python3 only feature.
> > - Why are we mapping to numpy types? Design document suggests mapping to
> python native types as the plan.
> >

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
On Tue, Aug 6, 2019 at 3:55 AM Robert Bradshaw <ro...@google.com> wrote:

> On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova <ch...@gmail.com> wrote:
> >
> > Hi,
> >
> > This looks like a great feature.
> >
> > Is there a plan to eventually support custom field types?
> >
> > I assume adding support for dataclasses in python 3.7+ should be trivial
> to do in a follow up PR. Do you see any complications with that? The main
> advantage that dataclasses have over NamedTuple in this context is argument
> defaults, which is a nice convenience.
>
> Java has a notion of logical types which has yet to be figured out in
> a cross-langauge way but tackles this exact issue. I think there's a
> lot of value in "anonymous" named tuples as intermediates well, e.g.
> one might to a projection onto a subset of fields, and then do a
> grouping/aggregating operation, in which case the new schema can be
> inferred (even if it doesn't have a name).
>

Yes I think in the future we can add support for dataclasses. In Java,
SchemaCoder handles things like this by inferring schemas from user types
like POJOs and automatically creating functions for converting instances of
those types to/from Row, and we could do something similar for converting
other structured data representations to/from NamedTuple.
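
For example (a hypothetical helper, not something that exists in the PR),
dataclass support could be as thin as deriving the equivalent NamedTuple
type and reusing the existing machinery:

```
import dataclasses
from typing import NamedTuple

def named_tuple_from_dataclass(cls):
  # Hypothetical sketch: build a NamedTuple "row type" from a dataclass so
  # the NamedTuple-based schema inference and RowCoder could be reused.
  return NamedTuple(cls.__name__,
                    [(f.name, f.type) for f in dataclasses.fields(cls)])
```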

I don't think it's trivial to add in a follow-up PR though, because we're
not actually making any guarantees yet that your type and its methods will
be re-created when the pipeline is running, just that we will produce
objects with the expected attributes.
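
To make that concrete, consider a user-defined method on the row class
(illustrative only):

```
from typing import NamedTuple
import numpy as np

class Movie(NamedTuple):
  name: np.unicode
  year: np.int16

  def age(self, now=2019):
    # Nothing guarantees a downstream DoFn receives an actual Movie
    # instance, so this method may not exist on the values it sees.
    return now - self.year
```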


>
> > My PR as it is right now actually doesn’t even support int. I probably
> should at least make a change to accept int as a type specification for
> iint64 but throw an error when encoding if an int is too big.
> >
> > Should probably do the same for float.
> >
> > Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?
> >
> > I have similar concerns. I guess we’ll just have to cast to int before
> passing into 3rd party code, which is not ideal. Why not use int for int64
> in python?
>
> A properly written library should accept any type implementing the
> __int__ (or __index__) methods in place of an int, rather than doing
> explicit type checks, though performance may suffer. Likewise when
> encoding, we should accept all sorts of ints when an int32 (say) is
> expected, rather than force the user to know and cast to the right
> type.
>

But I don't think there's any way for library functions to type hint that
they accept any type implementing __int__, so while a properly written
library may be able to accept np.int* values, I don't think it could write
type hints in such a way that static type checkers would consider them
valid. Never mind - I just came across typing.SupportsInt; mypy will let
you pass any numpy integer type to a function with that type hint.

It seems reasonable for RowCoder to accept anything that supports __int__
when encoding INT*. We could just convert to an int and verify that
bit_length() is short enough for the specified type.
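
A sketch of what that check could look like (illustrative names, not the
actual coder internals, and using an explicit range check instead of
bit_length() to keep the sign handling simple):

```
from typing import SupportsInt

import numpy as np

def encode_int16(value: SupportsInt) -> int:
  # Accept anything implementing __int__ (plain ints, np.int*, ...), but
  # reject values outside the signed 16-bit range at encoding time.
  as_int = int(value)
  if not -(1 << 15) <= as_int < (1 << 15):
    raise ValueError('%r does not fit in an INT16 field' % (value,))
  return as_int

encode_int16(np.int16(107))  # fine - numpy integer types satisfy SupportsInt
```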


>
> As for the mappings between Python types and schemas, there are
> several mappings that are somewhat being conflated.
>
> (1) There is the mapping used in definitions. At the moment, all
> subclasses of NamedTuple map to the same generic Row schema type,
> probably not something we want in the long run (but could be OK for
> now if we think doing better in the future counts as backwards
> compatible). For integral types, it makes sense to accept
> np.int{8,16,32,64}, but should we accept the equivalent arrow types
> here as well? I think we also need to accept the plain Python "int"
> and "float" type, otherwise a standard Python class like
>
>     NamedTuple('Word', [('name', str), ('rank', int), ('frequency', float)]
>
> will be surprisingly rejected.
>

Agreed on float since it seems to trivially map to a double, but I'm still
torn on int. While I do want int type hints to work, it doesn't seem
appropriate to map it to AtomicType.INT64, since it has a completely
different range of values. I wonder if instead there should be a
variable-width integer primitive (or logical type) that int can map to.
Since other SDKs may not have a type to map this to, it may need to be cast
(explicitly or implicitly) to some other integer type to interoperate.


>
> (2) The mapping of concrete values to Python types. Rows mapping to
> NamedTuples may give expectations beyond the attributes they offer
> (and I'd imagine we'll want to be flexible with the possible
> representations here, e.g. offering a slice of an arrow record batch).
> Or do we need to pay the cost of re-creating the users NamedTuple
> subclass. Ints are another interesting case--it may be quite
> surprising to users for the returned values to have silent truncating
> overflow semantics (very unlike Python) rather than the arbitrary
> precision that Python's ints give (especially if the conversion from a
> python int to an int64 happens due to an invisible fusion barrier).
> Better to compute the larger value and then thrown an error if/when it
> is encoded into a fixed width type later.
>

NamedTuples: Yeah, I'm starting to wonder if we should hide the use of
NamedTuple as an implementation detail somehow. It causes a couple of
issues:
1. As you point out, it can give users expectations beyond just the
attributes.
2. Currently you have to specifically call register_coder(MyRowType,
RowCoder) to opt in to using RowCoder for your type, since I thought
someone might want to use NamedTuple without RowCoder and would be
surprised if we tried to shoe-horn it into RowCoder by default. But it
would be nice if users didn't have to do that.

Ints: But is it really any better for the user to say they want a schema
with a native int field, and then get a runtime error when we try to encode
a particular value because it requires > 64 bits? It seems preferable to
make them specify int64 for these fields, so they're forced to acknowledge
that this can happen.


>
> (3) The mapping of Python values into a row (e.g. for serialization).
> If there are errors (e.g. a DoFn produces tuples of the wrong type),
> how eagerly can we detect them? At what cost? How strict should we be
> (e.g. if a named tuple with certain fields is expected, can we map a
> concrete subclass to it? A NamedTuple that has a superset of the
> fields? Implicitly mapping Python's float (aka a 64-bit C double) to a
> float32 is a particularly sticky question.
>
> I think we can make forward progress on implementation in parallel to
> answering these questions, but we should be explicit and document what
> the best options are here and then get the code to align.

Re: [PROPOSAL] An initial Schema API in Python

Posted by Robert Bradshaw <ro...@google.com>.
On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova <ch...@gmail.com> wrote:
>
> Hi,
>
> This looks like a great feature.
>
> Is there a plan to eventually support custom field types?
>
> I assume adding support for dataclasses in python 3.7+ should be trivial to do in a follow up PR. Do you see any complications with that? The main advantage that dataclasses have over NamedTuple in this context is argument defaults, which is a nice convenience.

Java has a notion of logical types which has yet to be figured out in
a cross-language way but tackles this exact issue. I think there's a
lot of value in "anonymous" named tuples as intermediates as well, e.g.
one might do a projection onto a subset of fields, and then do a
grouping/aggregating operation, in which case the new schema can be
inferred (even if it doesn't have a name).

> My PR as it is right now actually doesn’t even support int. I probably should at least make a change to accept int as a type specification for iint64 but throw an error when encoding if an int is too big.
>
> Should probably do the same for float.
>
> Another concern I have is, if there is a user function or a library that user does not control, that uses typing to indicate that a function accepts a type of int, would it be compatible with numpy types?
>
> I have similar concerns. I guess we’ll just have to cast to int before passing into 3rd party code, which is not ideal. Why not use int for int64 in python?

A properly written library should accept any type implementing the
__int__ (or __index__) methods in place of an int, rather than doing
explicit type checks, though performance may suffer. Likewise when
encoding, we should accept all sorts of ints when an int32 (say) is
expected, rather than force the user to know and cast to the right
type.

As for the mappings between Python types and schemas, there are
several mappings that are somewhat being conflated.

(1) There is the mapping used in definitions. At the moment, all
subclasses of NamedTuple map to the same generic Row schema type,
probably not something we want in the long run (but could be OK for
now if we think doing better in the future counts as backwards
compatible). For integral types, it makes sense to accept
np.int{8,16,32,64}, but should we accept the equivalent arrow types
here as well? I think we also need to accept the plain Python "int"
and "float" type, otherwise a standard Python class like

    NamedTuple('Word', [('name', str), ('rank', int), ('frequency', float)])

will be surprisingly rejected.

(2) The mapping of concrete values to Python types. Rows mapping to
NamedTuples may give expectations beyond the attributes they offer
(and I'd imagine we'll want to be flexible with the possible
representations here, e.g. offering a slice of an arrow record batch).
Or do we need to pay the cost of re-creating the user's NamedTuple
subclass? Ints are another interesting case--it may be quite
surprising to users for the returned values to have silent truncating
overflow semantics (very unlike Python) rather than the arbitrary
precision that Python's ints give (especially if the conversion from a
python int to an int64 happens due to an invisible fusion barrier).
Better to compute the larger value and then throw an error if/when it
is encoded into a fixed-width type later.

(3) The mapping of Python values into a row (e.g. for serialization).
If there are errors (e.g. a DoFn produces tuples of the wrong type),
how eagerly can we detect them? At what cost? How strict should we be
(e.g. if a named tuple with certain fields is expected, can we map a
concrete subclass to it? A NamedTuple that has a superset of the
fields?) Implicitly mapping Python's float (aka a 64-bit C double) to a
float32 is a particularly sticky question.

I think we can make forward progress on implementation in parallel to
answering these questions, but we should be explicit and document what
the best options are here and then get the code to align.

Re: [PROPOSAL] An initial Schema API in Python

Posted by Chad Dombrova <ch...@gmail.com>.
Hi,

This looks like a great feature.

Is there a plan to eventually support custom field types?

I assume adding support for dataclasses in python 3.7+ should be trivial to
do in a follow up PR. Do you see any complications with that? The main
advantage that dataclasses have over NamedTuple in this context is argument
defaults, which is a nice convenience.
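
For concreteness, the dataclass form of the Movie example might look like
this (sketch only, assuming dataclass support were added):

```
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Movie:
  name: str
  year: Optional[np.int16] = None  # field defaults come for free here
```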

> My PR as it is right now actually doesn’t even support int. I probably
> should at least make a change to accept int as a type specification for
> int64 but throw an error when encoding if an int is too big.

Should probably do the same for float.

> Another concern I have is, if there is a user function or a library that
> user does not control, that uses typing to indicate that a function accepts
> a type of int, would it be compatible with numpy types?

I have similar concerns. I guess we’ll just have to cast to int before
passing into 3rd party code, which is not ideal. Why not use int for int64
in python?

-chad

Re: [PROPOSAL] An initial Schema API in Python

Posted by Brian Hulette <bh...@google.com>.
> Do we need to support python 2? If supporting python 2 will complicate
things, we could make this a python3 only feature.
I don't think supporting python 2 complicates things. It's just that there
are two different ways to use typing.NamedTuple in python 3 - you can
either instantiate it and provide a list of (name, type) pairs, or
sub-class it and define fields with attributes. But in python 2 only the
former works.

> Why are we mapping to numpy types? Design document suggests mapping to
python native types as the plan.
We did discuss using numpy types in a comment [1], but you're right we
never resolved it and the doc still lists native types. My biggest concern
with just using native int/float types is I think we definitely need *some*
way to distinguish between the schema proto's various int/float sizes in
the python representation. If we don't we would need to either a) reject
schemas that contain any size other than the one that we support, or b) no
longer have a bijective mapping between proto and python (i.e. any integer
type that passes through the Python SDK would get converted to an int64).
And if we do need some way to distinguish between the integer types, I
thought a de facto standard was better than creating our own - as Robert
noted in that comment thread "The only strong opinion I have is that we
shouldn't invent our own."
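
To make the bijectivity point concrete, the kind of mapping I have in mind
is one distinct Python type per proto type, along these lines (a sketch -
the names follow schema.proto's AtomicType, and this is not the PR's exact
table):

```
import numpy as np

PRIMITIVE_TYPES = {
    np.int8: 'BYTE',
    np.int16: 'INT16',
    np.int32: 'INT32',
    np.int64: 'INT64',
    np.float32: 'FLOAT',
    np.float64: 'DOUBLE',
    np.unicode: 'STRING',  # np.unicode is an alias for the native str
    bool: 'BOOLEAN',
    bytes: 'BYTES',
}
```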

As I was experimenting with different approaches I also discovered the
numpy numeric types are very nice because you can instantiate them and they
look just like ints, for example `np.int16(107) == 107` evaluates to true
even though `type(np.int16(107)) == type(107)` does not.

Another concern with python's int type is that it supports unlimited
precision [2], so it's really not a good type to use for any of the schema
ints. My PR as it is right now actually doesn't even support int. I
probably should at least make a change to accept int as a type
specification for int64 but throw an error when encoding if an int is too
big.

[1]
https://docs.google.com/a/google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?disco=AAAACtLItNA
[2] https://docs.python.org/3/library/stdtypes.html#typesnumeric

On Fri, Aug 2, 2019 at 4:12 PM Ahmet Altay <al...@google.com> wrote:
>
> Thank you Brian.
>
> I did not spend enough time yet to review. Some early questions, I
apologize if I missed an earlier discussion.
> - Do we need to support python 2? If supporting python 2 will complicate
things, we could make this a python3 only feature.
> - Why are we mapping to numpy types? Design document suggests mapping to
python native types as the plan.
>

Re: [PROPOSAL] An initial Schema API in Python

Posted by Ahmet Altay <al...@google.com>.
Thank you Brian.

I did not spend enough time yet to review. Some early questions, I
apologize if I missed an earlier discussion.
- Do we need to support python 2? If supporting python 2 will complicate
things, we could make this a python3 only feature.
- Why are we mapping to numpy types? Design document suggests mapping to
python native types as the plan.
