Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2020/04/24 18:56:16 UTC

Removing DATETIME in Java Schemas

When we created the portable representation of schemas last summer we
intentionally did not include DATETIME as a primitive type [1], even though
it already existed as a primitive type in Java [2]. There seemed to be
consensus around a couple of points: 1) At the very least DATETIME is a
poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
logical type is better suited for this purpose.

Since then, it's been my intention to replace Java's DATETIME with a
MillisInstant logical type backed by Long milliseconds since the epoch (see
BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
keeps tests passing): https://github.com/apache/beam/pull/11456
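
For anyone who hasn't looked at the PR, the new type is shaped roughly
like this (an illustrative sketch only - the URN is made up, the
argument-related methods on Schema.LogicalType are stubbed out, and the
real implementation may differ):

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.joda.time.Instant;

    public class MillisInstant implements Schema.LogicalType<Instant, Long> {
      @Override
      public String getIdentifier() {
        return "beam:logical_type:millis_instant:v1"; // illustrative URN
      }

      @Override
      public FieldType getArgumentType() {
        return null; // no type argument
      }

      @Override
      public FieldType getBaseType() {
        return FieldType.INT64; // millis since the epoch
      }

      @Override
      public Long toBaseType(Instant input) {
        return input.getMillis();
      }

      @Override
      public Instant toInputType(Long base) {
        return new Instant(base);
      }
    }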

There could be a couple of concerns with this PR that I think would be
better discussed here rather than on github.


## Breaking changes in low-level APIs
The PR includes some breaking changes in public, low-level schema APIs: It
removes DATETIME from the TypeName enum [4], and it will also change the
type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
milliseconds since epoch rather than org.joda.time.Instant). Both of these
changes have the potential to break user code (see the sketch after this
list). That said, I think the risk is justified for a few reasons:
- These are lower-level APIs that we don't intend for most users to use.
The preferred higher-level APIs (SQL, Schema transforms, and inferred
schemas for user types) should seamlessly transition over to the new
MillisInstant logical type.
- Schemas are an experimental feature.
- We can clearly document this breaking change in the release that includes
it.
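
To make the second change concrete, here is the kind of user code that
would be affected (a hedged sketch: "ts" is a hypothetical field of the
old DATETIME / new MillisInstant type, and I'm assuming the higher-level
getDateTime accessor keeps working, which is the intent):

    import org.apache.beam.sdk.values.Row;
    import org.joda.time.ReadableDateTime;

    static void readTimestamp(Row row) {
      // Before: DATETIME was primitive, so the base value was the joda
      // Instant itself:
      //   Instant ts = (Instant) row.getBaseValue("ts");

      // After: MillisInstant's base type is INT64, so getBaseValue returns
      // the raw millis, and the cast above throws a ClassCastException.
      Long millis = row.getBaseValue("ts");

      // Higher-level accessors are unchanged and still return joda time.
      ReadableDateTime ts = row.getDateTime("ts");
    }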


## Mixing joda and java 8 time
The NanosInstant logical type that Alex added uses java.time.Instant as
its input type, while my MillisInstant type uses org.joda.time.Instant for
compatibility with the rest of Beam and the previous DATETIME primitive
type. It feels weird, but it also makes a certain sort of sense to use joda
time (which has millisecond precision) for MillisInstant, and java 8 time
(which has nanos) for NanosInstant. Also, the choice shouldn't have a
significant effect on end users - the schema inference code could generate
conversions between java 8 time and joda time (as we already do for
converting between various joda time types [5]) so user types can use
either one.
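
For reference, a hand-written sketch of the conversions the generated code
would need (fully-qualified names since the two Instant classes clash):

    // joda -> java 8 is exact; both count millis since the epoch.
    static java.time.Instant toJava8(org.joda.time.Instant joda) {
      return java.time.Instant.ofEpochMilli(joda.getMillis());
    }

    // java 8 -> joda truncates any sub-millisecond precision.
    static org.joda.time.Instant toJoda(java.time.Instant java8) {
      return new org.joda.time.Instant(java8.toEpochMilli());
    }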


## Arbitrary Logical Types and SQL
Previously much of the SQL code was written to operate on the _base type_
value for any logical types. So for the new MillisInstant type, SQL would
attempt to operate on the underlying Long, rather than on an
org.joda.time.Instant instance. Thus when I switched over to a
MillisInstant logical type as the default for SQL date and time types, any
tests that used them began failing with ClassCastExceptions.

My solution was just to update SQL code to only ever reference the input
type (i.e. org.joda.time.Instant) for logical types. A good example of this
type of change is in BeamAggregationRel [6].
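
In code terms the change is roughly the following (simplified for
illustration; extractField and fieldIndex are invented names, not the
PR's):

    import org.apache.beam.sdk.values.Row;

    // Old pattern (base type): returns a Long for MillisInstant fields,
    // which fails with a ClassCastException wherever a joda Instant is
    // expected:
    //   Object value = row.getBaseValue(fieldIndex);

    // New pattern (input type): returns the org.joda.time.Instant itself.
    static Object extractField(Row row, int fieldIndex) {
      return row.getValue(fieldIndex);
    }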

My change does pass all of the SQL tests (including internal Google tests),
but I'm a little concerned that using the base type throughout SQL was a
conscious decision that I'm stomping on. In theory, it could ensure that
SqlTransform would be able to operate on user types that contain custom
user logical types, without requiring SqlTransform to understand those
types. This is just speculation though, there aren't actually any tests
verifying that functionality. Also, I don't think any such tests would have
worked since there are several places where we explicitly check for
particular logical types (e.g. in BeamCalcRel [7]).

Personally I'm ok with this move, because I'm against the idea of
implicitly stripping away logical types like this in schema-aware
transforms. A logical type indicates that the value has some additional
meaning beyond just the base type (and maybe a restricted range), and I
don't think transforms should be able to ignore that meaning unless the
user explicitly allows it, or first converts to the base type themselves.



If anyone has any concerns with these points could you raise them here?
Otherwise I'd like to move ahead with the PR.
(Sorry this message turned out to be a bit of a novel. LMK if it would be
better served by separate threads, or maybe a doc.)

Thanks!
Brian

[1]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
[2]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
[3] https://issues.apache.org/jira/browse/BEAM-7554
[4]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
[5]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
[6]
https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
[7]
https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409

Re: Removing DATETIME in Java Schemas

Posted by Kenneth Knowles <ke...@apache.org>.
This seems like a good idea. This stuff is all still marked "experimental"
for exactly this reason. This is a case where the name fits perfectly. Both
SQL and schemas are new and still working towards a form that can be
supported indefinitely without layers of workarounds that will never
quiesce. I think the user pain here is minimal. I don't think a deprecation
cycle is terrifically important, but we could do one release.

As for SQL, I believe the following:

 - we have multiple SQL dialects with different type systems
 - our SQL planner/runtime also has its own type system
 - when you are converting a Beam schema to pass to SqlTransform, you need
to provide a conversion from the Beam schema type system to the SQL
dialect's type system (sometimes the conversion is a noop)

So my thinking about the conversion:

 - it is OK to convert known logical types like MillisInstant or
NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
TIME ZONE"
 - it is OK to convert unknown logical types into their representation type
 - coming out the other side of a SQL statement, a user should expect the
type to be a row where every field has a type from the SQL dialect; the
input types are gone

I've heard some arguments (on list? offline?) that operating directly on
the representation of a logical type is wrong, because you can't really
know that what you are doing results in a valid value. I agree with that,
basically. That's why I frame it as a one-way conversion from the input
type to the SQL dialect's type. You aren't operating on the logical type
any more.

I feel I am probably forgetting some important details about the "logical
type inlining" discussion, so others should speak up if they have more to
add.

There is also a backwards-compatibility lesson here: don't show users your
enums if you can avoid it. If DATETIME had always been accessed via a
(trivial) factory method then changing the implementation would be
backwards compatible.
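
A hypothetical illustration (datetime() is invented, and MillisInstant is
the sketch from Brian's message - not actual Beam API):

    import org.apache.beam.sdk.schemas.Schema;

    // Today the enum value itself is public API, so deleting
    // TypeName.DATETIME breaks source compatibility for every caller:
    //   Schema.FieldType f = Schema.FieldType.DATETIME;

    // Had access gone through a trivial factory instead, the body could be
    // re-pointed at the logical type without breaking callers:
    static Schema.FieldType datetime() {
      return Schema.FieldType.logicalType(new MillisInstant());
    }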

Kenn

Re: Removing DATETIME in Java Schemas

Posted by Kenneth Knowles <ke...@apache.org>.
On Fri, May 15, 2020 at 10:33 PM Reuven Lax <re...@google.com> wrote:

>
>
> On Fri, May 15, 2020 at 8:10 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>>
>> On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>> After thinking about this more I've softened on it some, but I'm still a
>>> little wary. I like Kenn's suggestion:
>>>
>>> > - it is OK to convert known logical types like MillisInstant or
>>> NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
>>> TIME ZONE"
>>> > - it is OK to convert unknown logical types into their representation
>>> type
>>> > - coming out the other side of a SQL statement, a user should expect
>>> the type to be a row where every field has a type from the SQL dialect; the
>>> input types are gone
>>>
>>> I think what you're proposing is basically a preprocessing step on
>>> SqlTransform that demotes any unrecognized logical types to their
>>> representation. Is that a fair characterization?
>>>
>>
>> Exactly
>>
>>
>>> I wonder if we could push the demotion further into the transform to
>>> enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." -
>>> if some logical type SQL knows nothing about is pulled into the SELECT *,
>>> it seems reasonable (from a user-perspective) to just pass the type through
>>> untouched, and automatically demoting it might be surprising.
>>>
>>
>> To me, it seems fancy to pass the type through. And I expect users will
>> soon try to do non-pass-through things involving equality or ordering,
>> where only specialized representations work right. So for the very limited
>> cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to
>> make it happen. If Calcite implicit conversions are a good way to implement
>> this, that is fine by me.
>>
>> Put another way: if the downstream transforms expect to get the logical
>> type back, won't automatic schema conversion already "just work"?
>>
>
> No. Let's say you are reading protos and writing protos with a SQL filter
> statement in between. If we lose the logical type, then the output schema
> may not be able to match against the proto schema. .e.g. if the proto has a
> field of type FIXED32, we reflect this with a Fixed32 logical type with a
> base type of integer. SQL can operate on the base type just fine because
> it's just an int32, but if you strip the logical type from the output then
> it will have trouble matching against the proto schema as Beam will infer
> an int32 type.
>

Makes sense. If we implicitly convert only one way then we create an
asymmetry. I missed the fact that we wouldn't coerce the representation
type back to the logical type.

Kenn


>
>> (I don't have any response to the avro/proto points except I think I
>> agree with both of you)
>>
>> Kenn
>>
>> What if we had the ability to implicitly convert logical types to their
>>> representation? Then the above example would work, because no implicit
>>> conversion is necessary, and also relations that introspect the logical
>>> type (e.g. projecting f_location.lat) could work by including an implicit
>>> conversion. The implicit conversion would still be a one-way operation, but
>>> it would only happen when necessary.
>>>
>>> I really have no idea how complicated this would be to implement. Would
>>> it be feasible? Is it ill-advised?
>>>
>>> Brian
>>>
>>> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> I would not describe the base type as the "wire type." If that were
>>>> true, then the only base type we support should be byte array.
>>>>
>>>> My simple point is that this is no different than normal schema fields.
>>>> You will find many normal schemas containing data encoded into other field
>>>> types. You will also find this in SQL databases. If naming these fields
>>>> with a logical type causes us to throw up our hands and declare that they
>>>> can't be introspectable, then our users will likely just avoid logical
>>>> types and encode this data into their own LONG and STRING fields. It will
>>>> also mean that SQL will not work well over Avro or Proto input.
>>>>
>>> Is the Avro/Proto connection really a problem? I would think that we
>>> could map most common Avro or Proto types to either primitive types or
>>> "standard" logical types which are the same as the ones used by SQL
>>> wherever possible, e.g. NanosInstant and NanosDuration added for proto.
>>>
>>>>
>>>> Reuven
>>>>
>>>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <ap...@google.com>
>>>> wrote:
>>>>
>>>>> My understanding is that the base type is effectively the wire format
>>>>> of the type, where the logical type is the in-memory representation for
>>>>> Java. For org.joda.time.Instant, this is just a wrapper around the
>>>>> underlying Long. However, for the Date logical type, the LocalDate type
>>>>> has a struct as the in-memory representation. There is a performance penalty to
>>>>> this conversion and I think there are cases where users (SQL for example)
>>>>> would want a fast path that skips the serialization/deserialization steps
>>>>> if possible.
>>>>>
>>>>> As for more complex logical types, I would urge caution in making
>>>>> assumptions around the base types. I expect there will be base types of
>>>>> strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL
>>>>> has a civil time module that packs a struct into a Long. All of these are
>>>>> things you can write SQL around, but I think the users might be somewhat
>>>>> surprised when they come out less than usable by default.
>>>>>
>>>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> When we created the portable representation of schemas last summer
>>>>>>> we intentionally did not include DATETIME as a primitive type [1], even
>>>>>>> though it already existed as a primitive type in Java [2]. There seemed to
>>>>>>> be consensus around a couple of points: 1) At the very least DATETIME is a
>>>>>>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>>>>>>> logical type is better suited for this purpose.
>>>>>>>
>>>>>>> Since then, it's been my intention to replace Java's DATETIME with a
>>>>>>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>>>>>>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>>>>>>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>>>>>>
>>>>>>> There could be a couple of concerns with this PR that I think would
>>>>>>> be better discussed here rather than on github.
>>>>>>>
>>>>>>>
>>>>>>> ## Breaking changes in low-level APIs
>>>>>>> The PR includes some breaking changes in public, low-level schema
>>>>>>> APIs: It removes DATETIME from the TypeName enum [4], and it will also
>>>>>>> change the type returned by Row#getBaseValue for DATETIME/MillisInstant
>>>>>>> fields (Long milliseconds since epoch rather than org.joda.time.Instant).
>>>>>>> Both of these changes have the potential to break user code. That being
>>>>>>> said I think the risk is justified for a few reasons:
>>>>>>> - These are lower-level APIs that we don't intend for most users to
>>>>>>> use. The preferred higher-level APIs (SQL, Schema transforms, and inferred
>>>>>>> schemas for user types) should seamlessly transition over to the new
>>>>>>> MillisInstant logical type.
>>>>>>> - Schemas are an experimental feature.
>>>>>>> - We can clearly document this breaking change in the release that
>>>>>>> includes it.
>>>>>>>
>>>>>>
>>>>>> I am a bit worried about this. Schemas have been marked experimental
>>>>>> for almost two years, and I've found that many Beam users are using them.
>>>>>> If we make a breaking change, can we make sure that it is extremely clear
>>>>>> to users how to fix their code? I'm afraid we are in the situation now
>>>>>> where the code is theoretically experimental but in practice widely used.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## Mixing joda and java 8 time
>>>>>>> The NanosInstant logical type that Alex added uses java.time.Instant
>>>>>>> as its input type, while my MillisInstant type uses org.joda.time.Instant
>>>>>>> for compatibility with the rest of Beam and the previous DATETIME primitive
>>>>>>> type. It feels weird, but it also makes a certain sort of sense to use joda
>>>>>>> time (which has millisecond precision) for MillisInstant, and java 8 time
>>>>>>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>>>>>>> significant effect on end users - the schema inference code could generate
>>>>>>> conversions between java 8 time and joda time (as we already do for
>>>>>>> converting between various joda time types [5]) so user types can use
>>>>>>> either one.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ## Arbitrary Logical Types and SQL
>>>>>>> Previously much of the SQL code was written to operate on the _base
>>>>>>> type_ value for any logical types. So for the new MillisInstant type, SQL
>>>>>>> would attempt to operate on the underlying Long, rather than on an
>>>>>>> org.joda.time.Instant instance. Thus when I switched over to a
>>>>>>> MillisInstant logical type as the default for SQL date and time types, any
>>>>>>> tests that used them began failing with ClassCastExceptions.
>>>>>>>
>>>>>>> My solution was just to update SQL code to only ever reference the
>>>>>>> input type (i.e. org.joda.time.Instant) for logical types. A good example
>>>>>>> of this type of change is in BeamAggregationRel [6].
>>>>>>>
>>>>>>> My change does pass all of the SQL tests (including internal Google
>>>>>>> tests), but I'm a little concerned that using the base type throughout SQL
>>>>>>> was a conscious decision that I'm stomping on. In theory, it could ensure
>>>>>>> that SqlTransform would be able to operate on user types that contain
>>>>>>> custom user logical types, without requiring SqlTransform to understand
>>>>>>> those types. This is just speculation though, there aren't actually any
>>>>>>> tests verifying that functionality. Also, I don't think any such tests
>>>>>>> would have worked since there are several places where we explicitly check
>>>>>>> for particular logical types (e.g. in BeamCalcRel [7]).
>>>>>>>
>>>>>>> Personally I'm ok with this move, because I'm against the idea of
>>>>>>> implicitly stripping away logical types like this in schema-aware
>>>>>>> transforms. A logical type indicates that the value has some additional
>>>>>>> meaning beyond just the base type (and maybe a restricted range), and I
>>>>>>> don't think transforms should be able to ignore that meaning unless the
>>>>>>> user explicitly allows it, or first converts to the base type themselves.
>>>>>>>
>>>>>>
>>>>>> I disagree with this personally:
>>>>>>
>>>>>> Arguably, _every_ schema has "additional meaning" beyond the fields,
>>>>>> and a LogicalType is just a nice way of naming that and allowing the use of
>>>>>> other classes to manage that type. For example, let's say you have a record
>>>>>> containing latitude and longitude fields - today you can easily write SQL
>>>>>> over these fields. Later you decide that this is a very common pattern in
>>>>>> your data and decide to create a Location logical type; this also lets you
>>>>>> represent this with a convenience Location class that has many other
>>>>>> helpful features (distance functions, builders, etc.). It seems unfortunate
>>>>>> if you suddenly lose the ability to easily run SQL over this type.
>>>>>> Introducing a logical type hasn't changed the semantic meaning of your data
>>>>>> at all. All it has done is let you name it, and it would be unfortunate if
>>>>>> analyzing your data stream suddenly became harder as a result.
>>>>>>
>>>>>
>>> I would argue that defining the logical type *has* changed the semantic
>>> meaning. Before you just had a tuple of floating point numbers that
>>> happened to have fields named lat and long. After defining a logical type
>>> you have a location, and lat and long have a restricted range.
>>>
>>>
>>>>>> Another reason I don't like this, is that the pattern that we have
>>>>>> encouraged (and already use extensively) is to use logical types when
>>>>>> converting from external formats such as avro, proto, kryo, etc.  If
>>>>>> logical types aren't supported by SQL, then we are effectively saying that
>>>>>> data coming from these formats can't be easily analyzed with SQL (or SQL
>>>>>> will have to special case the logical types used by every single such input
>>>>>> format, which does not seem sustainable).
>>>>>>
>>>>>
>>>>>> I think the problem you ran into is that today an aggregation deals
>>>>>> with logical types in an all-or-nothing manner. You either aggregate
>>>>>> everything by the representation type or everything by the logical type. If
>>>>>> we instead allowed SQL to specify well-known logical types (e.g. Instant),
>>>>>> I think that this problem would go away.

Re: Removing DATETIME in Java Schemas

Posted by Reuven Lax <re...@google.com>.
On Fri, May 15, 2020 at 8:10 PM Kenneth Knowles <ke...@apache.org> wrote:

>
>
> On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bh...@google.com> wrote:
>
>> After thinking about this more I've softened on it some, but I'm still a
>> little wary. I like Kenn's suggestion:
>>
>> > - it is OK to convert known logical types like MillisInstant or
>> NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
>> TIME ZONE"
>> > - it is OK to convert unknown logical types into their representation
>> type
>> > - coming out the other side of a SQL statement, a user should expect
>> the type to be a row where every field has a type from the SQL dialect; the
>> input types are gone
>>
>> I think what you're proposing is basically a preprocessing step on
>> SqlTransform that demotes any unrecognized logical types to their
>> representation. Is that a fair characterization?
>>
>
> Exactly
>
>
>> I wonder if we could push the demotion further into the transform to
>> enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." -
>> if some logical type SQL knows nothing about is pulled into the SELECT *,
>> it seems reasonable (from a user-perspective) to just pass the type through
>> untouched, and automatically demoting it might be surprising.
>>
>
> To me, it seems fancy to pass the type through. And I expect users will
> soon try to do non-pass-through things involving equality or ordering,
> where only specialized representations work right. So for the very limited
> cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to
> make it happen. If Calcite implicit conversions are a good way to implement
> this, that is fine by me.
>
> Put another way: if the downstream transforms expect to get the logical
> type back, won't automatic schema conversion already "just work"?
>

No. Let's say you are reading protos and writing protos with a SQL filter
statement in between. If we lose the logical type, then the output schema
may not be able to match against the proto schema. E.g. if the proto has a
field of type FIXED32, we reflect this with a Fixed32 logical type with a
base type of integer. SQL can operate on the base type just fine because
it's just an int32, but if you strip the logical type from the output then
it will have trouble matching against the proto schema as Beam will infer
an int32 type.
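
Schematically (a hedged sketch: Fixed32 stands in for the proto
extension's logical type, and the builder calls assume the standard
Schema.Builder API):

    import org.apache.beam.sdk.schemas.Schema;

    static void sketch() {
      // Input schema inferred from the proto: f_fixed carries a Fixed32
      // logical type whose base type is INT32.
      Schema fromProto = Schema.builder()
          .addLogicalTypeField("f_fixed", new Fixed32())
          .build();

      // If SQL strips the logical type, the output schema is just:
      Schema afterSql = Schema.builder()
          .addInt32Field("f_fixed")
          .build();

      // Nothing in afterSql says f_fixed should map back to a proto FIXED32
      // field, so matching the output against the proto schema fails.
    }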


Re: Removing DATETIME in Java Schemas

Posted by Kenneth Knowles <ke...@apache.org>.
On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bh...@google.com> wrote:

> After thinking about this more I've softened on it some, but I'm still a
> little wary. I like Kenn's suggestion:
>
> > - it is OK to convert known logical types like MillisInstant or
> NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
> TIME ZONE"
> > - it is OK to convert unknown logical types into their representation
> type
> > - coming out the other side of a SQL statement, a user should expect the
> type to be a row where every field has a type from the SQL dialect; the
> input types are gone
>
> I think what you're proposing is basically a preprocessing step on
> SqlTransform that demotes any unrecognized logical types to their
> representation. Is that a fair characterization?
>

Exactly
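
Something like this hypothetical, non-recursive sketch - a real pass would
also have to rewrite the Row values and recurse into nested rows, arrays,
and maps, not just fix up the top-level schema:

    import java.util.Set;
    import org.apache.beam.sdk.schemas.Schema;

    static Schema demoteUnknownLogicalTypes(Schema schema, Set<String> knownIds) {
      Schema.Builder builder = Schema.builder();
      for (Schema.Field field : schema.getFields()) {
        Schema.FieldType type = field.getType();
        if (type.getTypeName().isLogicalType()
            && !knownIds.contains(type.getLogicalType().getIdentifier())) {
          // Demote to the representation ("base") type, preserving
          // nullability.
          builder.addField(
              field.getName(),
              type.getLogicalType().getBaseType().withNullable(type.getNullable()));
        } else {
          builder.addField(field.getName(), type);
        }
      }
      return builder.build();
    }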


> I wonder if we could push the demotion further into the transform to
> enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." -
> if some logical type SQL knows nothing about is pulled into the SELECT *,
> it seems reasonable (from a user-perspective) to just pass the type through
> untouched, and automatically demoting it might be surprising.
>

To me, it seems fancy to pass the type through. And I expect users will
soon try to do non-pass-through things involving equality or ordering,
where only specialized representations work right. So for the very limited
cases where it doesn't matter, I'm not opposed, but I wouldn't work hard to
make it happen. If Calcite implicit conversions are a good way to implement
this, that is fine by me.

Put another way: if the downstream transforms expect to get the logical
type back, won't automatic schema conversion already "just work"?

(I don't have any response to the avro/proto points except I think I agree
with both of you)

Kenn

Re: Removing DATETIME in Java Schemas

Posted by Reuven Lax <re...@google.com>.
On Fri, May 15, 2020 at 5:25 PM Brian Hulette <bh...@google.com> wrote:

> After thinking about this more I've softened on it some, but I'm still a
> little wary. I like Kenn's suggestion:
>
> > - it is OK to convert known logical types like MillisInstant or
> NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
> TIME ZONE"
> > - it is OK to convert unknown logical types into their representation
> type
> > - coming out the other side of a SQL statement, a user should expect the
> type to be a row where every field has a type from the SQL dialect; the
> input types are gone
>
> I think what you're proposing is basically a preprocessing step on
> SqlTransform that demotes any unrecognized logical types to their
> representation. Is that a fair characterization?
>

So SELECT * FROM stream would modify the stream? I guess we could do that,
but it might be a bit surprising if SELECT * is not faithful.


>
> I wonder if we could push the demotion further into the transform to
> enable some additional uses. For example "SELECT * FROM PCOLL WHERE ..." -
> if some logical type SQL knows nothing about is pulled into the SELECT *,
> it seems reasonable (from a user-perspective) to just pass the type through
> untouched, and automatically demoting it might be surprising.
>

> What if we had the ability to implicitly convert logical types to their
> representation? Then the above example would work, because no implicit
> conversion is necessary, and also relations that introspect the logical
> type (e.g. projecting f_location.lat) could work by including an implicit
> conversion. The implicit conversion would still be a one-way operation, but
> it would only happen when necessary.
>

> I really have no idea how complicated this would be to implement. Would it
> be feasible? Is it ill-advised?
>

I kinda like it. I don't know if it would be complicated to implement, but
maybe tricky?



>
> Brian
>
> On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:
>
>> I would not describe the base type as the "wire type." If that were true,
>> then the only base type we should support should be byte array.
>>
>> My simple point is that this is no different than normal schema fields.
>> You will find many normal schemas containing data encoded into other field
>> types. You will also find this in SQL databases. If naming these fields
>> with a logical type causes us to throw up our hands and declare that they
>> can't be introspected, then our users will likely just avoid logical
>> types and encode this data into their own LONG and STRING fields. It will
>> also mean that SQL will not work well over Avro or Proto input.
>>
> Is the Avro/Proto connection really a problem? I would think that we could
> map most common Avro or Proto types to either primitive types or "standard"
> logical types which are the same as the ones used by SQL wherever possible,
> e.g. the NanosInstant and NanosDuration types added for proto.
>
>>
>> Reuven
>>
>> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <ap...@google.com>
>> wrote:
>>
>>> My understanding is that the base type is effectively the wire format of
>>> the type, where the logical type is the in-memory representation for Java.
>>> For org.joda.time.Instant, this is just a wrapper around the underlying
>>> Long. However, for the Date logical type, the LocalDate type uses a struct as
>>> the in-memory representation. There is a performance penalty to this
>>> conversion and I think there are cases where users (SQL for example) would
>>> want a fast path that skips the serialization/deserialization steps if
>>> possible.
>>>
>>> As for more complex logical types, I would urge caution in making
>>> assumptions around the base types. I expect there will be base types of
>>> strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL
>>> has a civil time module that packs a struct into a Long. All of these are
>>> things you can write SQL around, but I think the users might be somewhat
>>> surprised when they come out less than usable by default.
>>>
>>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>>
>>>>> When we created the portable representation of schemas last summer we
>>>>> intentionally did not include DATETIME as a primitive type [1], even though
>>>>> it already existed as a primitive type in Java [2]. There seemed to be
>>>>> consensus around a couple of points: 1) At the very least DATETIME is a
>>>>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>>>>> logical type is better suited for this purpose.
>>>>>
>>>>> Since then, it's been my intention to replace Java's DATETIME with a
>>>>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>>>>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>>>>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>>>>
>>>>> There could be a couple of concerns with this PR that I think would be
>>>>> better discussed here rather than on github.
>>>>>
>>>>>
>>>>> ## Breaking changes in low-level APIs
>>>>> The PR includes some breaking changes in public, low-level schema
>>>>> APIs: It removes DATETIME from the TypeName enum [4], and it will also
>>>>> change the type returned by Row#getBaseValue for DATETIME/MillisInstant
>>>>> fields (Long milliseconds since epoch rather than org.joda.time.Instant).
>>>>> Both of these changes have the potential to break user code. That being
>>>>> said I think the risk is justified for a few reasons:
>>>>> - These are lower-level APIs that we don't intend for most users to
>>>>> use. The preferred higher-level APIs (SQL, Schema transforms, and inferred
>>>>> schemas for user types) should seamlessly transition over to the new
>>>>> MillisInstant logical type.
>>>>> - Schemas are an experimental feature.
>>>>> - We can clearly document this breaking change in the release that
>>>>> includes it.
>>>>>
>>>>
>>>> I am a bit worried about this. Schemas have been marked experimental
>>>> for almost two years, and I've found that many Beam users are using them.
>>>> If we make a breaking change, can we make sure that it is extremely clear
>>>> to users how to fix their code? I'm afraid we are in the situation now
>>>> where the code is theoretically experimental but widely used in practice.
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> ## Mixing joda and java 8 time
>>>>> The NanosInstant logical type that Alex added uses java.time.Instant
>>>>> as it's input type, while my MillisInstant type uses org.joda.time.Instant
>>>>> for compatibility with the rest of Beam and the previous DATETIME primitive
>>>>> type. It feels weird, but it also makes a certain sort of sense to use joda
>>>>> time (which has millisecond precision) for MillisInstant, and java 8 time
>>>>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>>>>> significant effect on end users - the schema inference code could generate
>>>>> conversions between java 8 time and joda time (as we already do for
>>>>> converting between various joda time types [5]) so user types can use
>>>>> either one.
>>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>>
>>>>>
>>>>> ## Arbitrary Logical Types and SQL
>>>>> Previously much of the SQL code was written to operate on the _base
>>>>> type_ value for any logical types. So for the new MillisInstant type, SQL
>>>>> would attempt to operate on the underlying Long, rather than on a
>>>>> org.joda.time.Instant instance. Thus when I switched over to a
>>>>> MillisInstant logical type as the default for SQL date and time types any
>>>>> tests that used them began failing with ClassCastExceptions.
>>>>>
>>>>> My solution was just to update SQL code to only ever reference the
>>>>> input type (i.e. org.joda.time.Instant) for logical types. A good example
>>>>> of this type of change is in BeamAggregationRel [6].
>>>>>
>>>>> My change does pass all of the SQL tests (including internal Google
>>>>> tests), but I'm a little concerned that using the base type throughout SQL
>>>>> was a conscious decision that I'm stomping on. In theory, it could ensure
>>>>> that SqlTransform would be able to operate on user types that contain
>>>>> custom user logical types, without requiring SqlTransform to understand
>>>>> those types. This is just speculation though, there aren't actually any
>>>>> tests verifying that functionality. Also, I don't think any such tests
>>>>> would have worked since there are several places where we explicitly check
>>>>> for particular logical types (e.g. in BeamCalcRel [7]).
>>>>>
>>>>> Personally I'm ok with this move, because I'm against the idea of
>>>>> implicitly stripping away logical types like this in schema-aware
>>>>> transforms. A logical type indicates that the value has some additional
>>>>> meaning beyond just the base type (and maybe a restricted range), and I
>>>>> don't think transforms should be able to ignore that meaning unless the
>>>>> user explicitly allows it, or first converts to the base type themselves.
>>>>>
>>>>
>>>> I disagree with this personally:
>>>>
>>>> Arguably, _every_ schema has "additional meaning" beyond the fields,
>>>> and a LogicalType is just a nice way of naming that and allowing the use of
>>>> other classes to manage that type. For example, let's say you have a record
>>>> containing latitude and longitude fields - today you can easily write SQL
>>>> over these fields. Later you decide that this is a very common pattern in
>>>> your data and decide to create a Location logical type; this also lets you
>>>> represent this with a convenience Location class that has many other
>>>> helpful features (distance functions, builders, etc.). It seems unfortunate
>>>> if you suddenly lose the ability to easily run SQL over this type.
>>>> Introducing a logical type hasn't changed the semantic meaning of your data
>>>> at all. All it has done is let you name it, and it would be unfortunate if
>>>> analyzing your data stream suddenly became harder as a result.
>>>>
>>>
> I would argue that defining the logical type *has* changed the semantic
> meaning. Before you just had a tuple of floating point numbers that
> happened to have fields named lat and long. After defining a logical type
> you have a location, and lat and long have a restricted range.
>
>
>>>> Another reason I don't like this is that the pattern that we have
>>>> encouraged (and already use extensively) is to use logical types when
>>>> converting from external formats such as avro, proto, kryo, etc.  If
>>>> logical types aren't supported by SQL, then we are effectively saying that
>>>> data coming from these formats can't be easily analyzed with SQL (or SQL
>>>> will have to special case the logical types used by every single such input
>>>> format, which does not seem sustainable).
>>>>
>>>
>>>> I think the problem you ran into is that today an aggregation deals
>>>> with logical types in an all-or-nothing manner. You either aggregate
>>>> everything by the representation type or everything by the logical type. If
>>>> we instead allowed SQL to specify well-known logical types (e.g. Instant),
>>>> I think that this problem would go away.
>>>>
>>>>
>>>>>
>>>>>
>>>>> If anyone has any concerns with these points could you raise them
>>>>> here? Otherwise I'd like to move ahead with the PR.
>>>>> (Sorry this message turned out to be a bit of a novel. LMK if it would
>>>>> be better served by separate threads, or maybe a doc.)
>>>>>
>>>>> Thanks!
>>>>> Brian
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>>> [4]
>>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>>> [5]
>>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>>> [6]
>>>>> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>>> [7]
>>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>>>>>
>>>>

Re: Removing DATETIME in Java Schemas

Posted by Brian Hulette <bh...@google.com>.
After thinking about this more I've softened on it some, but I'm still a
little wary. I like Kenn's suggestion:

> - it is OK to convert known logical types like MillisInstant or
NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL
TIME ZONE"
> - it is OK to convert unknown logical types into their representation type
> - coming out the other side of a SQL statement, a user should expect the
type to be a row where every field has a type from the SQL dialect; the
input types are gone

I think what you're proposing is basically a preprocessing step on
SqlTransform that demotes any unrecognized logical types to their
representation. Is that a fair characterization?
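
Concretely, I'm imagining something like this (a rough sketch, not code from
the PR; demoteLogicalTypes is a made-up name, and it ignores nested rows,
arrays, and maps for brevity):

    import org.apache.beam.sdk.schemas.Schema;

    // Rebuild the schema with every logical type replaced by its
    // representation ("base") type before handing it to SqlTransform.
    static Schema demoteLogicalTypes(Schema schema) {
      Schema.Builder builder = Schema.builder();
      for (Schema.Field field : schema.getFields()) {
        Schema.FieldType type = field.getType();
        if (type.getTypeName().isLogicalType()) {
          builder.addField(field.getName(), type.getLogicalType().getBaseType());
        } else {
          builder.addField(field.getName(), type);
        }
      }
      return builder.build();
    }

The values themselves would be demoted the same way, via Row#getBaseValue.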

I wonder if we could push the demotion further into the transform to enable
some additional uses. For example "SELECT * FROM PCOLL WHERE ..." - if some
logical type SQL knows nothing about is pulled into the SELECT *, it seems
reasonable (from a user perspective) to just pass the type through
untouched, and automatically demoting it might be surprising.

What if we had the ability to implicitly convert logical types to their
representation? Then the above example would work, because no implicit
conversion is necessary, and also relations that introspect the logical
type (e.g. projecting f_location.lat) could work by including an implicit
conversion. The implicit conversion would still be a one-way operation, but
it would only happen when necessary.
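
To illustrate with hypothetical behavior (rows is a PCollection<Row>, and
f_id, f_location, and its logical type are made up for the example):

    // No introspection of the logical type, so no implicit conversion:
    // f_location passes through untouched.
    rows.apply(SqlTransform.query(
        "SELECT * FROM PCOLLECTION WHERE f_id > 0"));

    // This one introspects the logical type, so f_location would be
    // implicitly (and one-way) converted to its representation row first.
    rows.apply(SqlTransform.query(
        "SELECT f_location.lat FROM PCOLLECTION"));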

I really have no idea how complicated this would be to implement. Would it
be feasible? Is it ill-advised?

Brian

On Fri, May 15, 2020 at 2:43 PM Reuven Lax <re...@google.com> wrote:

> I would not describe the base type as the "wire type." If that were true,
> then the only base type we should support would be byte array.
>
> My simple point is that this is no different than normal schema fields.
> You will find many normal schemas containing data encoded into other field
> types. You will also find this in SQL databases. If naming these fields
> with a logical type causes us to throw up our hands and declare that they
> can't be introspected, then our users will likely just avoid logical
> types and encode this data into their own LONG and STRING fields. It will
> also mean that SQL will not work well over Avro or Proto input.
>
Is the Avro/Proto connection really a problem? I would think that we could
map most common Avro or Proto types to either primitive types or "standard"
logical types which are the same as the ones used by SQL wherever possible,
e.g. the NanosInstant and NanosDuration types added for proto.
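
i.e. a mapping along these lines (a sketch of the idea, not the actual
translator code):

    import org.apache.beam.sdk.schemas.Schema.FieldType;
    import org.apache.beam.sdk.schemas.logicaltypes.NanosDuration;
    import org.apache.beam.sdk.schemas.logicaltypes.NanosInstant;

    // Map the proto well-known types onto the standard logical types, so
    // SQL only has to understand the logical types, not every format.
    FieldType timestampFieldType = FieldType.logicalType(new NanosInstant());
    FieldType durationFieldType = FieldType.logicalType(new NanosDuration());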

>
> Reuven
>
> On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> My understanding is that the base type is effectively the wire format of
>> the type, where the logical type is the in-memory representation for Java.
>> For org.joda.time.Instant, this is just a wrapper around the underlying
>> Long. However, for the Date logical type, the LocalDate type uses a struct as
>> the in-memory representation. There is a performance penalty to this
>> conversion and I think there are cases where users (SQL for example) would
>> want a fast path that skips the serialization/deserialization steps if
>> possible.
>>
>> As for more complex logical types, I would urge caution in making
>> assumptions around the base types. I expect there will be base types of
>> strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL
>> has a civil time module that packs a struct into a Long. All of these are
>> things you can write SQL around, but I think the users might be somewhat
>> surprised when they come out less than usable by default.
>>
>> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>>
>>>
>>>
>>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com>
>>> wrote:
>>>
>>>> When we created the portable representation of schemas last summer we
>>>> intentionally did not include DATETIME as a primitive type [1], even though
>>>> it already existed as a primitive type in Java [2]. There seemed to be
>>>> consensus around a couple of points: 1) At the very least DATETIME is a
>>>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>>>> logical type is better suited for this purpose.
>>>>
>>>> Since then, it's been my intention to replace Java's DATETIME with a
>>>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>>>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>>>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>>>
>>>> There could be a couple of concerns with this PR that I think would be
>>>> better discussed here rather than on github.
>>>>
>>>>
>>>> ## Breaking changes in low-level APIs
>>>> The PR includes some breaking changes in public, low-level schema APIs:
>>>> It removes DATETIME from the TypeName enum [4], and it will also change the
>>>> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
>>>> milliseconds since epoch rather than org.joda.time.Instant). Both of these
>>>> changes have the potential to break user code. That being said I think the
>>>> risk is justified for a few reasons:
>>>> - These are lower-level APIs that we don't intend for most users to
>>>> use. The preferred higher-level APIs (SQL, Schema transforms, and inferred
>>>> schemas for user types) should seamlessly transition over to the new
>>>> MillisInstant logical type.
>>>> - Schemas are an experimental feature.
>>>> - We can clearly document this breaking change in the release that
>>>> includes it.
>>>>
>>>
>>> I am a bit worried about this. Schemas have been marked experimental
>>> for almost two years, and I've found that many Beam users are using them.
>>> If we make a breaking change, can we make sure that it is extremely clear
>>> to users how to fix their code? I'm afraid we are in the situation now
>>> where the code is theoretically experimental but widely used in practice.
>>>
>>>
>>>
>>>>
>>>>
>>>> ## Mixing joda and java 8 time
>>>> The NanosInstant logical type that Alex added uses java.time.Instant as
>>>> it's input type, while my MillisInstant type uses org.joda.time.Instant for
>>>> compatibility with the rest of Beam and the previous DATETIME primitive
>>>> type. It feels weird, but it also makes a certain sort of sense to use joda
>>>> time (which has millisecond precision) for MillisInstant, and java 8 time
>>>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>>>> significant effect on end users - the schema inference code could generate
>>>> conversions between java 8 time and joda time (as we already do for
>>>> converting between various joda time types [5]) so user types can use
>>>> either one.
>>>>
>>>
>>> +1
>>>
>>>
>>>>
>>>>
>>>> ## Arbitrary Logical Types and SQL
>>>> Previously much of the SQL code was written to operate on the _base
>>>> type_ value for any logical types. So for the new MillisInstant type, SQL
>>>> would attempt to operate on the underlying Long, rather than on a
>>>> org.joda.time.Instant instance. Thus when I switched over to a
>>>> MillisInstant logical type as the default for SQL date and time types any
>>>> tests that used them began failing with ClassCastExceptions.
>>>>
>>>> My solution was just to update SQL code to only ever reference the
>>>> input type (i.e. org.joda.time.Instant) for logical types. A good example
>>>> of this type of change is in BeamAggregationRel [6].
>>>>
>>>> My change does pass all of the SQL tests (including internal Google
>>>> tests), but I'm a little concerned that using the base type throughout SQL
>>>> was a conscious decision that I'm stomping on. In theory, it could ensure
>>>> that SqlTransform would be able to operate on user types that contain
>>>> custom user logical types, without requiring SqlTransform to understand
>>>> those types. This is just speculation though, there aren't actually any
>>>> tests verifying that functionality. Also, I don't think any such tests
>>>> would have worked since there are several places where we explicitly check
>>>> for particular logical types (e.g. in BeamCalcRel [7]).
>>>>
>>>> Personally I'm ok with this move, because I'm against the idea of
>>>> implicitly stripping away logical types like this in schema-aware
>>>> transforms. A logical type indicates that the value has some additional
>>>> meaning beyond just the base type (and maybe a restricted range), and I
>>>> don't think transforms should be able to ignore that meaning unless the
>>>> user explicitly allows it, or first converts to the base type themselves.
>>>>
>>>
>>> I disagree with this personally:
>>>
>>> Arguably, _every_ schema has "additional meaning" beyond the fields, and
>>> a LogicalType is just a nice way of naming that and allowing the use of
>>> other classes to manage that type. For example, let's say you have a record
>>> containing latitude and longitude fields - today you can easily write SQL
>>> over these fields. Later you decide that this is a very common pattern in
>>> your data and decide to create a Location logical type; this also lets you
>>> represent this with a convenience Location class that has many other
>>> helpful features (distance functions, builders, etc.). It seems unfortunate
>>> if you suddenly lose the ability to easily run SQL over this type.
>>> Introducing a logical type hasn't changed the semantic meaning of your data
>>> at all. All it has done is let you name it, and it would be unfortunate if
>>> analyzing your data stream suddenly became harder as a result.
>>>
>>
I would argue that defining the logical type *has* changed the semantic
meaning. Before you just had a tuple of floating point numbers that
happened to have fields named lat and long. After defining a logical type
you have a location, and lat and long have a restricted range.


>>> Another reason I don't like this is that the pattern that we have
>>> encouraged (and already use extensively) is to use logical types when
>>> converting from external formats such as avro, proto, kryo, etc.  If
>>> logical types aren't supported by SQL, then we are effectively saying that
>>> data coming from these formats can't be easily analyzed with SQL (or SQL
>>> will have to special case the logical types used by every single such input
>>> format, which does not seem sustainable).
>>>
>>
>>> I think the problem you ran into is that today an aggregation deals with
>>> logical types in an all-or-nothing manner. You either aggregate everything
>>> by the representation type or everything by the logical type. If we instead
>>> allowed SQL to specify well-known logical types (e.g. Instant), I think
>>> that this problem would go away.
>>>
>>>
>>>>
>>>>
>>>> If anyone has any concerns with these points could you raise them here?
>>>> Otherwise I'd like to move ahead with the PR.
>>>> (Sorry this message turned out to be a bit of a novel. LMK if it would
>>>> be better served by separate threads, or maybe a doc.)
>>>>
>>>> Thanks!
>>>> Brian
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>>> [2]
>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>>> [4]
>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>>> [5]
>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>>> [6]
>>>> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>>> [7]
>>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>>>>
>>>

Re: Removing DATETIME in Java Schemas

Posted by Reuven Lax <re...@google.com>.
I would not describe the base type as the "wire type." If that were true,
then the only base type we should support would be byte array.

My simple point is that this is no different than normal schema fields. You
will find many normal schemas containing data encoded into other field
types. You will also find this in SQL databases. If naming these fields
with a logical type causes us to throw up our hands and declare that they
can't be introspected, then our users will likely just avoid logical
types and encode this data into their own LONG and STRING fields. It will
also mean that SQL will not work well over Avro or Proto input.

Reuven

On Fri, May 15, 2020 at 2:19 PM Andrew Pilloud <ap...@google.com> wrote:

> My understanding is that the base type is effectively the wire format of
> the type, where the logical type is the in-memory representation for Java.
> For org.joda.time.Instant, this is just a wrapper around the underlying
> Long. However, for the Date logical type, the LocalDate type uses a struct as
> the in-memory representation. There is a performance penalty to this
> conversion and I think there are cases where users (SQL for example) would
> want a fast path that skips the serialization/deserialization steps if
> possible.
>
> As for more complex logical types, I would urge caution in making
> assumptions around the base types. I expect there will be base types of
> strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL
> has a civil time module that packs a struct into a Long. All of these are
> things you can write SQL around, but I think the users might be somewhat
> surprised when they come out less than usable by default.
>
> On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:
>
>>
>>
>> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>> When we created the portable representation of schemas last summer we
>>> intentionally did not include DATETIME as a primitive type [1], even though
>>> it already existed as a primitive type in Java [2]. There seemed to be
>>> consensus around a couple of points: 1) At the very least DATETIME is a
>>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>>> logical type is better suited for this purpose.
>>>
>>> Since then, it's been my intention to replace Java's DATETIME with a
>>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>>
>>> There could be a couple of concerns with this PR that I think would be
>>> better discussed here rather than on github.
>>>
>>>
>>> ## Breaking changes in low-level APIs
>>> The PR includes some breaking changes in public, low-level schema APIs:
>>> It removes DATETIME from the TypeName enum [4], and it will also change the
>>> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
>>> milliseconds since epoch rather than org.joda.time.Instant). Both of these
>>> changes have the potential to break user code. That being said I think the
>>> risk is justified for a few reasons:
>>> - These are lower-level APIs that we don't intend for most users to use.
>>> The preferred higher-level APIs (SQL, Schema transforms, and inferred
>>> schemas for user types) should seamlessly transition over to the new
>>> MillisInstant logical type.
>>> - Schemas are an experimental feature.
>>> - We can clearly document this breaking change in the release that
>>> includes it.
>>>
>>
>> I am a bit worried about this. Schemas have been marked experimental for
>> almost two years, and I've found that many Beam users are using them. If we
>> make a breaking change, can we make sure that it is extremely clear to
>> users how to fix their code? I'm afraid we are in the situation now where
>> the code is theoretically experimental but widely used in practice.
>>
>>
>>
>>>
>>>
>>> ## Mixing joda and java 8 time
>>> The NanosInstant logical type that Alex added uses java.time.Instant as
>>> it's input type, while my MillisInstant type uses org.joda.time.Instant for
>>> compatibility with the rest of Beam and the previous DATETIME primitive
>>> type. It feels weird, but it also makes a certain sort of sense to use joda
>>> time (which has millisecond precision) for MillisInstant, and java 8 time
>>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>>> significant effect on end users - the schema inference code could generate
>>> conversions between java 8 time and joda time (as we already do for
>>> converting between various joda time types [5]) so user types can use
>>> either one.
>>>
>>
>> +1
>>
>>
>>>
>>>
>>> ## Arbitrary Logical Types and SQL
>>> Previously much of the SQL code was written to operate on the _base
>>> type_ value for any logical types. So for the new MillisInstant type, SQL
>>> would attempt to operate on the underlying Long, rather than on a
>>> org.joda.time.Instant instance. Thus when I switched over to a
>>> MillisInstant logical type as the default for SQL date and time types any
>>> tests that used them began failing with ClassCastExceptions.
>>>
>>> My solution was just to update SQL code to only ever reference the input
>>> type (i.e. org.joda.time.Instant) for logical types. A good example of this
>>> type of change is in BeamAggregationRel [6].
>>>
>>> My change does pass all of the SQL tests (including internal Google
>>> tests), but I'm a little concerned that using the base type throughout SQL
>>> was a conscious decision that I'm stomping on. In theory, it could ensure
>>> that SqlTransform would be able to operate on user types that contain
>>> custom user logical types, without requiring SqlTransform to understand
>>> those types. This is just speculation though, there aren't actually any
>>> tests verifying that functionality. Also, I don't think any such tests
>>> would have worked since there are several places where we explicitly check
>>> for particular logical types (e.g. in BeamCalcRel [7]).
>>>
>>> Personally I'm ok with this move, because I'm against the idea of
>>> implicitly stripping away logical types like this in schema-aware
>>> transforms. A logical type indicates that the value has some additional
>>> meaning beyond just the base type (and maybe a restricted range), and I
>>> don't think transforms should be able to ignore that meaning unless the
>>> user explicitly allows it, or first converts to the base type themselves.
>>>
>>
>> I disagree with this personally:
>>
>> Arguably, _every_ schema has "additional meaning" beyond the fields, and
>> a LogicalType is just a nice way of naming that and allowing the use of
>> other classes to manage that type. For example, let's say you have a record
>> containing latitude and longitude fields - today you can easily write SQL
>> over these fields. Later you decide that this is a very common pattern in
>> your data and decide to create a Location logical type; this also lets you
>> represent this with a convenience Location class that has many other
>> helpful features (distance functions, builders, etc.). It seems unfortunate
>> if you suddenly lose the ability to easily run SQL over this type.
>> Introducing a logical type hasn't changed the semantic meaning of your data
>> at all. All it has done is let you name it, and it would be unfortunate if
>> analyzing your data stream suddenly became harder as a result.
>>
>> Another reason I don't like this is that the pattern that we have
>> encouraged (and already use extensively) is to use logical types when
>> converting from external formats such as avro, proto, kryo, etc.  If
>> logical types aren't supported by SQL, then we are effectively saying that
>> data coming from these formats can't be easily analyzed with SQL (or SQL
>> will have to special case the logical types used by every single such input
>> format, which does not seem sustainable).
>>
>> I think the problem you ran into is that today an aggregation deals with
>> logical types in an all-or-nothing manner. You either aggregate everything
>> by the representation type or everything by the logical type. If we instead
>> allowed SQL to specify well-known logical types (e.g. Instant), I think
>> that this problem would go away.
>>
>>
>>>
>>>
>>> If anyone has any concerns with these points could you raise them here?
>>> Otherwise I'd like to move ahead with the PR.
>>> (Sorry this message turned out to be a bit of a novel. LMK if it would
>>> be better served by separate threads, or maybe a doc.)
>>>
>>> Thanks!
>>> Brian
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>>> [2]
>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>>> [4]
>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>>> [5]
>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>>> [6]
>>> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>>> [7]
>>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>>>
>>

Re: Removing DATETIME in Java Schemas

Posted by Andrew Pilloud <ap...@google.com>.
My understanding is that the base type is effectively the wire format of
the type, where the logical type is the in-memory representation for Java.
For org.joda.time.Instant, this is just a wrapper around the underlying
Long. However, for the Date logical type, the LocalDate type uses a struct as
the in-memory representation. There is a performance penalty to this
conversion and I think there are cases where users (SQL for example) would
want a fast path that skips the serialization/deserialization steps if
possible.
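
For example (assuming a field f_date with the Date logical type, and the
existing Row#getBaseValue / Row#getLogicalTypeValue accessors):

    // Fast path: read the representation value directly, skipping the
    // base-to-input conversion.
    Long epochDays = row.getBaseValue("f_date", Long.class);

    // Slower path: materialize the in-memory type, paying for the
    // conversion on every access.
    java.time.LocalDate date =
        row.getLogicalTypeValue("f_date", java.time.LocalDate.class);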

As for more complex logical types, I would urge caution in making
assumptions around the base types. I expect there will be base types of
strings containing JSON or byte arrays containing protobuf or avro. ZetaSQL
has a civil time module that packs a struct into a Long. All of these are
things you can write SQL around, but I think the users might be somewhat
surprised when they come out less than usable by default.
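
The civil time case is roughly this kind of encoding (an illustrative bit
layout, not ZetaSQL's actual one):

    // Pack a civil datetime into a single long: seconds in the low 6 bits,
    // then minutes, hours, day, month, year. The packed values sort
    // correctly, but they are not usable as timestamps without unpacking.
    static long packCivilTime(int year, int month, int day,
                              int hour, int minute, int second) {
      return ((long) year << 26)
          | ((long) month << 22)
          | ((long) day << 17)
          | ((long) hour << 12)
          | ((long) minute << 6)
          | (long) second;
    }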

On Fri, May 15, 2020 at 2:12 PM Reuven Lax <re...@google.com> wrote:

>
>
> On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com>
> wrote:
>
>> When we created the portable representation of schemas last summer we
>> intentionally did not include DATETIME as a primitive type [1], even though
>> it already existed as a primitive type in Java [2]. There seemed to be
>> consensus around a couple of points: 1) At the very least DATETIME is a
>> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
>> logical type is better suited for this purpose.
>>
>> Since then, it's been my intention to replace Java's DATETIME with a
>> MillisInstant logical type backed by Long milliseconds since the epoch (see
>> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
>> keeps tests passing): https://github.com/apache/beam/pull/11456
>>
>> There could be a couple of concerns with this PR that I think would be
>> better discussed here rather than on github.
>>
>>
>> ## Breaking changes in low-level APIs
>> The PR includes some breaking changes in public, low-level schema APIs:
>> It removes DATETIME from the TypeName enum [4], and it will also change the
>> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
>> milliseconds since epoch rather than org.joda.time.Instant). Both of these
>> changes have the potential to break user code. That being said I think the
>> risk is justified for a few reasons:
>> - These are lower-level APIs that we don't intend for most users to use.
>> The preferred higher-level APIs (SQL, Schema transforms, and inferred
>> schemas for user types) should seamlessly transition over to the new
>> MillisInstant logical type.
>> - Schemas are an experimental feature.
>> - We can clearly document this breaking change in the release that
>> includes it.
>>
>
> I am a bit worried about this. Schemas have been marked experimental for
> almost two years, and I've found that many Beam users are using them. If we
> make a breaking change, can we make sure that it is extremely clear to
> users how to fix their code? I'm afraid we are in the situation now where
> the code is theoretically experimental but widely used in practice.
>
>
>
>>
>>
>> ## Mixing joda and java 8 time
>> The NanosInstant logical type that Alex added uses java.time.Instant as
>> it's input type, while my MillisInstant type uses org.joda.time.Instant for
>> compatibility with the rest of Beam and the previous DATETIME primitive
>> type. It feels weird, but it also makes a certain sort of sense to use joda
>> time (which has millisecond precision) for MillisInstant, and java 8 time
>> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
>> significant effect on end users - the schema inference code could generate
>> conversions between java 8 time and joda time (as we already do for
>> converting between various joda time types [5]) so user types can use
>> either one.
>>
>
> +1
>
>
>>
>>
>> ## Arbitrary Logical Types and SQL
>> Previously much of the SQL code was written to operate on the _base type_
>> value for any logical types. So for the new MillisInstant type, SQL would
>> attempt to operate on the underlying Long, rather than on a
>> org.joda.time.Instant instance. Thus when I switched over to a
>> MillisInstant logical type as the default for SQL date and time types any
>> tests that used them began failing with ClassCastExceptions.
>>
>> My solution was just to update SQL code to only ever reference the input
>> type (i.e. org.joda.time.Instant) for logical types. A good example of this
>> type of change is in BeamAggregationRel [6].
>>
>> My change does pass all of the SQL tests (including internal Google
>> tests), but I'm a little concerned that using the base type throughout SQL
>> was a conscious decision that I'm stomping on. In theory, it could ensure
>> that SqlTransform would be able to operate on user types that contain
>> custom user logical types, without requiring SqlTransform to understand
>> those types. This is just speculation though, there aren't actually any
>> tests verifying that functionality. Also, I don't think any such tests
>> would have worked since there are several places where we explicitly check
>> for particular logical types (e.g. in BeamCalcRel [7]).
>>
>> Personally I'm ok with this move, because I'm against the idea of
>> implicitly stripping away logical types like this in schema-aware
>> transforms. A logical type indicates that the value has some additional
>> meaning beyond just the base type (and maybe a restricted range), and I
>> don't think transforms should be able to ignore that meaning unless the
>> user explicitly allows it, or first converts to the base type themselves.
>>
>
> I disagree with this personally:
>
> Arguably, _every_ schema has "additional meaning" beyond the fields, and a
> LogicalType is just a nice way of naming that and allowing the use of other
> classes to manage that type. For example, let's say you have a record
> containing latitude and longitude fields - today you can easily write SQL
> over these fields. Later you decide that this is a very common pattern in
> your data and decide to create a Location logical type; this also lets you
> represent this with a convenience Location class that has many other
> helpful features (distance functions, builders, etc.). It seems unfortunate
> if you suddenly lose the ability to easily run SQL over this type.
> Introducing a logical type hasn't changed the semantic meaning of your data
> at all. All it has done is let you name it, and it would be unfortunate if
> analyzing your data stream suddenly became harder as a result.
>
> Another reason I don't like this is that the pattern that we have
> encouraged (and already use extensively) is to use logical types when
> converting from external formats such as avro, proto, kryo, etc.  If
> logical types aren't supported by SQL, then we are effectively saying that
> data coming from these formats can't be easily analyzed with SQL (or SQL
> will have to special case the logical types used by every single such input
> format, which does not seem sustainable).
>
> I think the problem you ran into is that today an aggregation deals with
> logical types in an all-or-nothing manner. You either aggregate everything
> by the representation type or everything by the logical type. If we instead
> allowed SQL to specify well-known logical types (e.g. Instant), I think
> that this problem would go away.
>
>
>>
>>
>> If anyone has any concerns with these points could you raise them here?
>> Otherwise I'd like to move ahead with the PR.
>> (Sorry this message turned out to be a bit of a novel. LMK if it would be
>> better served by separate threads, or maybe a doc.)
>>
>> Thanks!
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
>> [2]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
>> [3] https://issues.apache.org/jira/browse/BEAM-7554
>> [4]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
>> [5]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
>> [6]
>> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
>> [7]
>> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>>
>

Re: Removing DATETIME in Java Schemas

Posted by Reuven Lax <re...@google.com>.
On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette <bh...@google.com> wrote:

> When we created the portable representation of schemas last summer we
> intentionally did not include DATETIME as a primitive type [1], even though
> it already existed as a primitive type in Java [2]. There seemed to be
> consensus around a couple of points: 1) At the very least DATETIME is a
> poor name and it should be renamed to INSTANT or TIMESTAMP, and 2) a
> logical type is better suited for this purpose.
>
> Since then, it's been my intention to replace Java's DATETIME with a
> MillisInstant logical type backed by Long milliseconds since the epoch (see
> BEAM-7554 [3]). I've finally managed to get a PR together that does so (and
> keeps tests passing): https://github.com/apache/beam/pull/11456
>
> There could be a couple of concerns with this PR that I think would be
> better discussed here rather than on github.
>
>
> ## Breaking changes in low-level APIs
> The PR includes some breaking changes in public, low-level schema APIs: It
> removes DATETIME from the TypeName enum [4], and it will also change the
> type returned by Row#getBaseValue for DATETIME/MillisInstant fields (Long
> milliseconds since epoch rather than org.joda.time.Instant). Both of these
> changes have the potential to break user code. That being said I think the
> risk is justified for a few reasons:
> - These are lower-level APIs that we don't intend for most users to use.
> The preferred higher-level APIs (SQL, Schema transforms, and inferred
> schemas for user types) should seamlessly transition over to the new
> MillisInstant logical type.
> - Schemas are an experimental feature.
> - We can clearly document this breaking change in the release that
> includes it.
>

I am a bit worried about this. Schemas have been marked experimental for
almost two years, and I've found that many Beam users are using them. If we
make a breaking change, can we make sure that it is extremely clear to
users how to fix their code? I'm afraid we are in the situation now where
the code is theoretically experimental but widely used in practice.



>
>
> ## Mixing joda and java 8 time
> The NanosInstant logical type that Alex added uses java.time.Instant as
> it's input type, while my MillisInstant type uses org.joda.time.Instant for
> compatibility with the rest of Beam and the previous DATETIME primitive
> type. It feels weird, but it also makes a certain sort of sense to use joda
> time (which has millisecond precision) for MillisInstant, and java 8 time
> (which has nanos) for NanosInstant. Also, the choice shouldn't have a
> significant effect on end users - the schema inference code could generate
> conversions between java 8 time and joda time (as we already do for
> converting between various joda time types [5]) so user types can use
> either one.
>

+1


>
>
> ## Arbitrary Logical Types and SQL
> Previously much of the SQL code was written to operate on the _base type_
> value for any logical types. So for the new MillisInstant type, SQL would
> attempt to operate on the underlying Long, rather than on a
> org.joda.time.Instant instance. Thus when I switched over to a
> MillisInstant logical type as the default for SQL date and time types any
> tests that used them began failing with ClassCastExceptions.
>
> My solution was just to update SQL code to only ever reference the input
> type (i.e. org.joda.time.Instant) for logical types. A good example of this
> type of change is in BeamAggregationRel [6].
>
> My change does pass all of the SQL tests (including internal Google
> tests), but I'm a little concerned that using the base type throughout SQL
> was a conscious decision that I'm stomping on. In theory, it could ensure
> that SqlTransform would be able to operate on user types that contain
> custom user logical types, without requiring SqlTransform to understand
> those types. This is just speculation though, there aren't actually any
> tests verifying that functionality. Also, I don't think any such tests
> would have worked since there are several places where we explicitly check
> for particular logical types (e.g. in BeamCalcRel [7]).
>
> Personally I'm ok with this move, because I'm against the idea of
> implicitly stripping away logical types like this in schema-aware
> transforms. A logical type indicates that the value has some additional
> meaning beyond just the base type (and maybe a restricted range), and I
> don't think transforms should be able to ignore that meaning unless the
> user explicitly allows it, or first converts to the base type themselves.
>

I disagree with this personally:

Arguably, _every_ schema has "additional meaning" beyond the fields, and a
LogicalType is just a nice way of naming that and allowing the use of other
classes to manage that type. For example, let's say you have a record
containing latitude and longitude fields - today you can easily write SQL
over these fields. Later you decide that this is a very common pattern in
your data and decide to create a Location logical type; this also lets you
represent this with a convenience Location class that has many other
helpful features (distance functions, builders, etc.). It seems unfortunate
if you suddenly lose the ability to easily run SQL over this type.
Introducing a logical type hasn't changed the semantic meaning of your data
at all. All it has done is let you name it, and it would be unfortunate if
analyzing your data stream suddenly became harder as a result.
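
To make the example concrete (the Location class, the schema, and the
conversion pair a logical type would supply are all hypothetical):

    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    final class Location {
      final double lat;
      final double lng;
      Location(double lat, double lng) { this.lat = lat; this.lng = lng; }
      // distance functions, builders, etc. would go here
    }

    static final Schema LAT_LNG =
        Schema.builder().addDoubleField("lat").addDoubleField("lng").build();

    // The representation is exactly the row you had before; the logical
    // type only names it and attaches the richer Java class.
    static Row toBase(Location location) {
      return Row.withSchema(LAT_LNG).addValues(location.lat, location.lng).build();
    }

    static Location toInput(Row base) {
      return new Location(base.getDouble("lat"), base.getDouble("lng"));
    }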

Another reason I don't like this is that the pattern that we have
encouraged (and already use extensively) is to use logical types when
converting from external formats such as avro, proto, kryo, etc.  If
logical types aren't supported by SQL, then we are effectively saying that
data coming from these formats can't be easily analyzed with SQL (or SQL
will have to special case the logical types used by every single such input
format, which does not seem sustainable).

I think the problem you ran into is that today an aggregation deals with
logical types in an all-or-nothing manner. You either aggregate everything
by the representation type or everything by the logical type. If we instead
allowed SQL to specify well-known logical types (e.g. Instant), I think
that this problem would go away.
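
Something along these lines (a sketch; WELL_KNOWN is a made-up set, and the
identifier string is only for illustration):

    import java.util.Collections;
    import java.util.Set;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.values.Row;

    static final Set<String> WELL_KNOWN =
        Collections.singleton("beam:logical_type:millis_instant:v1");

    // Per field rather than all-or-nothing: well-known logical types are
    // surfaced as their logical (input) value, everything else by its
    // representation value.
    static Object valueForSql(Row row, int i) {
      Schema.FieldType type = row.getSchema().getField(i).getType();
      if (type.getTypeName().isLogicalType()
          && WELL_KNOWN.contains(type.getLogicalType().getIdentifier())) {
        return row.getValue(i);
      }
      return row.getBaseValue(i);
    }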


>
>
> If anyone has any concerns with these points could you raise them here?
> Otherwise I'd like to move ahead with the PR.
> (Sorry this message turned out to be a bit of a novel. LMK if it would be
> better served by separate threads, or maybe a doc.)
>
> Thanks!
> Brian
>
> [1]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/model/pipeline/src/main/proto/schema.proto#L58
> [2]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L682
> [3] https://issues.apache.org/jira/browse/BEAM-7554
> [4]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L422
> [5]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/ByteBuddyUtils.java#L1155-L1158
> [6]
> https://github.com/apache/beam/pull/11456/files#diff-98d3c841460c6d59f281c40040260f2d
> [7]
> https://github.com/apache/beam/blob/ef267d97a7cc6bb1bece30af940ee9d8e487f963/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamCalcRel.java#L409
>