Posted to dev@beam.apache.org by Rui Wang <ru...@google.com> on 2018/11/06 05:21:30 UTC

[DISCUSS] More precision supported by DATETIME field in Schema

Hi Community,

The DATETIME field in Beam Schema/Row is implemented with Joda's DateTime
(see Row.java#L611
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L611>
 and Row.java#L169
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L169>).
Joda's DateTime is limited to millisecond precision. That is precise enough
to represent event-time timestamps, but it is not enough for real "time"
data. For a "time" type we probably need to support precision up to the
nanosecond.

Unfortunately, Joda has decided to keep millisecond precision:
https://github.com/JodaOrg/joda-time/issues/139.
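
For illustration, a minimal, self-contained sketch (not Beam code; class name
and sample values are made up) of the difference: java.time.Instant carries a
nanosecond-of-second alongside seconds-since-epoch, while Joda's DateTime only
stores milliseconds since the epoch.

  import java.time.Instant;
  import org.joda.time.DateTime;

  public class PrecisionDemo {
    public static void main(String[] args) {
      // java.time.Instant: long seconds-since-epoch plus int nanosecond-of-second.
      Instant fine = Instant.ofEpochSecond(1541480490L, 123_456_789L);
      System.out.println(fine.getNano());             // 123456789

      // Joda-Time DateTime is backed by a single millisecond instant, so the
      // sub-millisecond part is unavoidably dropped on conversion.
      DateTime coarse = new DateTime(fine.toEpochMilli());
      System.out.println(coarse.getMillisOfSecond()); // 123
    }
  }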

If we want to support the precision of nanosecond, we could have two
options:

Option one: utilize the current FieldType's metadata field
<https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L421>,
so that we could set something in the metadata and Row could check it to
decide what is stored in a DATETIME field: Joda's DateTime or an
implementation that supports nanoseconds.
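
To make option one concrete, a purely illustrative sketch (hypothetical names,
not the actual FieldType/Row API) of checking per-field metadata to decide
which representation a DATETIME value uses:

  import java.time.Instant;
  import org.joda.time.DateTime;

  class DateTimeFieldSketch {
    // Hypothetical metadata key; the real mechanism would live in Schema/Row.
    static final String PRECISION_KEY = "datetime.precision";

    static Object unwrap(java.util.Map<String, String> fieldMetadata, Object stored) {
      if ("nanos".equals(fieldMetadata.get(PRECISION_KEY))) {
        return (Instant) stored;   // nanosecond-capable representation
      }
      return (DateTime) stored;    // legacy Joda millisecond representation
    }
  }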

Option two: add another field type (maybe called TIMESTAMP?) whose
implementation supports higher-precision time.

What do you think about the need for higher precision in the time type, and
which option do you prefer?

-Rui

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Rui Wang <ru...@google.com>.
Hi Charles,

It's only for the Beam Schema DATETIME field.

-Rui

On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:

> Is the proposal to do this for both Beam Schema DATETIME fields as well as
> for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
>
> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> +1 to more precision even to the nano level, probably via Reuven's
>> proposal of a different internal representation.
>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >
>> > +1 to offering more granular timestamps in general. I think it will be
>> > odd if setting the element timestamp from a row DATETIME field is
>> > lossy, so we should seriously consider upgrading that as well.
>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>> > >
>> > > One related issue that came up before is that we (perhaps
>> unnecessarily) restrict the precision of timestamps in the Python SDK to
>> milliseconds because of legacy reasons related to the Java runner's use of
>> Joda time.  Perhaps Beam portability should natively use a more granular
>> timestamp unit.
>> > >
>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>> > >>
>> > >> Thanks Reuven!
>> > >>
>> > >> I think Reuven gives the third option:
>> > >>
>> > >> Change internal representation of DATETIME field in Row. Still keep
>> public ReadableDateTime getDateTime(String fieldName) API to be compatible
>> with existing code. And I think we could add one more API to
>> getDataTimeNanosecond. This option is different from the option one because
>> option one actually maintains two implementation of time.
>> > >>
>> > >> -Rui
>> > >>
>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>> > >>>
>> > >>> I would vote that we change the internal representation of Row to
>> something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>> > >>>
>> > >>> We should still keep accessor methods that return and take Joda
>> objects, as the rest of Beam still depends on Joda.
>> > >>>
>> > >>> Reuven
>> > >>>
>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>> > >>>>
>> > >>>> Hi Community,
>> > >>>>
>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
>> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
>> to the precision of millisecond. It has good enough precision to represent
>> timestamp of event time, but it is not enough for the real "time" data. For
>> the "time" type data, we probably need to support even up to the precision
>> of nanosecond.
>> > >>>>
>> > >>>> Unfortunately, Joda decided to keep the precision of millisecond:
>> https://github.com/JodaOrg/joda-time/issues/139.
>> > >>>>
>> > >>>> If we want to support the precision of nanosecond, we could have
>> two options:
>> > >>>>
>> > >>>> Option one: utilize current FieldType's metadata field, such that
>> we could set something into meta data and Row could check the metadata to
>> decide what's saved in DATETIME field: Joda's Datetime or an implementation
>> that supports nanosecond.
>> > >>>>
>> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to
>> have an implementation to support higher precision of time.
>> > >>>>
>> > >>>> What do you think about the need of higher precision for time type
>> and which option is preferred?
>> > >>>>
>> > >>>> -Rui
>>
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Lukasz Cwik <lc...@google.com>.
Added a few folks for visibility.

On Fri, Nov 9, 2018 at 12:43 AM Robert Bradshaw <ro...@google.com> wrote:

> We *might* have a few bits left in the WindowedValue representation to
> make this backwards compatible if we really wanted.
>
> The use of java.time.instant means that we won't be able to upgrade
> (even in v3) our internal timestamps to match without either
> internally supporting >64 bits of precision or limiting the date
> range. But using the standard Java time does make a lot of sense.
> On Fri, Nov 9, 2018 at 12:33 AM Rui Wang <ru...@google.com> wrote:
> >
> > https://github.com/apache/beam/pull/6991
> >
> > I am using java.time.instant as the internal representation to replace
> Joda time for DateTime field in the PR. The java.time.instant uses a long
> to save seconds-after-epoch and a int to save nanoseconds-of-second.
> Therefore 64 bits are fully used for seconds-after-epoch, which loses
> nothing.
> >
> > Comments are very welcome to this PR.
> >
> > -Rui
> >
> > On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <re...@google.com> wrote:
> >>
> >> As you said, this would be update incompatible across all streaming
> pipelines. At the very least this would be a big problem for Dataflow
> users, and I believe many Flink users as well. I'm not sure the benefit
> here justifies causing problems for so many users.
> >>
> >> Reuven
> >>
> >> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>
> >>> Yes, microseconds is a good compromise for covering a long enough
> >>> timespan that there's little reason it could be hit (even for
> >>> processing historical data).
> >>>
> >>> Regarding backwards compatibility, could we just change the internal
> >>> representation of Beam's element timestamps, possibly with new APIs to
> >>> access the finer granularity? (True, it may not be upgrade
> >>> compatible.)
> >>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
> >>> >
> >>> > The main difference (though possibly theoretical) is when time runs
> out. With 64 bits and nanosecond precision, we can only represent times
> about 244 years in the future (or the past).
> >>> >
> >>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org>
> wrote:
> >>> >>
> >>> >> I like nanoseconds as extremely future-proof. What about specing
> this out in stages (1) domain of values (2) portable encoding that can
> represent those values (3) language-specific types to embed the values in.
> >>> >>
> >>> >> 1. If it is a nanosecond-precision absolute time, and we eventually
> want to migrate event time timestamps to match, then we need values for
> "end of global window" and "end of time". TBH I am not sure we need both of
> these any more. We can either define a max on the nanosecond range or
> create distinguished values.
> >>> >>
> >>> >> 2. For portability, presumably an order-preserving integer encoding
> of nanoseconds since epoch with whatever tweaks to allow for representing
> the end of time. It might be useful to find a way to allow multiple. Not
> super useful at a particular version, but might have given us a migration
> path. It would also allow experiments for performance.
> >>> >>
> >>> >> 3. We could probably find a way to keep user-facing API
> compatibility here while increasing underlying precision at 1 and 2, but I
> probably not worth it. A new Java type IMO addresses the lossiness issue
> because a user would have to explicitly request truncation to assign to a
> millis event time timestamp.
> >>> >>
> >>> >> Kenn
> >>> >>
> >>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com>
> wrote:
> >>> >>>
> >>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as
> well as for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
> >>> >>>
> >>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com>
> wrote:
> >>> >>>>
> >>> >>>> +1 to more precision even to the nano level, probably via Reuven's
> >>> >>>> proposal of a different internal representation.
> >>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <
> robertwb@google.com> wrote:
> >>> >>>> >
> >>> >>>> > +1 to offering more granular timestamps in general. I think it
> will be
> >>> >>>> > odd if setting the element timestamp from a row DATETIME field
> is
> >>> >>>> > lossy, so we should seriously consider upgrading that as well.
> >>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com>
> wrote:
> >>> >>>> > >
> >>> >>>> > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds because of legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> >>> >>>> > >
> >>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com>
> wrote:
> >>> >>>> > >>
> >>> >>>> > >> Thanks Reuven!
> >>> >>>> > >>
> >>> >>>> > >> I think Reuven gives the third option:
> >>> >>>> > >>
> >>> >>>> > >> Change internal representation of DATETIME field in Row.
> Still keep public ReadableDateTime getDateTime(String fieldName) API to be
> compatible with existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from the option one because
> option one actually maintains two implementation of time.
> >>> >>>> > >>
> >>> >>>> > >> -Rui
> >>> >>>> > >>
> >>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com>
> wrote:
> >>> >>>> > >>>
> >>> >>>> > >>> I would vote that we change the internal representation of
> Row to something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
> >>> >>>> > >>>
> >>> >>>> > >>> We should still keep accessor methods that return and take
> Joda objects, as the rest of Beam still depends on Joda.
> >>> >>>> > >>>
> >>> >>>> > >>> Reuven
> >>> >>>> > >>>
> >>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com>
> wrote:
> >>> >>>> > >>>>
> >>> >>>> > >>>> Hi Community,
> >>> >>>> > >>>>
> >>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by
> Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is
> limited to the precision of millisecond. It has good enough precision to
> represent timestamp of event time, but it is not enough for the real "time"
> data. For the "time" type data, we probably need to support even up to the
> precision of nanosecond.
> >>> >>>> > >>>>
> >>> >>>> > >>>> Unfortunately, Joda decided to keep the precision of
> millisecond: https://github.com/JodaOrg/joda-time/issues/139.
> >>> >>>> > >>>>
> >>> >>>> > >>>> If we want to support the precision of nanosecond, we
> could have two options:
> >>> >>>> > >>>>
> >>> >>>> > >>>> Option one: utilize current FieldType's metadata field,
> such that we could set something into meta data and Row could check the
> metadata to decide what's saved in DATETIME field: Joda's Datetime or an
> implementation that supports nanosecond.
> >>> >>>> > >>>>
> >>> >>>> > >>>> Option two: have another field (maybe called TIMESTAMP
> field?), to have an implementation to support higher precision of time.
> >>> >>>> > >>>>
> >>> >>>> > >>>> What do you think about the need of higher precision for
> time type and which option is preferred?
> >>> >>>> > >>>>
> >>> >>>> > >>>> -Rui
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Robert Bradshaw <ro...@google.com>.
We *might* have a few bits left in the WindowedValue representation to
make this backwards compatible if we really wanted.

The use of java.time.Instant means that we won't be able to upgrade
(even in v3) our internal timestamps to match without either
internally supporting >64 bits of precision or limiting the date
range. But using the standard Java time does make a lot of sense.
On Fri, Nov 9, 2018 at 12:33 AM Rui Wang <ru...@google.com> wrote:
>
> https://github.com/apache/beam/pull/6991
>
> I am using java.time.instant as the internal representation to replace Joda time for DateTime field in the PR. The java.time.instant uses a long to save seconds-after-epoch and a int to save nanoseconds-of-second. Therefore 64 bits are fully used for seconds-after-epoch, which loses nothing.
>
> Comments are very welcome to this PR.
>
> -Rui
>
> On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <re...@google.com> wrote:
>>
>> As you said, this would be update incompatible across all streaming pipelines. At the very least this would be a big problem for Dataflow users, and I believe many Flink users as well. I'm not sure the benefit here justifies causing problems for so many users.
>>
>> Reuven
>>
>> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> Yes, microseconds is a good compromise for covering a long enough
>>> timespan that there's little reason it could be hit (even for
>>> processing historical data).
>>>
>>> Regarding backwards compatibility, could we just change the internal
>>> representation of Beam's element timestamps, possibly with new APIs to
>>> access the finer granularity? (True, it may not be upgrade
>>> compatible.)
>>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
>>> >
>>> > The main difference (though possibly theoretical) is when time runs out. With 64 bits and nanosecond precision, we can only represent times about 244 years in the future (or the past).
>>> >
>>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org> wrote:
>>> >>
>>> >> I like nanoseconds as extremely future-proof. What about specing this out in stages (1) domain of values (2) portable encoding that can represent those values (3) language-specific types to embed the values in.
>>> >>
>>> >> 1. If it is a nanosecond-precision absolute time, and we eventually want to migrate event time timestamps to match, then we need values for "end of global window" and "end of time". TBH I am not sure we need both of these any more. We can either define a max on the nanosecond range or create distinguished values.
>>> >>
>>> >> 2. For portability, presumably an order-preserving integer encoding of nanoseconds since epoch with whatever tweaks to allow for representing the end of time. It might be useful to find a way to allow multiple. Not super useful at a particular version, but might have given us a migration path. It would also allow experiments for performance.
>>> >>
>>> >> 3. We could probably find a way to keep user-facing API compatibility here while increasing underlying precision at 1 and 2, but I probably not worth it. A new Java type IMO addresses the lossiness issue because a user would have to explicitly request truncation to assign to a millis event time timestamp.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:
>>> >>>
>>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as well as for Beam timestamps in general?  The latter likely has a bunch of downstream consequences for all runners.
>>> >>>
>>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>> >>>>
>>> >>>> +1 to more precision even to the nano level, probably via Reuven's
>>> >>>> proposal of a different internal representation.
>>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com> wrote:
>>> >>>> >
>>> >>>> > +1 to offering more granular timestamps in general. I think it will be
>>> >>>> > odd if setting the element timestamp from a row DATETIME field is
>>> >>>> > lossy, so we should seriously consider upgrading that as well.
>>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>>> >>>> > >
>>> >>>> > > One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds because of legacy reasons related to the Java runner's use of Joda time.  Perhaps Beam portability should natively use a more granular timestamp unit.
>>> >>>> > >
>>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>>> >>>> > >>
>>> >>>> > >> Thanks Reuven!
>>> >>>> > >>
>>> >>>> > >> I think Reuven gives the third option:
>>> >>>> > >>
>>> >>>> > >> Change internal representation of DATETIME field in Row. Still keep public ReadableDateTime getDateTime(String fieldName) API to be compatible with existing code. And I think we could add one more API to getDataTimeNanosecond. This option is different from the option one because option one actually maintains two implementation of time.
>>> >>>> > >>
>>> >>>> > >> -Rui
>>> >>>> > >>
>>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>>> >>>> > >>>
>>> >>>> > >>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
>>> >>>> > >>>
>>> >>>> > >>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
>>> >>>> > >>>
>>> >>>> > >>> Reuven
>>> >>>> > >>>
>>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>>> >>>> > >>>>
>>> >>>> > >>>> Hi Community,
>>> >>>> > >>>>
>>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited to the precision of millisecond. It has good enough precision to represent timestamp of event time, but it is not enough for the real "time" data. For the "time" type data, we probably need to support even up to the precision of nanosecond.
>>> >>>> > >>>>
>>> >>>> > >>>> Unfortunately, Joda decided to keep the precision of millisecond: https://github.com/JodaOrg/joda-time/issues/139.
>>> >>>> > >>>>
>>> >>>> > >>>> If we want to support the precision of nanosecond, we could have two options:
>>> >>>> > >>>>
>>> >>>> > >>>> Option one: utilize current FieldType's metadata field, such that we could set something into meta data and Row could check the metadata to decide what's saved in DATETIME field: Joda's Datetime or an implementation that supports nanosecond.
>>> >>>> > >>>>
>>> >>>> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to have an implementation to support higher precision of time.
>>> >>>> > >>>>
>>> >>>> > >>>> What do you think about the need of higher precision for time type and which option is preferred?
>>> >>>> > >>>>
>>> >>>> > >>>> -Rui

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Rui Wang <ru...@google.com>.
https://github.com/apache/beam/pull/6991

In the PR I am using java.time.Instant as the internal representation,
replacing Joda time for the DATETIME field. java.time.Instant uses a *long*
to store seconds-since-epoch and an *int* to store nanoseconds-of-second, so
the full 64 bits are used for seconds-since-epoch and nothing is lost.
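
A minimal sketch of the idea (assumed class and method names, not the PR
itself): keep java.time.Instant as the stored value, keep a Joda-returning
accessor for compatibility, and expose the full precision through an
additional getter.

  import java.time.Instant;
  import org.joda.time.DateTime;
  import org.joda.time.ReadableDateTime;

  class DateTimeValueSketch {
    private final Instant instant;  // long epoch-seconds + int nanos internally

    DateTimeValueSketch(Instant instant) {
      this.instant = instant;
    }

    // Compatibility accessor: truncates to milliseconds for existing Joda callers.
    ReadableDateTime getDateTime() {
      return new DateTime(instant.toEpochMilli());
    }

    // Additional accessor exposing the full nanosecond precision.
    Instant getInstant() {
      return instant;
    }
  }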

Comments on this PR are very welcome.

-Rui

On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <re...@google.com> wrote:

> As you said, this would be update incompatible across all streaming
> pipelines. At the very least this would be a big problem for Dataflow
> users, and I believe many Flink users as well. I'm not sure the benefit
> here justifies causing problems for so many users.
>
> Reuven
>
> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Yes, microseconds is a good compromise for covering a long enough
>> timespan that there's little reason it could be hit (even for
>> processing historical data).
>>
>> Regarding backwards compatibility, could we just change the internal
>> representation of Beam's element timestamps, possibly with new APIs to
>> access the finer granularity? (True, it may not be upgrade
>> compatible.)
>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
>> >
>> > The main difference (though possibly theoretical) is when time runs
>> out. With 64 bits and nanosecond precision, we can only represent times
>> about 244 years in the future (or the past).
>> >
>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org>
>> wrote:
>> >>
>> >> I like nanoseconds as extremely future-proof. What about specing this
>> out in stages (1) domain of values (2) portable encoding that can represent
>> those values (3) language-specific types to embed the values in.
>> >>
>> >> 1. If it is a nanosecond-precision absolute time, and we eventually
>> want to migrate event time timestamps to match, then we need values for
>> "end of global window" and "end of time". TBH I am not sure we need both of
>> these any more. We can either define a max on the nanosecond range or
>> create distinguished values.
>> >>
>> >> 2. For portability, presumably an order-preserving integer encoding of
>> nanoseconds since epoch with whatever tweaks to allow for representing the
>> end of time. It might be useful to find a way to allow multiple. Not super
>> useful at a particular version, but might have given us a migration path.
>> It would also allow experiments for performance.
>> >>
>> >> 3. We could probably find a way to keep user-facing API compatibility
>> here while increasing underlying precision at 1 and 2, but I probably not
>> worth it. A new Java type IMO addresses the lossiness issue because a user
>> would have to explicitly request truncation to assign to a millis event
>> time timestamp.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:
>> >>>
>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as
>> well as for Beam timestamps in general?  The latter likely has a bunch of
>> downstream consequences for all runners.
>> >>>
>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com>
>> wrote:
>> >>>>
>> >>>> +1 to more precision even to the nano level, probably via Reuven's
>> >>>> proposal of a different internal representation.
>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >>>> >
>> >>>> > +1 to offering more granular timestamps in general. I think it
>> will be
>> >>>> > odd if setting the element timestamp from a row DATETIME field is
>> >>>> > lossy, so we should seriously consider upgrading that as well.
>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com>
>> wrote:
>> >>>> > >
>> >>>> > > One related issue that came up before is that we (perhaps
>> unnecessarily) restrict the precision of timestamps in the Python SDK to
>> milliseconds because of legacy reasons related to the Java runner's use of
>> Joda time.  Perhaps Beam portability should natively use a more granular
>> timestamp unit.
>> >>>> > >
>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com>
>> wrote:
>> >>>> > >>
>> >>>> > >> Thanks Reuven!
>> >>>> > >>
>> >>>> > >> I think Reuven gives the third option:
>> >>>> > >>
>> >>>> > >> Change internal representation of DATETIME field in Row. Still
>> keep public ReadableDateTime getDateTime(String fieldName) API to be
>> compatible with existing code. And I think we could add one more API to
>> getDataTimeNanosecond. This option is different from the option one because
>> option one actually maintains two implementation of time.
>> >>>> > >>
>> >>>> > >> -Rui
>> >>>> > >>
>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com>
>> wrote:
>> >>>> > >>>
>> >>>> > >>> I would vote that we change the internal representation of Row
>> to something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>> >>>> > >>>
>> >>>> > >>> We should still keep accessor methods that return and take
>> Joda objects, as the rest of Beam still depends on Joda.
>> >>>> > >>>
>> >>>> > >>> Reuven
>> >>>> > >>>
>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com>
>> wrote:
>> >>>> > >>>>
>> >>>> > >>>> Hi Community,
>> >>>> > >>>>
>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by
>> Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is
>> limited to the precision of millisecond. It has good enough precision to
>> represent timestamp of event time, but it is not enough for the real "time"
>> data. For the "time" type data, we probably need to support even up to the
>> precision of nanosecond.
>> >>>> > >>>>
>> >>>> > >>>> Unfortunately, Joda decided to keep the precision of
>> millisecond: https://github.com/JodaOrg/joda-time/issues/139.
>> >>>> > >>>>
>> >>>> > >>>> If we want to support the precision of nanosecond, we could
>> have two options:
>> >>>> > >>>>
>> >>>> > >>>> Option one: utilize current FieldType's metadata field, such
>> that we could set something into meta data and Row could check the metadata
>> to decide what's saved in DATETIME field: Joda's Datetime or an
>> implementation that supports nanosecond.
>> >>>> > >>>>
>> >>>> > >>>> Option two: have another field (maybe called TIMESTAMP
>> field?), to have an implementation to support higher precision of time.
>> >>>> > >>>>
>> >>>> > >>>> What do you think about the need of higher precision for time
>> type and which option is preferred?
>> >>>> > >>>>
>> >>>> > >>>> -Rui
>>
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Reuven Lax <re...@google.com>.
As you said, this would be update-incompatible across all streaming
pipelines. At the very least this would be a big problem for Dataflow
users, and I believe many Flink users as well. I'm not sure the benefit
here justifies causing problems for so many users.

Reuven

On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <ro...@google.com> wrote:

> Yes, microseconds is a good compromise for covering a long enough
> timespan that there's little reason it could be hit (even for
> processing historical data).
>
> Regarding backwards compatibility, could we just change the internal
> representation of Beam's element timestamps, possibly with new APIs to
> access the finer granularity? (True, it may not be upgrade
> compatible.)
> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
> >
> > The main difference (though possibly theoretical) is when time runs out.
> With 64 bits and nanosecond precision, we can only represent times about
> 244 years in the future (or the past).
> >
> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org> wrote:
> >>
> >> I like nanoseconds as extremely future-proof. What about specing this
> out in stages (1) domain of values (2) portable encoding that can represent
> those values (3) language-specific types to embed the values in.
> >>
> >> 1. If it is a nanosecond-precision absolute time, and we eventually
> want to migrate event time timestamps to match, then we need values for
> "end of global window" and "end of time". TBH I am not sure we need both of
> these any more. We can either define a max on the nanosecond range or
> create distinguished values.
> >>
> >> 2. For portability, presumably an order-preserving integer encoding of
> nanoseconds since epoch with whatever tweaks to allow for representing the
> end of time. It might be useful to find a way to allow multiple. Not super
> useful at a particular version, but might have given us a migration path.
> It would also allow experiments for performance.
> >>
> >> 3. We could probably find a way to keep user-facing API compatibility
> here while increasing underlying precision at 1 and 2, but I probably not
> worth it. A new Java type IMO addresses the lossiness issue because a user
> would have to explicitly request truncation to assign to a millis event
> time timestamp.
> >>
> >> Kenn
> >>
> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:
> >>>
> >>> Is the proposal to do this for both Beam Schema DATETIME fields as
> well as for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
> >>>
> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com>
> wrote:
> >>>>
> >>>> +1 to more precision even to the nano level, probably via Reuven's
> >>>> proposal of a different internal representation.
> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>> >
> >>>> > +1 to offering more granular timestamps in general. I think it will
> be
> >>>> > odd if setting the element timestamp from a row DATETIME field is
> >>>> > lossy, so we should seriously consider upgrading that as well.
> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
> >>>> > >
> >>>> > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds because of legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> >>>> > >
> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com>
> wrote:
> >>>> > >>
> >>>> > >> Thanks Reuven!
> >>>> > >>
> >>>> > >> I think Reuven gives the third option:
> >>>> > >>
> >>>> > >> Change internal representation of DATETIME field in Row. Still
> keep public ReadableDateTime getDateTime(String fieldName) API to be
> compatible with existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from the option one because
> option one actually maintains two implementation of time.
> >>>> > >>
> >>>> > >> -Rui
> >>>> > >>
> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com>
> wrote:
> >>>> > >>>
> >>>> > >>> I would vote that we change the internal representation of Row
> to something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
> >>>> > >>>
> >>>> > >>> We should still keep accessor methods that return and take Joda
> objects, as the rest of Beam still depends on Joda.
> >>>> > >>>
> >>>> > >>> Reuven
> >>>> > >>>
> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com>
> wrote:
> >>>> > >>>>
> >>>> > >>>> Hi Community,
> >>>> > >>>>
> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
> to the precision of millisecond. It has good enough precision to represent
> timestamp of event time, but it is not enough for the real "time" data. For
> the "time" type data, we probably need to support even up to the precision
> of nanosecond.
> >>>> > >>>>
> >>>> > >>>> Unfortunately, Joda decided to keep the precision of
> millisecond: https://github.com/JodaOrg/joda-time/issues/139.
> >>>> > >>>>
> >>>> > >>>> If we want to support the precision of nanosecond, we could
> have two options:
> >>>> > >>>>
> >>>> > >>>> Option one: utilize current FieldType's metadata field, such
> that we could set something into meta data and Row could check the metadata
> to decide what's saved in DATETIME field: Joda's Datetime or an
> implementation that supports nanosecond.
> >>>> > >>>>
> >>>> > >>>> Option two: have another field (maybe called TIMESTAMP
> field?), to have an implementation to support higher precision of time.
> >>>> > >>>>
> >>>> > >>>> What do you think about the need of higher precision for time
> type and which option is preferred?
> >>>> > >>>>
> >>>> > >>>> -Rui
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Robert Bradshaw <ro...@google.com>.
Yes, microseconds is a good compromise: it covers a long enough timespan
that there is little risk of hitting the limit (even when processing
historical data).

Regarding backwards compatibility, could we just change the internal
representation of Beam's element timestamps, possibly with new APIs to
access the finer granularity? (True, it may not be upgrade
compatible.)
On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <re...@google.com> wrote:
>
> The main difference (though possibly theoretical) is when time runs out. With 64 bits and nanosecond precision, we can only represent times about 244 years in the future (or the past).
>
> On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>> I like nanoseconds as extremely future-proof. What about specing this out in stages (1) domain of values (2) portable encoding that can represent those values (3) language-specific types to embed the values in.
>>
>> 1. If it is a nanosecond-precision absolute time, and we eventually want to migrate event time timestamps to match, then we need values for "end of global window" and "end of time". TBH I am not sure we need both of these any more. We can either define a max on the nanosecond range or create distinguished values.
>>
>> 2. For portability, presumably an order-preserving integer encoding of nanoseconds since epoch with whatever tweaks to allow for representing the end of time. It might be useful to find a way to allow multiple. Not super useful at a particular version, but might have given us a migration path. It would also allow experiments for performance.
>>
>> 3. We could probably find a way to keep user-facing API compatibility here while increasing underlying precision at 1 and 2, but I probably not worth it. A new Java type IMO addresses the lossiness issue because a user would have to explicitly request truncation to assign to a millis event time timestamp.
>>
>> Kenn
>>
>> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:
>>>
>>> Is the proposal to do this for both Beam Schema DATETIME fields as well as for Beam timestamps in general?  The latter likely has a bunch of downstream consequences for all runners.
>>>
>>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>> +1 to more precision even to the nano level, probably via Reuven's
>>>> proposal of a different internal representation.
>>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com> wrote:
>>>> >
>>>> > +1 to offering more granular timestamps in general. I think it will be
>>>> > odd if setting the element timestamp from a row DATETIME field is
>>>> > lossy, so we should seriously consider upgrading that as well.
>>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>>>> > >
>>>> > > One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds because of legacy reasons related to the Java runner's use of Joda time.  Perhaps Beam portability should natively use a more granular timestamp unit.
>>>> > >
>>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>>>> > >>
>>>> > >> Thanks Reuven!
>>>> > >>
>>>> > >> I think Reuven gives the third option:
>>>> > >>
>>>> > >> Change internal representation of DATETIME field in Row. Still keep public ReadableDateTime getDateTime(String fieldName) API to be compatible with existing code. And I think we could add one more API to getDataTimeNanosecond. This option is different from the option one because option one actually maintains two implementation of time.
>>>> > >>
>>>> > >> -Rui
>>>> > >>
>>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>>>> > >>>
>>>> > >>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
>>>> > >>>
>>>> > >>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
>>>> > >>>
>>>> > >>> Reuven
>>>> > >>>
>>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>>>> > >>>>
>>>> > >>>> Hi Community,
>>>> > >>>>
>>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited to the precision of millisecond. It has good enough precision to represent timestamp of event time, but it is not enough for the real "time" data. For the "time" type data, we probably need to support even up to the precision of nanosecond.
>>>> > >>>>
>>>> > >>>> Unfortunately, Joda decided to keep the precision of millisecond: https://github.com/JodaOrg/joda-time/issues/139.
>>>> > >>>>
>>>> > >>>> If we want to support the precision of nanosecond, we could have two options:
>>>> > >>>>
>>>> > >>>> Option one: utilize current FieldType's metadata field, such that we could set something into meta data and Row could check the metadata to decide what's saved in DATETIME field: Joda's Datetime or an implementation that supports nanosecond.
>>>> > >>>>
>>>> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to have an implementation to support higher precision of time.
>>>> > >>>>
>>>> > >>>> What do you think about the need of higher precision for time type and which option is preferred?
>>>> > >>>>
>>>> > >>>> -Rui

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Reuven Lax <re...@google.com>.
The main difference (though possibly theoretical) is when time runs out.
With 64 bits and nanosecond precision, we can only represent times about
292 years into the future (or the past).
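
A rough back-of-the-envelope check, assuming a signed 64-bit count of time
units since the epoch (the range extends the same distance into the past):

  public class TimestampRange {
    public static void main(String[] args) {
      long secondsPerYear = 365L * 86_400L;
      // Representable range forward from 1970 for a signed 64-bit counter.
      System.out.println(Long.MAX_VALUE / 1_000_000_000L / secondsPerYear); // ~292 years (nanos)
      System.out.println(Long.MAX_VALUE / 1_000_000L / secondsPerYear);     // ~292,471 years (micros)
      System.out.println(Long.MAX_VALUE / 1_000L / secondsPerYear);         // ~292 million years (millis)
    }
  }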

On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <ke...@apache.org> wrote:

> I like nanoseconds as extremely future-proof. What about specing this out
> in stages (1) domain of values (2) portable encoding that can represent
> those values (3) language-specific types to embed the values in.
>
> 1. If it is a nanosecond-precision absolute time, and we eventually want
> to migrate event time timestamps to match, then we need values for "end of
> global window" and "end of time". TBH I am not sure we need both of these
> any more. We can either define a max on the nanosecond range or create
> distinguished values.
>
> 2. For portability, presumably an order-preserving integer encoding of
> nanoseconds since epoch with whatever tweaks to allow for representing the
> end of time. It might be useful to find a way to allow multiple. Not super
> useful at a particular version, but might have given us a migration path.
> It would also allow experiments for performance.
>
> 3. We could probably find a way to keep user-facing API compatibility here
> while increasing underlying precision at 1 and 2, but I probably not worth
> it. A new Java type IMO addresses the lossiness issue because a user would
> have to explicitly request truncation to assign to a millis event time
> timestamp.
>
> Kenn
>
> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:
>
>> Is the proposal to do this for both Beam Schema DATETIME fields as well
>> as for Beam timestamps in general?  The latter likely has a bunch of
>> downstream consequences for all runners.
>>
>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>
>>> +1 to more precision even to the nano level, probably via Reuven's
>>> proposal of a different internal representation.
>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>> >
>>> > +1 to offering more granular timestamps in general. I think it will be
>>> > odd if setting the element timestamp from a row DATETIME field is
>>> > lossy, so we should seriously consider upgrading that as well.
>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>>> > >
>>> > > One related issue that came up before is that we (perhaps
>>> unnecessarily) restrict the precision of timestamps in the Python SDK to
>>> milliseconds because of legacy reasons related to the Java runner's use of
>>> Joda time.  Perhaps Beam portability should natively use a more granular
>>> timestamp unit.
>>> > >
>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>>> > >>
>>> > >> Thanks Reuven!
>>> > >>
>>> > >> I think Reuven gives the third option:
>>> > >>
>>> > >> Change internal representation of DATETIME field in Row. Still keep
>>> public ReadableDateTime getDateTime(String fieldName) API to be compatible
>>> with existing code. And I think we could add one more API to
>>> getDataTimeNanosecond. This option is different from the option one because
>>> option one actually maintains two implementation of time.
>>> > >>
>>> > >> -Rui
>>> > >>
>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>>> > >>>
>>> > >>> I would vote that we change the internal representation of Row to
>>> something other than Joda. Java 8 times would give us at least
>>> microseconds, and if we want nanoseconds we could simply store it as a
>>> number.
>>> > >>>
>>> > >>> We should still keep accessor methods that return and take Joda
>>> objects, as the rest of Beam still depends on Joda.
>>> > >>>
>>> > >>> Reuven
>>> > >>>
>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>>> > >>>>
>>> > >>>> Hi Community,
>>> > >>>>
>>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
>>> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
>>> to the precision of millisecond. It has good enough precision to represent
>>> timestamp of event time, but it is not enough for the real "time" data. For
>>> the "time" type data, we probably need to support even up to the precision
>>> of nanosecond.
>>> > >>>>
>>> > >>>> Unfortunately, Joda decided to keep the precision of millisecond:
>>> https://github.com/JodaOrg/joda-time/issues/139.
>>> > >>>>
>>> > >>>> If we want to support the precision of nanosecond, we could have
>>> two options:
>>> > >>>>
>>> > >>>> Option one: utilize current FieldType's metadata field, such that
>>> we could set something into meta data and Row could check the metadata to
>>> decide what's saved in DATETIME field: Joda's Datetime or an implementation
>>> that supports nanosecond.
>>> > >>>>
>>> > >>>> Option two: have another field (maybe called TIMESTAMP field?),
>>> to have an implementation to support higher precision of time.
>>> > >>>>
>>> > >>>> What do you think about the need of higher precision for time
>>> type and which option is preferred?
>>> > >>>>
>>> > >>>> -Rui
>>>
>>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Kenneth Knowles <ke...@apache.org>.
I like nanoseconds as extremely future-proof. What about speccing this out
in stages: (1) the domain of values, (2) a portable encoding that can
represent those values, and (3) language-specific types to embed the values
in.

1. If it is a nanosecond-precision absolute time, and we eventually want to
migrate event time timestamps to match, then we need values for "end of
global window" and "end of time". TBH I am not sure we need both of these
any more. We can either define a max on the nanosecond range or create
distinguished values.

2. For portability, presumably an order-preserving integer encoding of
nanoseconds since epoch, with whatever tweaks are needed to represent the
end of time (a rough sketch of such an encoding follows these points). It
might be useful to find a way to allow multiple encodings: not super useful
at any particular version, but it might give us a migration path. It would
also allow performance experiments.

3. We could probably find a way to keep user-facing API compatibility here
while increasing the underlying precision at 1 and 2, but it's probably not
worth it. A new Java type IMO addresses the lossiness issue, because a user
would have to explicitly request truncation to assign to a millis event
time timestamp.
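
A minimal sketch of the kind of encoding point 2 describes (assumptions:
8-byte big-endian layout, Long.MAX_VALUE reserved as the distinguished "end
of time" value). Flipping the sign bit makes signed longs sort correctly as
unsigned big-endian bytes, so lexicographic byte order matches timestamp
order:

  import java.nio.ByteBuffer;

  class OrderPreservingTimestampEncoding {
    static final long END_OF_TIME = Long.MAX_VALUE;  // distinguished value (assumption)

    static byte[] encode(long nanosSinceEpoch) {
      // XOR with Long.MIN_VALUE flips the sign bit: negative values now compare
      // below positive ones when the 8 bytes are compared as unsigned big-endian.
      return ByteBuffer.allocate(8).putLong(nanosSinceEpoch ^ Long.MIN_VALUE).array();
    }

    static long decode(byte[] bytes) {
      return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }
  }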

Kenn

On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <cc...@google.com> wrote:

> Is the proposal to do this for both Beam Schema DATETIME fields as well as
> for Beam timestamps in general?  The latter likely has a bunch of
> downstream consequences for all runners.
>
> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> +1 to more precision even to the nano level, probably via Reuven's
>> proposal of a different internal representation.
>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>> >
>> > +1 to offering more granular timestamps in general. I think it will be
>> > odd if setting the element timestamp from a row DATETIME field is
>> > lossy, so we should seriously consider upgrading that as well.
>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>> > >
>> > > One related issue that came up before is that we (perhaps
>> unnecessarily) restrict the precision of timestamps in the Python SDK to
>> milliseconds because of legacy reasons related to the Java runner's use of
>> Joda time.  Perhaps Beam portability should natively use a more granular
>> timestamp unit.
>> > >
>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>> > >>
>> > >> Thanks Reuven!
>> > >>
>> > >> I think Reuven gives the third option:
>> > >>
>> > >> Change internal representation of DATETIME field in Row. Still keep
>> public ReadableDateTime getDateTime(String fieldName) API to be compatible
>> with existing code. And I think we could add one more API to
>> getDataTimeNanosecond. This option is different from the option one because
>> option one actually maintains two implementation of time.
>> > >>
>> > >> -Rui
>> > >>
>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>> > >>>
>> > >>> I would vote that we change the internal representation of Row to
>> something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>> > >>>
>> > >>> We should still keep accessor methods that return and take Joda
>> objects, as the rest of Beam still depends on Joda.
>> > >>>
>> > >>> Reuven
>> > >>>
>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>> > >>>>
>> > >>>> Hi Community,
>> > >>>>
>> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
>> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
>> to the precision of millisecond. It has good enough precision to represent
>> timestamp of event time, but it is not enough for the real "time" data. For
>> the "time" type data, we probably need to support even up to the precision
>> of nanosecond.
>> > >>>>
>> > >>>> Unfortunately, Joda decided to keep the precision of millisecond:
>> https://github.com/JodaOrg/joda-time/issues/139.
>> > >>>>
>> > >>>> If we want to support the precision of nanosecond, we could have
>> two options:
>> > >>>>
>> > >>>> Option one: utilize current FieldType's metadata field, such that
>> we could set something into meta data and Row could check the metadata to
>> decide what's saved in DATETIME field: Joda's Datetime or an implementation
>> that supports nanosecond.
>> > >>>>
>> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to
>> have an implementation to support higher precision of time.
>> > >>>>
>> > >>>> What do you think about the need of higher precision for time type
>> and which option is preferred?
>> > >>>>
>> > >>>> -Rui
>>
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Charles Chen <cc...@google.com>.
Is the proposal to do this both for Beam Schema DATETIME fields and for
Beam timestamps in general?  The latter likely has a bunch of downstream
consequences for all runners.

On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <ie...@gmail.com> wrote:

> +1 to more precision even to the nano level, probably via Reuven's
> proposal of a different internal representation.
> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com>
> wrote:
> >
> > +1 to offering more granular timestamps in general. I think it will be
> > odd if setting the element timestamp from a row DATETIME field is
> > lossy, so we should seriously consider upgrading that as well.
> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
> > >
> > > One related issue that came up before is that we (perhaps
> unnecessarily) restrict the precision of timestamps in the Python SDK to
> milliseconds because of legacy reasons related to the Java runner's use of
> Joda time.  Perhaps Beam portability should natively use a more granular
> timestamp unit.
> > >
> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
> > >>
> > >> Thanks Reuven!
> > >>
> > >> I think Reuven gives the third option:
> > >>
> > >> Change internal representation of DATETIME field in Row. Still keep
> public ReadableDateTime getDateTime(String fieldName) API to be compatible
> with existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from the option one because
> option one actually maintains two implementation of time.
> > >>
> > >> -Rui
> > >>
> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
> > >>>
> > >>> I would vote that we change the internal representation of Row to
> something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
> > >>>
> > >>> We should still keep accessor methods that return and take Joda
> objects, as the rest of Beam still depends on Joda.
> > >>>
> > >>> Reuven
> > >>>
> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
> > >>>>
> > >>>> Hi Community,
> > >>>>
> > >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
> to the precision of millisecond. It has good enough precision to represent
> timestamp of event time, but it is not enough for the real "time" data. For
> the "time" type data, we probably need to support even up to the precision
> of nanosecond.
> > >>>>
> > >>>> Unfortunately, Joda decided to keep the precision of millisecond:
> https://github.com/JodaOrg/joda-time/issues/139.
> > >>>>
> > >>>> If we want to support the precision of nanosecond, we could have
> two options:
> > >>>>
> > >>>> Option one: utilize current FieldType's metadata field, such that
> we could set something into meta data and Row could check the metadata to
> decide what's saved in DATETIME field: Joda's Datetime or an implementation
> that supports nanosecond.
> > >>>>
> > >>>> Option two: have another field (maybe called TIMESTAMP field?), to
> have an implementation to support higher precision of time.
> > >>>>
> > >>>> What do you think about the need of higher precision for time type
> and which option is preferred?
> > >>>>
> > >>>> -Rui
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Ismaël Mejía <ie...@gmail.com>.
+1 to more precision even to the nano level, probably via Reuven's
proposal of a different internal representation.
On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <ro...@google.com> wrote:
>
> +1 to offering more granular timestamps in general. I think it will be
> odd if setting the element timestamp from a row DATETIME field is
> lossy, so we should seriously consider upgrading that as well.
> On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
> >
> > One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds because of legacy reasons related to the Java runner's use of Joda time.  Perhaps Beam portability should natively use a more granular timestamp unit.
> >
> > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
> >>
> >> Thanks Reuven!
> >>
> >> I think Reuven gives the third option:
> >>
> >> Change internal representation of DATETIME field in Row. Still keep public ReadableDateTime getDateTime(String fieldName) API to be compatible with existing code. And I think we could add one more API to getDataTimeNanosecond. This option is different from the option one because option one actually maintains two implementation of time.
> >>
> >> -Rui
> >>
> >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
> >>>
> >>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
> >>>
> >>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
> >>>
> >>> Reuven
> >>>
> >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
> >>>>
> >>>> Hi Community,
> >>>>
> >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited to the precision of millisecond. It has good enough precision to represent timestamp of event time, but it is not enough for the real "time" data. For the "time" type data, we probably need to support even up to the precision of nanosecond.
> >>>>
> >>>> Unfortunately, Joda decided to keep the precision of millisecond: https://github.com/JodaOrg/joda-time/issues/139.
> >>>>
> >>>> If we want to support the precision of nanosecond, we could have two options:
> >>>>
> >>>> Option one: utilize current FieldType's metadata field, such that we could set something into meta data and Row could check the metadata to decide what's saved in DATETIME field: Joda's Datetime or an implementation that supports nanosecond.
> >>>>
> >>>> Option two: have another field (maybe called TIMESTAMP field?), to have an implementation to support higher precision of time.
> >>>>
> >>>> What do you think about the need of higher precision for time type and which option is preferred?
> >>>>
> >>>> -Rui

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Reuven Lax <re...@google.com>.
Robert - unfortunately I think changing Beam's element timestamps is not
backwards compatible, and will have to wait till Beam 3.0.

On Tue, Nov 6, 2018 at 12:19 AM Robert Bradshaw <ro...@google.com> wrote:

> +1 to offering more granular timestamps in general. I think it will be
> odd if setting the element timestamp from a row DATETIME field is
> lossy, so we should seriously consider upgrading that as well.
> On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
> >
> > One related issue that came up before is that we (perhaps unnecessarily)
> restrict the precision of timestamps in the Python SDK to milliseconds
> because of legacy reasons related to the Java runner's use of Joda time.
> Perhaps Beam portability should natively use a more granular timestamp unit.
> >
> > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
> >>
> >> Thanks Reuven!
> >>
> >> I think Reuven gives the third option:
> >>
> >> Change internal representation of DATETIME field in Row. Still keep
> public ReadableDateTime getDateTime(String fieldName) API to be compatible
> with existing code. And I think we could add one more API to
> getDataTimeNanosecond. This option is different from the option one because
> option one actually maintains two implementation of time.
> >>
> >> -Rui
> >>
> >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
> >>>
> >>> I would vote that we change the internal representation of Row to
> something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
> >>>
> >>> We should still keep accessor methods that return and take Joda
> objects, as the rest of Beam still depends on Joda.
> >>>
> >>> Reuven
> >>>
> >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
> >>>>
> >>>> Hi Community,
> >>>>
> >>>> The DATETIME field in Beam Schema/Row is implemented by Joda's
> Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited
> to the precision of millisecond. It has good enough precision to represent
> timestamp of event time, but it is not enough for the real "time" data. For
> the "time" type data, we probably need to support even up to the precision
> of nanosecond.
> >>>>
> >>>> Unfortunately, Joda decided to keep the precision of millisecond:
> https://github.com/JodaOrg/joda-time/issues/139.
> >>>>
> >>>> If we want to support the precision of nanosecond, we could have two
> options:
> >>>>
> >>>> Option one: utilize current FieldType's metadata field, such that we
> could set something into meta data and Row could check the metadata to
> decide what's saved in DATETIME field: Joda's Datetime or an implementation
> that supports nanosecond.
> >>>>
> >>>> Option two: have another field (maybe called TIMESTAMP field?), to
> have an implementation to support higher precision of time.
> >>>>
> >>>> What do you think about the need of higher precision for time type
> and which option is preferred?
> >>>>
> >>>> -Rui
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Robert Bradshaw <ro...@google.com>.
+1 to offering more granular timestamps in general. I think it will be
odd if setting the element timestamp from a row DATETIME field is
lossy, so we should seriously consider upgrading that as well.
On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <cc...@google.com> wrote:
>
> One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds because of legacy reasons related to the Java runner's use of Joda time.  Perhaps Beam portability should natively use a more granular timestamp unit.
>
> On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:
>>
>> Thanks Reuven!
>>
>> I think Reuven gives the third option:
>>
>> Change internal representation of DATETIME field in Row. Still keep the public ReadableDateTime getDateTime(String fieldName) API to be compatible with existing code. And I think we could add one more API, getDateTimeNanosecond. This option is different from option one because option one actually maintains two implementations of time.
>>
>> -Rui
>>
>> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>>>
>>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
>>>
>>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
>>>
>>> Reuven
>>>
>>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>>>>
>>>> Hi Community,
>>>>
>>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime (see Row.java#L611 and Row.java#L169). Joda's Datetime is limited to the precision of millisecond. It has good enough precision to represent timestamp of event time, but it is not enough for the real "time" data. For the "time" type data, we probably need to support even up to the precision of nanosecond.
>>>>
>>>> Unfortunately, Joda decided to keep the precision of millisecond: https://github.com/JodaOrg/joda-time/issues/139.
>>>>
>>>> If we want to support the precision of nanosecond, we could have two options:
>>>>
>>>> Option one: utilize current FieldType's metadata field, such that we could set something into meta data and Row could check the metadata to decide what's saved in DATETIME field: Joda's Datetime or an implementation that supports nanosecond.
>>>>
>>>> Option two: have another field (maybe called TIMESTAMP field?), to have an implementation to support higher precision of time.
>>>>
>>>> What do you think about the need of higher precision for time type and which option is preferred?
>>>>
>>>> -Rui

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Charles Chen <cc...@google.com>.
One related issue that came up before is that we (perhaps unnecessarily)
restrict the precision of timestamps in the Python SDK to milliseconds
because of legacy reasons related to the Java runner's use of Joda time.
Perhaps Beam portability should natively use a more granular timestamp unit.
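
As a sketch of what a more granular unit could look like at the portability
layer (an assumption for illustration only, not an existing Beam
definition), a seconds-plus-nanos encoding avoids baking a millisecond limit
into the model:

import java.time.Instant;

// Sketch only: a portable timestamp carried as seconds + nanos, so SDKs are
// not tied to any one library's millisecond limit.
public class PortableTimestampSketch {
  final long seconds;  // seconds since the Unix epoch
  final int nanos;     // nanoseconds within the second, 0..999,999,999

  PortableTimestampSketch(long seconds, int nanos) {
    this.seconds = seconds;
    this.nanos = nanos;
  }

  Instant toInstant() {
    return Instant.ofEpochSecond(seconds, nanos);
  }

  public static void main(String[] args) {
    PortableTimestampSketch ts = new PortableTimestampSketch(1541481690L, 123456789);
    System.out.println(ts.toInstant());  // 2018-11-06T05:21:30.123456789Z
  }
}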

On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ru...@google.com> wrote:

> Thanks Reuven!
>
> I think Reuven gives the third option:
>
> Change internal representation of DATETIME field in Row. Still keep the
> public ReadableDateTime getDateTime(String fieldName) API to be compatible
> with existing code. And I think we could add one more API,
> getDateTimeNanosecond. This option is different from option one because
> option one actually maintains two implementations of time.
>
> -Rui
>
> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:
>
>> I would vote that we change the internal representation of Row to
>> something other than Joda. Java 8 times would give us at least
>> microseconds, and if we want nanoseconds we could simply store it as a
>> number.
>>
>> We should still keep accessor methods that return and take Joda objects,
>> as the rest of Beam still depends on Joda.
>>
>> Reuven
>>
>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>>
>>> Hi Community,
>>>
>>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime
>>> (see Row.java#L611
>>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L611>
>>>  and Row.java#L169
>>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L169>).
>>> Joda's Datetime is limited to the precision of millisecond. It has good
>>> enough precision to represent timestamp of event time, but it is not enough
>>> for the real "time" data. For the "time" type data, we probably need to
>>> support even up to the precision of nanosecond.
>>>
>>> Unfortunately, Joda decided to keep the precision of millisecond:
>>> https://github.com/JodaOrg/joda-time/issues/139.
>>>
>>> If we want to support the precision of nanosecond, we could have two
>>> options:
>>>
>>> Option one: utilize current FieldType's metadata field
>>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L421>,
>>> such that we could set something into meta data and Row could check the
>>> metadata to decide what's saved in DATETIME field: Joda's Datetime or an
>>> implementation that supports nanosecond.
>>>
>>> Option two: have another field (maybe called TIMESTAMP field?), to have
>>> an implementation to support higher precision of time.
>>>
>>> What do you think about the need of higher precision for time type and
>>> which option is preferred?
>>>
>>> -Rui
>>>
>>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Rui Wang <ru...@google.com>.
Thanks Reuven!

I think Reuven gives the third option:

Change internal representation of DATETIME field in Row. Still keep the
public ReadableDateTime getDateTime(String fieldName) API to be compatible
with existing code. And I think we could add one more API,
getDateTimeNanosecond. This option is different from option one because
option one actually maintains two implementations of time.
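
For illustration, a rough sketch of what that accessor pair could look like
(the getDateTimeNanosecond name and the java.time.Instant storage below are
assumptions for this sketch, not existing Beam API):

import java.time.Instant;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.ReadableDateTime;

// Rough sketch only, not Beam code: one internal value, two accessors.
public class DateTimeFieldSketch {
  private final Instant value;  // assumed internal storage, nanosecond precision

  public DateTimeFieldSketch(Instant value) {
    this.value = value;
  }

  // Existing-style accessor, kept for compatibility; truncates to milliseconds.
  public ReadableDateTime getDateTime() {
    return new DateTime(value.toEpochMilli(), DateTimeZone.UTC);
  }

  // Hypothetical new accessor exposing the full nanosecond precision.
  public Instant getDateTimeNanosecond() {
    return value;
  }

  public static void main(String[] args) {
    DateTimeFieldSketch f =
        new DateTimeFieldSketch(Instant.parse("2018-11-06T05:21:30.123456789Z"));
    System.out.println(f.getDateTime());            // 2018-11-06T05:21:30.123Z
    System.out.println(f.getDateTimeNanosecond());  // 2018-11-06T05:21:30.123456789Z
  }
}

Existing callers would keep using getDateTime() unchanged; only code that
needs sub-millisecond precision would opt into the new accessor.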

-Rui

On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <re...@google.com> wrote:

> I would vote that we change the internal representation of Row to
> something other than Joda. Java 8 times would give us at least
> microseconds, and if we want nanoseconds we could simply store it as a
> number.
>
> We should still keep accessor methods that return and take Joda objects,
> as the rest of Beam still depends on Joda.
>
> Reuven
>
> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:
>
>> Hi Community,
>>
>> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime
>> (see Row.java#L611
>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L611>
>>  and Row.java#L169
>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L169>).
>> Joda's Datetime is limited to the precision of millisecond. It has good
>> enough precision to represent timestamp of event time, but it is not enough
>> for the real "time" data. For the "time" type data, we probably need to
>> support even up to the precision of nanosecond.
>>
>> Unfortunately, Joda decided to keep the precision of millisecond:
>> https://github.com/JodaOrg/joda-time/issues/139.
>>
>> If we want to support the precision of nanosecond, we could have two
>> options:
>>
>> Option one: utilize current FieldType's metadata field
>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L421>,
>> such that we could set something into meta data and Row could check the
>> metadata to decide what's saved in DATETIME field: Joda's Datetime or an
>> implementation that supports nanosecond.
>>
>> Option two: have another field (maybe called TIMESTAMP field?), to have
>> an implementation to support higher precision of time.
>>
>> What do you think about the need of higher precision for time type and
>> which option is preferred?
>>
>> -Rui
>>
>

Re: [DISCUSS] More precision supported by DATETIME field in Schema

Posted by Reuven Lax <re...@google.com>.
I would vote that we change the internal representation of Row to something
other than Joda. Java 8 times would give us at least microseconds, and if
we want nanoseconds we could simply store it as a number.

We should still keep accessor methods that return and take Joda objects, as
the rest of Beam still depends on Joda.
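
For a concrete illustration (a minimal sketch only, not actual Beam code),
the internal value could be held as a java.time.Instant, or as a plain count
of nanoseconds since the epoch, with the Joda-facing view truncating to
milliseconds:

import java.time.Instant;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

// Sketch comparing possible internal representations; not Beam code.
public class RepresentationSketch {
  public static void main(String[] args) {
    // java.time.Instant carries nanosecond precision.
    Instant internal = Instant.parse("2018-11-06T05:21:30.123456789Z");

    // Alternatively, store nanoseconds since the epoch as a plain number.
    long epochNanos = internal.getEpochSecond() * 1_000_000_000L + internal.getNano();

    // A Joda-facing accessor would round-trip through milliseconds and drop
    // the sub-millisecond digits (.123456789 becomes .123).
    DateTime jodaView = new DateTime(internal.toEpochMilli(), DateTimeZone.UTC);

    System.out.println(internal);    // 2018-11-06T05:21:30.123456789Z
    System.out.println(epochNanos);  // 1541481690123456789
    System.out.println(jodaView);    // 2018-11-06T05:21:30.123Z
  }
}

Either internal form keeps the nanoseconds; only the Joda accessor is lossy.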

Reuven

On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ru...@google.com> wrote:

> Hi Community,
>
> The DATETIME field in Beam Schema/Row is implemented by Joda's Datetime
> (see Row.java#L611
> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L611>
>  and Row.java#L169
> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java#L169>).
> Joda's Datetime is limited to the precision of millisecond. It has good
> enough precision to represent timestamp of event time, but it is not enough
> for the real "time" data. For the "time" type data, we probably need to
> support even up to the precision of nanosecond.
>
> Unfortunately, Joda decided to keep the precision of millisecond:
> https://github.com/JodaOrg/joda-time/issues/139.
>
> If we want to support the precision of nanosecond, we could have two
> options:
>
> Option one: utilize current FieldType's metadata field
> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L421>,
> such that we could set something into meta data and Row could check the
> metadata to decide what's saved in DATETIME field: Joda's Datetime or an
> implementation that supports nanosecond.
>
> Option two: have another field (maybe called TIMESTAMP field?), to have an
> implementation to support higher precision of time.
>
> What do you think about the need of higher precision for time type and
> which option is preferred?
>
> -Rui
>