Posted to dev@beam.apache.org by Luke Cwik <lc...@google.com> on 2019/11/11 22:02:53 UTC

Date/Time Ranges & Protobuf

While crites@ was investigating using protobuf to represent Apache Beam
timestamps within the TestStreamEvents, he found out that the well known
type google.protobuf.Timestamp doesn't support certain timestamps we were
using in our tests (specifically the max timestamp that Apache Beam
supports).

This led me to investigate: the well known type
google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
9999-12-31T23:59:59.999999999Z, which is much smaller than the timestamp
range that Apache Beam currently supports, -9223372036854775ms
to 9223372036854775ms, or roughly 292277 BC to 294247 AD (it was difficult
to find a way to represent this time range).

Similarly, google.protobuf.Duration can only represent spans that fit within
those ~10000 years. Google decided to limit the range to be compatible with
the RFC 3339[1] standard, which does simplify many things since it guarantees
that all RFC 3339 time parsing/manipulation libraries are supported.
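
To make the mismatch concrete, here is a small back-of-the-envelope check (a
sketch only; the protobuf bounds below are the documented date range of
google.protobuf.Timestamp converted to Unix seconds, and the Beam bounds are
the millisecond values quoted above):

    # Beam's min/max timestamps, in milliseconds since the Unix epoch,
    # as quoted above.
    BEAM_MIN_MS = -9223372036854775
    BEAM_MAX_MS = 9223372036854775

    # google.protobuf.Timestamp documents its range as
    # 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z, i.e. seconds in:
    PROTO_MIN_SECONDS = -62135596800
    PROTO_MAX_SECONDS = 253402300799

    # Both Beam bounds fall well outside the protobuf range.
    assert BEAM_MAX_MS // 1000 > PROTO_MAX_SECONDS
    assert BEAM_MIN_MS // 1000 < PROTO_MIN_SECONDS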

Should we:
A) define our own timestamp/duration types to be able to represent the full
time range that Apache Beam can express?
B) limit the valid timestamps in Apache Beam to some standard such as RFC
3339?

This discussion is somewhat related to the efforts to support nano
timestamps[2].

1: https://tools.ietf.org/html/rfc3339
2:
https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E

Re: Date/Time Ranges & Protobuf

Posted by Sam Rohde <sr...@google.com>.
Also made a JIRA:
https://issues.apache.org/jira/projects/BEAM/issues/BEAM-8738

On Mon, Nov 18, 2019 at 2:32 PM Sam Rohde <sr...@google.com> wrote:

> Cool I wrote up https://github.com/apache/beam/pull/10146
>
> On Mon, Nov 18, 2019 at 2:09 PM Luke Cwik <lc...@google.com> wrote:
>
>> Sam, I think doing that makes the most sense right now as we haven't yet
>> had a strong enough consensus to change it so to support all of Beam's
>> timestamps/durations it makes sense to still use the format but work around
>> the limitation that is imposed.
>>
>> On Mon, Nov 18, 2019 at 11:25 AM Sam Rohde <sr...@google.com> wrote:
>>
>>> Timestamp related question: I want to modify Python's utils/timestamp.py
>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py>
>>> module to include google.protobuf.timestamp to/from translation methods.
>>> What do you guys think? Now that we know the timestamp.proto is implicitly
>>> RFC3339 compliant, is it right to include translation methods that could
>>> potentially break that compliance (a la min/max watermarks)? We already use
>>> the timestamp.proto in: windows definitions
>>> <https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/standard_window_fns.proto#L44>,
>>> pubsub messages
>>> <https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/pubsub.proto#L32>,
>>> bundle applications
>>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
>>> metrics
>>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
>>> and logs
>>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L804>.
>>> Is my change okay?
>>>
>>> On Thu, Nov 14, 2019 at 3:40 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> The timestamps flow both ways since:
>>>> * IO authors are responsible for saying what the watermark timestamp is
>>>> and stateful DoFns also allow for users to set timers in relative and
>>>> processing time domains.
>>>> * Runner authors need to understand and merge these timestamps together
>>>> to compute what the global watermark is for a PCollection.
>>>>
>>>> On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <sr...@google.com> wrote:
>>>>
>>>>> My two cents are we just need a proto representation for timestamps
>>>>> and durations that includes units. The underlying library can then
>>>>> determine what to do with it. Then further, we can have a standard across
>>>>> Beam SDKs and Runners of how to interpret the proto. Using a raw int64 for
>>>>> timestamps and durations is confusing and *very very *bug prone (as
>>>>> we have seen in the past).
>>>>>
>>>>> I don't know if this is relevant, but does Apache Beam have any
>>>>> standards surrounding leap years or seconds? If we were to make our own
>>>>> timestamp format, would we have to worry about that? Or is the timestamp
>>>>> supplied to Beam a property of the underlying system giving Beam the
>>>>> timestamp? If it is, then there may be some interop problems between
>>>>> sources.
>>>>>
>>>>> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I do agree that Apache Beam can represent dates and times with
>>>>>> arbitrary precision and can do it many different ways.
>>>>>>
>>>>>> My argument has always been about whether we should restrict this
>>>>>> range to a common standard to increase interoperability across other
>>>>>> systems. For example, SQL database servers have varying degrees as to what
>>>>>> ranges they support:
>>>>>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>>>>>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>>>>>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>>>>>> mask)
>>>>>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>>>>>> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
>>>>>> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>>>>>>
>>>>>> The common case of the global window containing timestamps that are
>>>>>> before and after all of these supported ranges above means that our users
>>>>>> can't represent a global window within a database using its common data
>>>>>> types.
>>>>>>
>>>>>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>>>>>> 2:
>>>>>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>>>>>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>>>>>> 4:
>>>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>>>>>> 5:
>>>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>>>>>
>>>>>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> just an idea on these related topics that appear these days - it
>>>>>>> might help to realize that we don't actually need full arithmetic
>>>>>>> on timestamps (the Beam model IMHO doesn't need to know the
>>>>>>> exact difference of two events). What we actually need is a slightly
>>>>>>> simplified algebra. Given two timestamps T1 and T2 and a "duration" (a
>>>>>>> different type from timestamp), we need operations (not 100% sure that this
>>>>>>> is exhaustive, but seems to be):
>>>>>>>
>>>>>>>  - is_preceding(T1, T2): bool
>>>>>>>
>>>>>>>    - important !is_preceding(T1, T2) does NOT imply that
>>>>>>> is_preceding(T2, T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>>>>>> mean events are _concurrent_
>>>>>>>
>>>>>>>    - this relation has to be also antisymmetric
>>>>>>>
>>>>>>>    - given this function we can construct a comparator, where
>>>>>>> multiple distinct timestamps can be "equal" (or with no particular
>>>>>>> ordering, which is natural property of time)
>>>>>>>
>>>>>>>  - min_timestamp_following(T1, duration): T2
>>>>>>>
>>>>>>>    - that would return a timestamp for which is_preceding(T1 +
>>>>>>> duration, T2) would return true and no other timestamp X would exist for
>>>>>>> which is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>>>>>
>>>>>>>    - actually, this function would serve as the definition for the
>>>>>>> duration object
>>>>>>>
>>>>>>> If we can supply this algebra, it seems that we can use any
>>>>>>> representation of timestamps and intervals. It might be (probably) even
>>>>>>> possible to let user specify his own type used as timestamps and durations,
>>>>>>> which could solve the issues of not currently being able to correctly
>>>>>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>>>>>> for that low timestamps - cosmic microwave background being one example
>>>>>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>>>>>> in Robert's thread [1].
>>>>>>>
>>>>>>> Just my 2 cents.
>>>>>>>
>>>>>>> Jan
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>>>>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>>>>>
>>>>>>> Thanks for bringing up this discussion @Luke.
>>>>>>>
>>>>>>> As @Kenn mentioned, in Beam we have defined the constants value for
>>>>>>> the min/max/end of global window. I noticed that
>>>>>>> google.protobuf.Timestamp/Duration is only used in window
>>>>>>> definitions, such as FixedWindowsPayload, SlidingWindowsPayload,
>>>>>>> SessionsPayload, etc.
>>>>>>>
>>>>>>> I think that both RFC 3339 and Beam's current implementation are big
>>>>>>> enough to express common window definitions. But users can really
>>>>>>> define a window size that is outside the scope of RFC 3339.
>>>>>>> Conceptually, we should not limit the time range for
>>>>>>> windows (although I think the range of RFC 3339 is big enough in most cases).
>>>>>>>
>>>>>>> To ensure that people well know the background of the discussion,
>>>>>>> hope you don't mind that I put the original conversion thread[1] here.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jincheng
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>>>>>
>>>>>>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>>>>>>
>>>>>>>> I agree about it being a tagged union in the model (together with
>>>>>>>> actual_time(...) - epsilon). It's not just a performance hack
>>>>>>>> though,
>>>>>>>> it's also (as discussed elsewhere) a question of being able to find
>>>>>>>> an
>>>>>>>> embedding into existing datetime libraries. The real question here
>>>>>>>> is
>>>>>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>>>>>> find value in being able to process events for the lifetime of the
>>>>>>>> universe (or, at least, recorded human history). Artificially
>>>>>>>> limiting
>>>>>>>> in this way would seem surprising to me at least.
>>>>>>>>
>>>>>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > The max timestamp, min timestamp, and end of the global window
>>>>>>>> are all performance hacks in my view. Timestamps in beam are really a
>>>>>>>> tagged union:
>>>>>>>> >
>>>>>>>> >     timestamp ::= min | max | end_of_global | actual_time(...
>>>>>>>> some quantitative timestamp ...)
>>>>>>>> >
>>>>>>>> > with the ordering
>>>>>>>> >
>>>>>>>> >     min < actual_time(...) < end_of_global < max
>>>>>>>> >
>>>>>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>>>>>> comparisons and arithmetic.
>>>>>>>> >
>>>>>>>> > Kenn
>>>>>>>> >
>>>>>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >> While crites@ was investigating using protobuf to represent
>>>>>>>> Apache Beam timestamps within the TestStreamEvents, he found out that the
>>>>>>>> well known type google.protobuf.Timestamp doesn't support certain
>>>>>>>> timestamps we were using in our tests (specifically the max timestamp that
>>>>>>>> Apache Beam supports).
>>>>>>>> >>
>>>>>>>> >> This lead me to investigate and the well known type
>>>>>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>>>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>>>>>> range that Apache Beam currently supports -9223372036854775ms to
>>>>>>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>>>>>>> find a time range that represented this).
>>>>>>>> >>
>>>>>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>>>>>> over those ~10000 years. Google decided to limit their range to be
>>>>>>>> compatible with the RFC 3339[2] standard to which does simplify many things
>>>>>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>>>>>> are supported.
>>>>>>>> >>
>>>>>>>> >> Should we:
>>>>>>>> >> A) define our own timestamp/duration types to be able to
>>>>>>>> represent the full time range that Apache Beam can express?
>>>>>>>> >> B) limit the valid timestamps in Apache Beam to some standard
>>>>>>>> such as RFC 3339?
>>>>>>>> >>
>>>>>>>> >> This discussion is somewhat related to the efforts to support
>>>>>>>> nano timestamps[2].
>>>>>>>> >>
>>>>>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>>>>>> >> 2:
>>>>>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>

Re: Date/Time Ranges & Protobuf

Posted by Sam Rohde <sr...@google.com>.
Cool I wrote up https://github.com/apache/beam/pull/10146

On Mon, Nov 18, 2019 at 2:09 PM Luke Cwik <lc...@google.com> wrote:

> Sam, I think doing that makes the most sense right now as we haven't yet
> had a strong enough consensus to change it so to support all of Beam's
> timestamps/durations it makes sense to still use the format but work around
> the limitation that is imposed.
>
> On Mon, Nov 18, 2019 at 11:25 AM Sam Rohde <sr...@google.com> wrote:
>
>> Timestamp related question: I want to modify Python's utils/timestamp.py
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py>
>> module to include google.protobuf.timestamp to/from translation methods.
>> What do you guys think? Now that we know the timestamp.proto is implicitly
>> RFC3339 compliant, is it right to include translation methods that could
>> potentially break that compliance (a la min/max watermarks)? We already use
>> the timestamp.proto in: windows definitions
>> <https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/standard_window_fns.proto#L44>,
>> pubsub messages
>> <https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/pubsub.proto#L32>,
>> bundle applications
>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
>> metrics
>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
>> and logs
>> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L804>.
>> Is my change okay?
>>
>> On Thu, Nov 14, 2019 at 3:40 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> The timestamps flow both ways since:
>>> * IO authors are responsible for saying what the watermark timestamp is
>>> and stateful DoFns also allow for users to set timers in relative and
>>> processing time domains.
>>> * Runner authors need to understand and merge these timestamps together
>>> to compute what the global watermark is for a PCollection.
>>>
>>> On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <sr...@google.com> wrote:
>>>
>>>> My two cents are we just need a proto representation for timestamps and
>>>> durations that includes units. The underlying library can then determine
>>>> what to do with it. Then further, we can have a standard across Beam SDKs
>>>> and Runners of how to interpret the proto. Using a raw int64 for timestamps
>>>> and durations is confusing and *very very *bug prone (as we have seen
>>>> in the past).
>>>>
>>>> I don't know if this is relevant, but does Apache Beam have any
>>>> standards surrounding leap years or seconds? If we were to make our own
>>>> timestamp format, would we have to worry about that? Or is the timestamp
>>>> supplied to Beam a property of the underlying system giving Beam the
>>>> timestamp? If it is, then there may be some interop problems between
>>>> sources.
>>>>
>>>> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> I do agree that Apache Beam can represent dates and times with
>>>>> arbitrary precision and can do it many different ways.
>>>>>
>>>>> My argument has always been should around whether we restrict this
>>>>> range to a common standard to increase interoperability across other
>>>>> systems. For example, SQL database servers have varying degrees as to what
>>>>> ranges they support:
>>>>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>>>>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>>>>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>>>>> mask)
>>>>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>>>>> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
>>>>> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>>>>>
>>>>> The common case of the global window containing timestamps that are
>>>>> before and after all of these supported ranges above means that our users
>>>>> can't represent a global window within a database using its common data
>>>>> types.
>>>>>
>>>>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>>>>> 2:
>>>>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>>>>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>>>>> 4:
>>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>>>>> 5:
>>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>>>>
>>>>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> just an idea on these related topics that appear these days - it
>>>>>> might help to realize, that what we actually don't need a full arithmetic
>>>>>> on timestamps (Beam model IMHO doesn't need to know exactly what is the
>>>>>> exact difference of two events). What we actually need is a slightly
>>>>>> simplified algebra. Given two timestamps T1 and T2 and a "duration" (a
>>>>>> different type from timestamp), we need operations (not 100% sure that this
>>>>>> is exhaustive, but seems to be):
>>>>>>
>>>>>>  - is_preceding(T1, T2): bool
>>>>>>
>>>>>>    - important !is_preceding(T1, T2) does NOT imply that
>>>>>> is_preceding(T2, T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>>>>> mean events are _concurrent_
>>>>>>
>>>>>>    - this relation has to be also antisymmetric
>>>>>>
>>>>>>    - given this function we can construct a comparator, where
>>>>>> multiple distinct timestamps can be "equal" (or with no particular
>>>>>> ordering, which is natural property of time)
>>>>>>
>>>>>>  - min_timestamp_following(T1, duration): T2
>>>>>>
>>>>>>    - that would return a timestamp for which is_preceding(T1 +
>>>>>> duration, T2) would return true and no other timestamp X would exist for
>>>>>> which is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>>>>
>>>>>>    - actually, this function would serve as the definition for the
>>>>>> duration object
>>>>>>
>>>>>> If we can supply this algebra, it seems that we can use any
>>>>>> representation of timestamps and intervals. It might be (probably) even
>>>>>> possible to let user specify his own type used as timestamps and durations,
>>>>>> which could solve the issues of not currently being able to correctly
>>>>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>>>>> for that low timestamps - cosmic microwave background being one example
>>>>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>>>>> in Robert's thread [1].
>>>>>>
>>>>>> Just my 2 cents.
>>>>>>
>>>>>> Jan
>>>>>>
>>>>>> [1]
>>>>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>>>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>>>>
>>>>>> Thanks for bringing up this discussion @Luke.
>>>>>>
>>>>>> As @Kenn mentioned, in Beam we have defined the constants value for
>>>>>> the min/max/end of global window. I noticed that
>>>>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>>>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>>>>
>>>>>> I think that both RFC 3339 and Beam's current implementation are big
>>>>>> enough to express a common window definitions. But users can really
>>>>>> define a window size that outside the scope of the RFC 3339.
>>>>>> Conceptually, we should not limit the time range for window(although
>>>>>> I think the range of RPC 3339 is big enough in most cases).
>>>>>>
>>>>>> To ensure that people well know the background of the discussion,
>>>>>> hope you don't mind that I put the original conversion thread[1] here.
>>>>>>
>>>>>> Best,
>>>>>> Jincheng
>>>>>>
>>>>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>>>>
>>>>>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>>>>>
>>>>>>> I agree about it being a tagged union in the model (together with
>>>>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>>>>> it's also (as discussed elsewhere) a question of being able to find
>>>>>>> an
>>>>>>> embedding into existing datetime libraries. The real question here is
>>>>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>>>>> find value in being able to process events for the lifetime of the
>>>>>>> universe (or, at least, recorded human history). Artificially
>>>>>>> limiting
>>>>>>> in this way would seem surprising to me at least.
>>>>>>>
>>>>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > The max timestamp, min timestamp, and end of the global window are
>>>>>>> all performance hacks in my view. Timestamps in beam are really a tagged
>>>>>>> union:
>>>>>>> >
>>>>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>>>>>> quantitative timestamp ...)
>>>>>>> >
>>>>>>> > with the ordering
>>>>>>> >
>>>>>>> >     min < actual_time(...) < end_of_global < max
>>>>>>> >
>>>>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>>>>> comparisons and arithmetic.
>>>>>>> >
>>>>>>> > Kenn
>>>>>>> >
>>>>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> While crites@ was investigating using protobuf to represent
>>>>>>> Apache Beam timestamps within the TestStreamEvents, he found out that the
>>>>>>> well known type google.protobuf.Timestamp doesn't support certain
>>>>>>> timestamps we were using in our tests (specifically the max timestamp that
>>>>>>> Apache Beam supports).
>>>>>>> >>
>>>>>>> >> This lead me to investigate and the well known type
>>>>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>>>>> range that Apache Beam currently supports -9223372036854775ms to
>>>>>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>>>>>> find a time range that represented this).
>>>>>>> >>
>>>>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>>>>> over those ~10000 years. Google decided to limit their range to be
>>>>>>> compatible with the RFC 3339[2] standard to which does simplify many things
>>>>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>>>>> are supported.
>>>>>>> >>
>>>>>>> >> Should we:
>>>>>>> >> A) define our own timestamp/duration types to be able to
>>>>>>> represent the full time range that Apache Beam can express?
>>>>>>> >> B) limit the valid timestamps in Apache Beam to some standard
>>>>>>> such as RFC 3339?
>>>>>>> >>
>>>>>>> >> This discussion is somewhat related to the efforts to support
>>>>>>> nano timestamps[2].
>>>>>>> >>
>>>>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>>>>> >> 2:
>>>>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>

Re: Date/Time Ranges & Protobuf

Posted by Luke Cwik <lc...@google.com>.
Sam, I think doing that makes the most sense right now. We haven't yet
reached a strong enough consensus to change the representation, so to support
all of Beam's timestamps/durations it makes sense to keep using the format but
work around the limitation it imposes.

On Mon, Nov 18, 2019 at 11:25 AM Sam Rohde <sr...@google.com> wrote:

> Timestamp related question: I want to modify Python's utils/timestamp.py
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py>
> module to include google.protobuf.timestamp to/from translation methods.
> What do you guys think? Now that we know the timestamp.proto is implicitly
> RFC3339 compliant, is it right to include translation methods that could
> potentially break that compliance (a la min/max watermarks)? We already use
> the timestamp.proto in: windows definitions
> <https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/standard_window_fns.proto#L44>,
> pubsub messages
> <https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/pubsub.proto#L32>,
> bundle applications
> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
> metrics
> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
> and logs
> <https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L804>.
> Is my change okay?
>
> On Thu, Nov 14, 2019 at 3:40 PM Luke Cwik <lc...@google.com> wrote:
>
>> The timestamps flow both ways since:
>> * IO authors are responsible for saying what the watermark timestamp is
>> and stateful DoFns also allow for users to set timers in relative and
>> processing time domains.
>> * Runner authors need to understand and merge these timestamps together
>> to compute what the global watermark is for a PCollection.
>>
>> On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> My two cents are we just need a proto representation for timestamps and
>>> durations that includes units. The underlying library can then determine
>>> what to do with it. Then further, we can have a standard across Beam SDKs
>>> and Runners of how to interpret the proto. Using a raw int64 for timestamps
>>> and durations is confusing and *very very *bug prone (as we have seen
>>> in the past).
>>>
>>> I don't know if this is relevant, but does Apache Beam have any
>>> standards surrounding leap years or seconds? If we were to make our own
>>> timestamp format, would we have to worry about that? Or is the timestamp
>>> supplied to Beam a property of the underlying system giving Beam the
>>> timestamp? If it is, then there may be some interop problems between
>>> sources.
>>>
>>> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I do agree that Apache Beam can represent dates and times with
>>>> arbitrary precision and can do it many different ways.
>>>>
>>>> My argument has always been should around whether we restrict this
>>>> range to a common standard to increase interoperability across other
>>>> systems. For example, SQL database servers have varying degrees as to what
>>>> ranges they support:
>>>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>>>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>>>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>>>> mask)
>>>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>>>> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
>>>> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>>>>
>>>> The common case of the global window containing timestamps that are
>>>> before and after all of these supported ranges above means that our users
>>>> can't represent a global window within a database using its common data
>>>> types.
>>>>
>>>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>>>> 2:
>>>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>>>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>>>> 4:
>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>>>> 5:
>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>>>
>>>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> just an idea on these related topics that appear these days - it might
>>>>> help to realize, that what we actually don't need a full arithmetic on
>>>>> timestamps (Beam model IMHO doesn't need to know exactly what is the exact
>>>>> difference of two events). What we actually need is a slightly simplified
>>>>> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
>>>>> from timestamp), we need operations (not 100% sure that this is exhaustive,
>>>>> but seems to be):
>>>>>
>>>>>  - is_preceding(T1, T2): bool
>>>>>
>>>>>    - important !is_preceding(T1, T2) does NOT imply that
>>>>> is_preceding(T2, T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>>>> mean events are _concurrent_
>>>>>
>>>>>    - this relation has to be also antisymmetric
>>>>>
>>>>>    - given this function we can construct a comparator, where multiple
>>>>> distinct timestamps can be "equal" (or with no particular ordering, which
>>>>> is natural property of time)
>>>>>
>>>>>  - min_timestamp_following(T1, duration): T2
>>>>>
>>>>>    - that would return a timestamp for which is_preceding(T1 +
>>>>> duration, T2) would return true and no other timestamp X would exist for
>>>>> which is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>>>
>>>>>    - actually, this function would serve as the definition for the
>>>>> duration object
>>>>>
>>>>> If we can supply this algebra, it seems that we can use any
>>>>> representation of timestamps and intervals. It might be (probably) even
>>>>> possible to let user specify his own type used as timestamps and durations,
>>>>> which could solve the issues of not currently being able to correctly
>>>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>>>> for that low timestamps - cosmic microwave background being one example
>>>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>>>> in Robert's thread [1].
>>>>>
>>>>> Just my 2 cents.
>>>>>
>>>>> Jan
>>>>>
>>>>> [1]
>>>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>>>
>>>>> Thanks for bringing up this discussion @Luke.
>>>>>
>>>>> As @Kenn mentioned, in Beam we have defined the constants value for
>>>>> the min/max/end of global window. I noticed that
>>>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>>>
>>>>> I think that both RFC 3339 and Beam's current implementation are big
>>>>> enough to express a common window definitions. But users can really
>>>>> define a window size that outside the scope of the RFC 3339.
>>>>> Conceptually, we should not limit the time range for window(although
>>>>> I think the range of RPC 3339 is big enough in most cases).
>>>>>
>>>>> To ensure that people well know the background of the discussion, hope
>>>>> you don't mind that I put the original conversion thread[1] here.
>>>>>
>>>>> Best,
>>>>> Jincheng
>>>>>
>>>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>>>
>>>>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>>>>
>>>>>> I agree about it being a tagged union in the model (together with
>>>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>>>> it's also (as discussed elsewhere) a question of being able to find an
>>>>>> embedding into existing datetime libraries. The real question here is
>>>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>>>> find value in being able to process events for the lifetime of the
>>>>>> universe (or, at least, recorded human history). Artificially limiting
>>>>>> in this way would seem surprising to me at least.
>>>>>>
>>>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > The max timestamp, min timestamp, and end of the global window are
>>>>>> all performance hacks in my view. Timestamps in beam are really a tagged
>>>>>> union:
>>>>>> >
>>>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>>>>> quantitative timestamp ...)
>>>>>> >
>>>>>> > with the ordering
>>>>>> >
>>>>>> >     min < actual_time(...) < end_of_global < max
>>>>>> >
>>>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>>>> comparisons and arithmetic.
>>>>>> >
>>>>>> > Kenn
>>>>>> >
>>>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>>>>>> >>
>>>>>> >> While crites@ was investigating using protobuf to represent
>>>>>> Apache Beam timestamps within the TestStreamEvents, he found out that the
>>>>>> well known type google.protobuf.Timestamp doesn't support certain
>>>>>> timestamps we were using in our tests (specifically the max timestamp that
>>>>>> Apache Beam supports).
>>>>>> >>
>>>>>> >> This lead me to investigate and the well known type
>>>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>>>> range that Apache Beam currently supports -9223372036854775ms to
>>>>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>>>>> find a time range that represented this).
>>>>>> >>
>>>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>>>> over those ~10000 years. Google decided to limit their range to be
>>>>>> compatible with the RFC 3339[2] standard to which does simplify many things
>>>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>>>> are supported.
>>>>>> >>
>>>>>> >> Should we:
>>>>>> >> A) define our own timestamp/duration types to be able to represent
>>>>>> the full time range that Apache Beam can express?
>>>>>> >> B) limit the valid timestamps in Apache Beam to some standard such
>>>>>> as RFC 3339?
>>>>>> >>
>>>>>> >> This discussion is somewhat related to the efforts to support nano
>>>>>> timestamps[2].
>>>>>> >>
>>>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>>>> >> 2:
>>>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>

Re: Date/Time Ranges & Protobuf

Posted by Sam Rohde <sr...@google.com>.
Timestamp-related question: I want to modify Python's utils/timestamp.py
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/timestamp.py>
module to include to/from translation methods for google.protobuf.Timestamp.
What do you guys think? Now that we know the timestamp.proto is implicitly
RFC 3339 compliant, is it right to include translation methods that could
potentially break that compliance (a la min/max watermarks)? We already use
the timestamp.proto in: window definitions
<https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/standard_window_fns.proto#L44>,
pubsub messages
<https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/windmill/src/main/proto/pubsub.proto#L32>,
bundle applications
<https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
metrics
<https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L173>,
and logs
<https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L804>.
Is my change okay?
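
For reference, a rough sketch of what such translation methods might look
like (the names and behavior here are illustrative, assuming the existing
micros-based Timestamp in utils/timestamp.py; this is not the actual PR):

    from google.protobuf import timestamp_pb2

    class Timestamp(object):
      def __init__(self, micros=0):
        self.micros = micros

      def to_proto(self):
        # Note: this may produce values outside the documented RFC 3339 range
        # of google.protobuf.Timestamp (e.g. for min/max watermarks); whether
        # to clamp or pass them through is exactly the open question above.
        seconds, micros = divmod(self.micros, 1000000)
        return timestamp_pb2.Timestamp(seconds=seconds, nanos=micros * 1000)

      @staticmethod
      def from_proto(proto_ts):
        # Sub-microsecond precision in nanos is truncated.
        return Timestamp(micros=proto_ts.seconds * 1000000 + proto_ts.nanos // 1000)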

On Thu, Nov 14, 2019 at 3:40 PM Luke Cwik <lc...@google.com> wrote:

> The timestamps flow both ways since:
> * IO authors are responsible for saying what the watermark timestamp is
> and stateful DoFns also allow for users to set timers in relative and
> processing time domains.
> * Runner authors need to understand and merge these timestamps together to
> compute what the global watermark is for a PCollection.
>
> On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <sr...@google.com> wrote:
>
>> My two cents are we just need a proto representation for timestamps and
>> durations that includes units. The underlying library can then determine
>> what to do with it. Then further, we can have a standard across Beam SDKs
>> and Runners of how to interpret the proto. Using a raw int64 for timestamps
>> and durations is confusing and *very very *bug prone (as we have seen in
>> the past).
>>
>> I don't know if this is relevant, but does Apache Beam have any standards
>> surrounding leap years or seconds? If we were to make our own timestamp
>> format, would we have to worry about that? Or is the timestamp supplied to
>> Beam a property of the underlying system giving Beam the timestamp? If it
>> is, then there may be some interop problems between sources.
>>
>> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> I do agree that Apache Beam can represent dates and times with arbitrary
>>> precision and can do it many different ways.
>>>
>>> My argument has always been should around whether we restrict this range
>>> to a common standard to increase interoperability across other systems. For
>>> example, SQL database servers have varying degrees as to what ranges they
>>> support:
>>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>>> mask)
>>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>>> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
>>> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>>>
>>> The common case of the global window containing timestamps that are
>>> before and after all of these supported ranges above means that our users
>>> can't represent a global window within a database using its common data
>>> types.
>>>
>>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>>> 2:
>>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>>> 4:
>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>>> 5:
>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>>
>>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi,
>>>>
>>>> just an idea on these related topics that appear these days - it might
>>>> help to realize, that what we actually don't need a full arithmetic on
>>>> timestamps (Beam model IMHO doesn't need to know exactly what is the exact
>>>> difference of two events). What we actually need is a slightly simplified
>>>> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
>>>> from timestamp), we need operations (not 100% sure that this is exhaustive,
>>>> but seems to be):
>>>>
>>>>  - is_preceding(T1, T2): bool
>>>>
>>>>    - important !is_preceding(T1, T2) does NOT imply that
>>>> is_preceding(T2, T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>>> mean events are _concurrent_
>>>>
>>>>    - this relation has to be also antisymmetric
>>>>
>>>>    - given this function we can construct a comparator, where multiple
>>>> distinct timestamps can be "equal" (or with no particular ordering, which
>>>> is natural property of time)
>>>>
>>>>  - min_timestamp_following(T1, duration): T2
>>>>
>>>>    - that would return a timestamp for which is_preceding(T1 +
>>>> duration, T2) would return true and no other timestamp X would exist for
>>>> which is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>>
>>>>    - actually, this function would serve as the definition for the
>>>> duration object
>>>>
>>>> If we can supply this algebra, it seems that we can use any
>>>> representation of timestamps and intervals. It might be (probably) even
>>>> possible to let user specify his own type used as timestamps and durations,
>>>> which could solve the issues of not currently being able to correctly
>>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>>> for that low timestamps - cosmic microwave background being one example
>>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>>> in Robert's thread [1].
>>>>
>>>> Just my 2 cents.
>>>>
>>>> Jan
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>>
>>>> Thanks for bringing up this discussion @Luke.
>>>>
>>>> As @Kenn mentioned, in Beam we have defined the constants value for the
>>>> min/max/end of global window. I noticed that
>>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>>
>>>> I think that both RFC 3339 and Beam's current implementation are big
>>>> enough to express a common window definitions. But users can really
>>>> define a window size that outside the scope of the RFC 3339.
>>>> Conceptually, we should not limit the time range for window(although I
>>>> think the range of RPC 3339 is big enough in most cases).
>>>>
>>>> To ensure that people well know the background of the discussion, hope
>>>> you don't mind that I put the original conversion thread[1] here.
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>>
>>>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>>>
>>>>> I agree about it being a tagged union in the model (together with
>>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>>> it's also (as discussed elsewhere) a question of being able to find an
>>>>> embedding into existing datetime libraries. The real question here is
>>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>>> find value in being able to process events for the lifetime of the
>>>>> universe (or, at least, recorded human history). Artificially limiting
>>>>> in this way would seem surprising to me at least.
>>>>>
>>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > The max timestamp, min timestamp, and end of the global window are
>>>>> all performance hacks in my view. Timestamps in beam are really a tagged
>>>>> union:
>>>>> >
>>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>>>> quantitative timestamp ...)
>>>>> >
>>>>> > with the ordering
>>>>> >
>>>>> >     min < actual_time(...) < end_of_global < max
>>>>> >
>>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>>> comparisons and arithmetic.
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>>>>> >>
>>>>> >> While crites@ was investigating using protobuf to represent Apache
>>>>> Beam timestamps within the TestStreamEvents, he found out that the well
>>>>> known type google.protobuf.Timestamp doesn't support certain timestamps we
>>>>> were using in our tests (specifically the max timestamp that Apache Beam
>>>>> supports).
>>>>> >>
>>>>> >> This lead me to investigate and the well known type
>>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>>> range that Apache Beam currently supports -9223372036854775ms to
>>>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>>>> find a time range that represented this).
>>>>> >>
>>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>>> over those ~10000 years. Google decided to limit their range to be
>>>>> compatible with the RFC 3339[2] standard to which does simplify many things
>>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>>> are supported.
>>>>> >>
>>>>> >> Should we:
>>>>> >> A) define our own timestamp/duration types to be able to represent
>>>>> the full time range that Apache Beam can express?
>>>>> >> B) limit the valid timestamps in Apache Beam to some standard such
>>>>> as RFC 3339?
>>>>> >>
>>>>> >> This discussion is somewhat related to the efforts to support nano
>>>>> timestamps[2].
>>>>> >>
>>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>>> >> 2:
>>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>>
>>>>

Re: Date/Time Ranges & Protobuf

Posted by Luke Cwik <lc...@google.com>.
The timestamps flow both ways since:
* IO authors are responsible for saying what the watermark timestamp is, and
stateful DoFns also allow users to set timers in relative and
processing time domains.
* Runner authors need to understand and merge these timestamps together to
compute what the global watermark is for a PCollection.
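
As a minimal sketch of the runner-side direction (assumed semantics and an
invented helper name, not actual runner code), "merging these timestamps"
amounts to taking the minimum over the producers' reported watermarks, with
Beam's numeric sentinels making plain comparison work:

    # Beam's min/max timestamp sentinels, in milliseconds.
    BEAM_TIMESTAMP_MIN_MS = -9223372036854775
    BEAM_TIMESTAMP_MAX_MS = 9223372036854775

    def merged_watermark_ms(producer_watermarks_ms):
        # No producers reporting yet means the watermark stays at the minimum.
        if not producer_watermarks_ms:
            return BEAM_TIMESTAMP_MIN_MS
        return min(producer_watermarks_ms)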

On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde <sr...@google.com> wrote:

> My two cents are we just need a proto representation for timestamps and
> durations that includes units. The underlying library can then determine
> what to do with it. Then further, we can have a standard across Beam SDKs
> and Runners of how to interpret the proto. Using a raw int64 for timestamps
> and durations is confusing and *very very *bug prone (as we have seen in
> the past).
>
> I don't know if this is relevant, but does Apache Beam have any standards
> surrounding leap years or seconds? If we were to make our own timestamp
> format, would we have to worry about that? Or is the timestamp supplied to
> Beam a property of the underlying system giving Beam the timestamp? If it
> is, then there may be some interop problems between sources.
>
> On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:
>
>> I do agree that Apache Beam can represent dates and times with arbitrary
>> precision and can do it many different ways.
>>
>> My argument has always been should around whether we restrict this range
>> to a common standard to increase interoperability across other systems. For
>> example, SQL database servers have varying degrees as to what ranges they
>> support:
>> * Oracle 10[1]: 0001-01-01 to 9999-12-31
>> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
>> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
>> mask)
>> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
>> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
>> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>>
>> The common case of the global window containing timestamps that are
>> before and after all of these supported ranges above means that our users
>> can't represent a global window within a database using its common data
>> types.
>>
>> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
>> 2:
>> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
>> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
>> 4:
>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
>> 5:
>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>>
>> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi,
>>>
>>> just an idea on these related topics that appear these days - it might
>>> help to realize, that what we actually don't need a full arithmetic on
>>> timestamps (Beam model IMHO doesn't need to know exactly what is the exact
>>> difference of two events). What we actually need is a slightly simplified
>>> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
>>> from timestamp), we need operations (not 100% sure that this is exhaustive,
>>> but seems to be):
>>>
>>>  - is_preceding(T1, T2): bool
>>>
>>>    - important !is_preceding(T1, T2) does NOT imply that
>>> is_preceding(T2, T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would
>>> mean events are _concurrent_
>>>
>>>    - this relation has to be also antisymmetric
>>>
>>>    - given this function we can construct a comparator, where multiple
>>> distinct timestamps can be "equal" (or with no particular ordering, which
>>> is natural property of time)
>>>
>>>  - min_timestamp_following(T1, duration): T2
>>>
>>>    - that would return a timestamp for which is_preceding(T1 + duration,
>>> T2) would return true and no other timestamp X would exist for which
>>> is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>>
>>>    - actually, this function would serve as the definition for the
>>> duration object
>>>
>>> If we can supply this algebra, it seems that we can use any
>>> representation of timestamps and intervals. It might be (probably) even
>>> possible to let user specify his own type used as timestamps and durations,
>>> which could solve the issues of not currently being able to correctly
>>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>>> for that low timestamps - cosmic microwave background being one example
>>> :)). Specifying this algebra actually probably boils down to proposal (3)
>>> in Robert's thread [1].
>>>
>>> Just my 2 cents.
>>>
>>> Jan
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>>
>>> Thanks for bringing up this discussion @Luke.
>>>
>>> As @Kenn mentioned, in Beam we have defined the constants value for the
>>> min/max/end of global window. I noticed that
>>> google.protobuf.Timestamp/Duration is only used in window definitions,
>>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>>
>>> I think that both RFC 3339 and Beam's current implementation are big
>>> enough to express a common window definitions. But users can really
>>> define a window size that outside the scope of the RFC 3339.
>>> Conceptually, we should not limit the time range for window(although I
>>> think the range of RPC 3339 is big enough in most cases).
>>>
>>> To ensure that people well know the background of the discussion, hope
>>> you don't mind that I put the original conversion thread[1] here.
>>>
>>> Best,
>>> Jincheng
>>>
>>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>>
>>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>>
>>>> I agree about it being a tagged union in the model (together with
>>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>>> it's also (as discussed elsewhere) a question of being able to find an
>>>> embedding into existing datetime libraries. The real question here is
>>>> whether we should limit ourselves to just these 10000 years AD, or
>>>> find value in being able to process events for the lifetime of the
>>>> universe (or, at least, recorded human history). Artificially limiting
>>>> in this way would seem surprising to me at least.
>>>>
>>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>> >
>>>> > The max timestamp, min timestamp, and end of the global window are
>>>> all performance hacks in my view. Timestamps in beam are really a tagged
>>>> union:
>>>> >
>>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>>> quantitative timestamp ...)
>>>> >
>>>> > with the ordering
>>>> >
>>>> >     min < actual_time(...) < end_of_global < max
>>>> >
>>>> > We chose arbitrary numbers so that we could do simple numeric
>>>> comparisons and arithmetic.
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>>>> >>
>>>> >> While crites@ was investigating using protobuf to represent Apache
>>>> Beam timestamps within the TestStreamEvents, he found out that the well
>>>> known type google.protobuf.Timestamp doesn't support certain timestamps we
>>>> were using in our tests (specifically the max timestamp that Apache Beam
>>>> supports).
>>>> >>
>>>> >> This lead me to investigate and the well known type
>>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>>> range that Apache Beam currently supports -9223372036854775ms to
>>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>>> find a time range that represented this).
>>>> >>
>>>> >> Similarly the google.protobuf.Duration represents any time range
>>>> over those ~10000 years. Google decided to limit their range to be
>>>> compatible with the RFC 3339[2] standard to which does simplify many things
>>>> since it guarantees that all RFC 3339 time parsing/manipulation libraries
>>>> are supported.
>>>> >>
>>>> >> Should we:
>>>> >> A) define our own timestamp/duration types to be able to represent
>>>> the full time range that Apache Beam can express?
>>>> >> B) limit the valid timestamps in Apache Beam to some standard such
>>>> as RFC 3339?
>>>> >>
>>>> >> This discussion is somewhat related to the efforts to support nano
>>>> timestamps[2].
>>>> >>
>>>> >> 1: https://tools.ietf.org/html/rfc3339
>>>> >> 2:
>>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>>
>>>

Re: Date/Time Ranges & Protobuf

Posted by Sam Rohde <sr...@google.com>.
My two cents: we just need a proto representation for timestamps and
durations that includes units. The underlying library can then determine
what to do with it. Going further, we can have a standard across Beam SDKs
and Runners for how to interpret the proto. Using a raw int64 for timestamps
and durations is confusing and *very very* bug-prone (as we have seen in
the past).

I don't know if this is relevant, but does Apache Beam have any standards
surrounding leap years or leap seconds? If we were to make our own timestamp
format, would we have to worry about that? Or is the timestamp supplied to
Beam a property of the underlying system giving Beam the timestamp? If it
is, then there may be some interop problems between sources.
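
Purely as an illustration of the "includes units" idea (the TimeUnit and
UnitTimestamp names below are invented for this sketch, not a concrete
proposal), the shape could be a value paired with a unit tag, so nobody has
to guess what a bare int64 means:

    import enum

    class TimeUnit(enum.Enum):
      SECONDS = 1
      MILLIS = 2
      MICROS = 3
      NANOS = 4

    _TO_NANOS = {
        TimeUnit.SECONDS: 10**9,
        TimeUnit.MILLIS: 10**6,
        TimeUnit.MICROS: 10**3,
        TimeUnit.NANOS: 1,
    }

    class UnitTimestamp(object):
      """A timestamp carried as (value, unit) instead of a bare int64."""

      def __init__(self, value, unit):
        self.value = value
        self.unit = unit

      def to_nanos(self):
        return self.value * _TO_NANOS[self.unit]

      def __lt__(self, other):
        # Comparisons are well defined even when the two sides use
        # different units.
        return self.to_nanos() < other.to_nanos()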

On Wed, Nov 13, 2019 at 10:35 AM Luke Cwik <lc...@google.com> wrote:

> I do agree that Apache Beam can represent dates and times with arbitrary
> precision and can do it many different ways.
>
> My argument has always been should around whether we restrict this range
> to a common standard to increase interoperability across other systems. For
> example, SQL database servers have varying degrees as to what ranges they
> support:
> * Oracle 10[1]: 0001-01-01 to 9999-12-31
> * Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
> December 31, 9999 CE (Common Era, or 'AD'). Unless BCE ('BC' in the format
> mask)
> * MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
> * Microsoft SQL:  January 1, 1753, through December 31, 9999 for
> datetime[4] and January 1,1 CE through December 31, 9999 CE for datetime2[5]
>
> The common case of the global window containing timestamps that are before
> and after all of these supported ranges above means that our users can't
> represent a global window within a database using its common data types.
>
> 1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
> 2:
> https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
> 3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
> 4:
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
> 5:
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>
> On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi,
>>
>> just an idea on these related topics that appear these days - it might
>> help to realize, that what we actually don't need a full arithmetic on
>> timestamps (Beam model IMHO doesn't need to know exactly what is the exact
>> difference of two events). What we actually need is a slightly simplified
>> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
>> from timestamp), we need operations (not 100% sure that this is exhaustive,
>> but seems to be):
>>
>>  - is_preceding(T1, T2): bool
>>
>>    - important !is_preceding(T1, T2) does NOT imply that is_preceding(T2,
>> T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would mean events are
>> _concurrent_
>>
>>    - this relation has to be also antisymmetric
>>
>>    - given this function we can construct a comparator, where multiple
>> distinct timestamps can be "equal" (or with no particular ordering, which
>> is natural property of time)
>>
>>  - min_timestamp_following(T1, duration): T2
>>
>>    - that would return a timestamp for which is_preceding(T1 + duration,
>> T2) would return true and no other timestamp X would exist for which
>> is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>>
>>    - actually, this function would serve as the definition for the
>> duration object
>>
>> If we can supply this algebra, it seems that we can use any
>> representation of timestamps and intervals. It might be (probably) even
>> possible to let user specify his own type used as timestamps and durations,
>> which could solve the issues of not currently being able to correctly
>> represent timestamps lower than Long.MIN_VALUE (although we can get data
>> for that low timestamps - cosmic microwave background being one example
>> :)). Specifying this algebra actually probably boils down to proposal (3)
>> in Robert's thread [1].
>>
>> Just my 2 cents.
>>
>> Jan
>>
>> [1]
>> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
>> On 11/13/19 10:11 AM, jincheng sun wrote:
>>
>> Thanks for bringing up this discussion @Luke.
>>
>> As @Kenn mentioned, in Beam we have defined the constants value for the
>> min/max/end of global window. I noticed that
>> google.protobuf.Timestamp/Duration is only used in window definitions,
>> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>>
>> I think that both RFC 3339 and Beam's current implementation are big
>> enough to express a common window definitions. But users can really
>> define a window size that outside the scope of the RFC 3339. Conceptually,
>> we should not limit the time range for window(although I think the range of
>> RPC 3339 is big enough in most cases).
>>
>> To ensure that people well know the background of the discussion, hope
>> you don't mind that I put the original conversion thread[1] here.
>>
>> Best,
>> Jincheng
>>
>> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>>
>> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>>
>>> I agree about it being a tagged union in the model (together with
>>> actual_time(...) - epsilon). It's not just a performance hack though,
>>> it's also (as discussed elsewhere) a question of being able to find an
>>> embedding into existing datetime libraries. The real question here is
>>> whether we should limit ourselves to just these 10000 years AD, or
>>> find value in being able to process events for the lifetime of the
>>> universe (or, at least, recorded human history). Artificially limiting
>>> in this way would seem surprising to me at least.
>>>
>>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org>
>>> wrote:
>>> >
>>> > The max timestamp, min timestamp, and end of the global window are all
>>> performance hacks in my view. Timestamps in beam are really a tagged union:
>>> >
>>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>>> quantitative timestamp ...)
>>> >
>>> > with the ordering
>>> >
>>> >     min < actual_time(...) < end_of_global < max
>>> >
>>> > We chose arbitrary numbers so that we could do simple numeric
>>> comparisons and arithmetic.
>>> >
>>> > Kenn
>>> >
>>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>>> >>
>>> >> While crites@ was investigating using protobuf to represent Apache
>>> Beam timestamps within the TestStreamEvents, he found out that the well
>>> known type google.protobuf.Timestamp doesn't support certain timestamps we
>>> were using in our tests (specifically the max timestamp that Apache Beam
>>> supports).
>>> >>
>>> >> This lead me to investigate and the well known type
>>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>>> range that Apache Beam currently supports -9223372036854775ms to
>>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>>> find a time range that represented this).
>>> >>
>>> >> Similarly the google.protobuf.Duration represents any time range over
>>> those ~10000 years. Google decided to limit their range to be compatible
>>> with the RFC 3339[2] standard to which does simplify many things since it
>>> guarantees that all RFC 3339 time parsing/manipulation libraries are
>>> supported.
>>> >>
>>> >> Should we:
>>> >> A) define our own timestamp/duration types to be able to represent
>>> the full time range that Apache Beam can express?
>>> >> B) limit the valid timestamps in Apache Beam to some standard such as
>>> RFC 3339?
>>> >>
>>> >> This discussion is somewhat related to the efforts to support nano
>>> timestamps[2].
>>> >>
>>> >> 1: https://tools.ietf.org/html/rfc3339
>>> >> 2:
>>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>>
>>

Re: Date/Time Ranges & Protobuf

Posted by Luke Cwik <lc...@google.com>.
I do agree that Apache Beam can represent dates and times with arbitrary
precision and can do so in many different ways.

My argument has always been about whether we should restrict this range to
a common standard to increase interoperability with other systems. For
example, SQL database servers vary in the date/time ranges they
support:
* Oracle 10[1]: 0001-01-01 to 9999-12-31
* Oracle 11g[2]: Julian era, ranging from January 1, 4712 BCE through
December 31, 9999 CE (Common Era, or 'AD', unless BCE ('BC' in the format
mask) is used)
* MySQL[3]: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
* Microsoft SQL: January 1, 1753, through December 31, 9999 for
datetime[4] and January 1, 1 CE through December 31, 9999 CE for datetime2[5]

Because the global window commonly contains timestamps that fall before and
after all of the supported ranges above, our users can't represent a global
window within a database using its common data types.

1: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefdttlimits.html
2:
https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT413
3: https://dev.mysql.com/doc/refman/8.0/en/datetime.html
4:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime-transact-sql?view=sql-server-ver15
5:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
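
To make the mismatch concrete, a small sketch (not Beam code; the millisecond
constant is the Beam max timestamp quoted earlier in this thread) showing that
Beam's range overflows a datetime library capped at year 9999, the same cap
that RFC 3339-style types carry:

from datetime import datetime, timedelta, timezone

BEAM_MAX_MILLIS = 9223372036854775  # Beam's max timestamp in milliseconds

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
try:
    epoch + timedelta(milliseconds=BEAM_MAX_MILLIS)
except OverflowError:
    print("Beam's max timestamp is not representable before year 10000")

# The RFC 3339-style cap ends at 9999-12-31T23:59:59Z.
rfc3339_max = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
print(BEAM_MAX_MILLIS // 1000 > int((rfc3339_max - epoch).total_seconds()))  # True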

On Wed, Nov 13, 2019 at 3:28 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> just an idea on these related topics that appear these days - it might
> help to realize, that what we actually don't need a full arithmetic on
> timestamps (Beam model IMHO doesn't need to know exactly what is the exact
> difference of two events). What we actually need is a slightly simplified
> algebra. Given two timestamps T1 and T2 and a "duration" (a different type
> from timestamp), we need operations (not 100% sure that this is exhaustive,
> but seems to be):
>
>  - is_preceding(T1, T2): bool
>
>    - important !is_preceding(T1, T2) does NOT imply that is_preceding(T2,
> T1) - !is_preceding(T1, T2) && !is_preceding(T2, T1) would mean events are
> _concurrent_
>
>    - this relation has to be also antisymmetric
>
>    - given this function we can construct a comparator, where multiple
> distinct timestamps can be "equal" (or with no particular ordering, which
> is natural property of time)
>
>  - min_timestamp_following(T1, duration): T2
>
>    - that would return a timestamp for which is_preceding(T1 + duration,
> T2) would return true and no other timestamp X would exist for which
> is_preceding(T1 + duration, X) && is_preceding(X, T2) would be true
>
>    - actually, this function would serve as the definition for the
> duration object
>
> If we can supply this algebra, it seems that we can use any representation
> of timestamps and intervals. It might be (probably) even possible to let
> user specify his own type used as timestamps and durations, which could
> solve the issues of not currently being able to correctly represent
> timestamps lower than Long.MIN_VALUE (although we can get data for that low
> timestamps - cosmic microwave background being one example :)). Specifying
> this algebra actually probably boils down to proposal (3) in Robert's
> thread [1].
>
> Just my 2 cents.
>
> Jan
>
> [1]
> https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
> On 11/13/19 10:11 AM, jincheng sun wrote:
>
> Thanks for bringing up this discussion @Luke.
>
> As @Kenn mentioned, in Beam we have defined the constants value for the
> min/max/end of global window. I noticed that
> google.protobuf.Timestamp/Duration is only used in window definitions,
> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>
> I think that both RFC 3339 and Beam's current implementation are big
> enough to express a common window definitions. But users can really
> define a window size that outside the scope of the RFC 3339. Conceptually,
> we should not limit the time range for window(although I think the range of
> RPC 3339 is big enough in most cases).
>
> To ensure that people well know the background of the discussion, hope you
> don't mind that I put the original conversion thread[1] here.
>
> Best,
> Jincheng
>
> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>
> Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:
>
>> I agree about it being a tagged union in the model (together with
>> actual_time(...) - epsilon). It's not just a performance hack though,
>> it's also (as discussed elsewhere) a question of being able to find an
>> embedding into existing datetime libraries. The real question here is
>> whether we should limit ourselves to just these 10000 years AD, or
>> find value in being able to process events for the lifetime of the
>> universe (or, at least, recorded human history). Artificially limiting
>> in this way would seem surprising to me at least.
>>
>> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org> wrote:
>> >
>> > The max timestamp, min timestamp, and end of the global window are all
>> performance hacks in my view. Timestamps in beam are really a tagged union:
>> >
>> >     timestamp ::= min | max | end_of_global | actual_time(... some
>> quantitative timestamp ...)
>> >
>> > with the ordering
>> >
>> >     min < actual_time(...) < end_of_global < max
>> >
>> > We chose arbitrary numbers so that we could do simple numeric
>> comparisons and arithmetic.
>> >
>> > Kenn
>> >
>> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>> >>
>> >> While crites@ was investigating using protobuf to represent Apache
>> Beam timestamps within the TestStreamEvents, he found out that the well
>> known type google.protobuf.Timestamp doesn't support certain timestamps we
>> were using in our tests (specifically the max timestamp that Apache Beam
>> supports).
>> >>
>> >> This lead me to investigate and the well known type
>> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
>> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
>> range that Apache Beam currently supports -9223372036854775ms to
>> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
>> find a time range that represented this).
>> >>
>> >> Similarly the google.protobuf.Duration represents any time range over
>> those ~10000 years. Google decided to limit their range to be compatible
>> with the RFC 3339[2] standard to which does simplify many things since it
>> guarantees that all RFC 3339 time parsing/manipulation libraries are
>> supported.
>> >>
>> >> Should we:
>> >> A) define our own timestamp/duration types to be able to represent the
>> full time range that Apache Beam can express?
>> >> B) limit the valid timestamps in Apache Beam to some standard such as
>> RFC 3339?
>> >>
>> >> This discussion is somewhat related to the efforts to support nano
>> timestamps[2].
>> >>
>> >> 1: https://tools.ietf.org/html/rfc3339
>> >> 2:
>> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>>
>

Re: Date/Time Ranges & Protobuf

Posted by Jan Lukavský <je...@seznam.cz>.
Hi,

just an idea on these related topics that keep coming up these days - it
might help to realize that we don't actually need full arithmetic on
timestamps (the Beam model IMHO doesn't need to know the exact difference
between two events). What we actually need is a slightly simplified
algebra. Given two timestamps T1 and T2 and a "duration" (a different type
from timestamp), we need the following operations (not 100% sure this list
is exhaustive, but it seems to be):

  - is_preceding(T1, T2): bool

    - important: !is_preceding(T1, T2) does NOT imply is_preceding(T2, T1);
!is_preceding(T1, T2) && !is_preceding(T2, T1) would mean the events are
_concurrent_

    - this relation also has to be antisymmetric

    - given this function we can construct a comparator, where multiple
distinct timestamps can be "equal" (or have no particular ordering, which
is a natural property of time)

  - min_timestamp_following(T1, duration): T2

    - this would return a timestamp T2 for which is_preceding(T1 + duration,
T2) is true and no other timestamp X exists for which
is_preceding(T1 + duration, X) && is_preceding(X, T2) is true

    - actually, this function would serve as the definition of the
duration object

If we can supply this algebra, it seems that we can use any
representation of timestamps and intervals. It might (probably) even be
possible to let users specify their own type for timestamps and durations,
which could solve the issue of not currently being able to correctly
represent timestamps lower than Long.MIN_VALUE (although we can get data
with timestamps that low - the cosmic microwave background being one
example :)). Specifying this algebra probably boils down to proposal (3)
in Robert's thread [1].

Just my 2 cents.

Jan

[1] 
https://lists.apache.org/thread.html/1672898393cb0d54a77a879be0fb5725902289a3e5063d0f9ec36fe1@%3Cdev.beam.apache.org%3E
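
As a rough illustration of the algebra above (a sketch only, with made-up
names, not Beam code), the two operations plus a comparator derived from
is_preceding could look like this:

from abc import ABC, abstractmethod
from functools import cmp_to_key

class TimestampAlgebra(ABC):
    @abstractmethod
    def is_preceding(self, t1, t2) -> bool:
        """True iff t1 is strictly before t2; neither direction means concurrent."""

    @abstractmethod
    def min_timestamp_following(self, t1, duration):
        """Smallest t2 such that is_preceding(t1 + duration, t2) holds."""

    def comparator(self):
        # Distinct timestamps may compare as "equal" (i.e. concurrent).
        def cmp(a, b):
            if self.is_preceding(a, b):
                return -1
            if self.is_preceding(b, a):
                return 1
            return 0
        return cmp_to_key(cmp)

class MillisAlgebra(TimestampAlgebra):
    # The trivial instance over int64 milliseconds, matching Beam's current choice.
    def is_preceding(self, t1, t2):
        return t1 < t2

    def min_timestamp_following(self, t1, duration):
        return t1 + duration + 1

print(sorted([30, 10, 20], key=MillisAlgebra().comparator()))  # [10, 20, 30]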

On 11/13/19 10:11 AM, jincheng sun wrote:
> Thanks for bringing up this discussion @Luke.
>
> As @Kenn mentioned, in Beam we have defined the constants value for 
> the min/max/end of global window. I noticed that 
> google.protobuf.Timestamp/Duration is only used in window definitions, 
> such as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.
>
> I think that both RFC 3339 and Beam's current implementation are big 
> enough to express a common window definitions. But users can really 
> define a window size that outside the scope of the RFC 3339. 
> Conceptually, we should not limit the time range for window(although I 
> think the range of RPC 3339 is big enough in most cases).
>
> To ensure that people well know the background of the discussion, hope 
> you don't mind that I put the original conversion thread[1] here.
>
> Best,
> Jincheng
>
> [1] https://github.com/apache/beam/pull/10041#discussion_r344380809
>
> Robert Bradshaw <robertwb@google.com <ma...@google.com>> 
> 于2019年11月12日周二 下午4:09写道:
>
>     I agree about it being a tagged union in the model (together with
>     actual_time(...) - epsilon). It's not just a performance hack though,
>     it's also (as discussed elsewhere) a question of being able to find an
>     embedding into existing datetime libraries. The real question here is
>     whether we should limit ourselves to just these 10000 years AD, or
>     find value in being able to process events for the lifetime of the
>     universe (or, at least, recorded human history). Artificially limiting
>     in this way would seem surprising to me at least.
>
>     On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <kenn@apache.org
>     <ma...@apache.org>> wrote:
>     >
>     > The max timestamp, min timestamp, and end of the global window
>     are all performance hacks in my view. Timestamps in beam are
>     really a tagged union:
>     >
>     >     timestamp ::= min | max | end_of_global | actual_time(...
>     some quantitative timestamp ...)
>     >
>     > with the ordering
>     >
>     >     min < actual_time(...) < end_of_global < max
>     >
>     > We chose arbitrary numbers so that we could do simple numeric
>     comparisons and arithmetic.
>     >
>     > Kenn
>     >
>     > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lcwik@google.com
>     <ma...@google.com>> wrote:
>     >>
>     >> While crites@ was investigating using protobuf to represent
>     Apache Beam timestamps within the TestStreamEvents, he found out
>     that the well known type google.protobuf.Timestamp doesn't support
>     certain timestamps we were using in our tests (specifically the
>     max timestamp that Apache Beam supports).
>     >>
>     >> This lead me to investigate and the well known type
>     google.protobuf.Timestamp supports dates/times from
>     0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z which is
>     much smaller than the timestamp range that Apache Beam currently
>     supports -9223372036854775ms to 9223372036854775ms which is about
>     292277BC to 294247AD (it was difficult to find a time range that
>     represented this).
>     >>
>     >> Similarly the google.protobuf.Duration represents any time
>     range over those ~10000 years. Google decided to limit their range
>     to be compatible with the RFC 3339[2] standard to which does
>     simplify many things since it guarantees that all RFC 3339 time
>     parsing/manipulation libraries are supported.
>     >>
>     >> Should we:
>     >> A) define our own timestamp/duration types to be able to
>     represent the full time range that Apache Beam can express?
>     >> B) limit the valid timestamps in Apache Beam to some standard
>     such as RFC 3339?
>     >>
>     >> This discussion is somewhat related to the efforts to support
>     nano timestamps[2].
>     >>
>     >> 1: https://tools.ietf.org/html/rfc3339
>     >> 2:
>     https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>

Re: Date/Time Ranges & Protobuf

Posted by jincheng sun <su...@gmail.com>.
Thanks for bringing up this discussion @Luke.

As @Kenn mentioned, in Beam we have defined constant values for the
min/max/end of the global window. I noticed that
google.protobuf.Timestamp/Duration are only used in window definitions, such
as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.

I think that both RFC 3339 and Beam's current implementation are big enough
to express common window definitions, but users can define a
window size that is outside the scope of RFC 3339. Conceptually, we should
not limit the time range for windows (although I think the range of RFC 3339
is big enough in most cases).
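
For what it is worth, the raw int64 seconds field of the well-known Duration
message can physically hold such a value; it is only the documented
RFC 3339-compatible range of the type that is exceeded. A small sketch (not
Beam code) holding roughly Beam's full span in it:

from google.protobuf import duration_pb2

# ~292 millennia expressed as seconds/nanos; the field accepts it even though
# it is far outside the ~10000-year range documented for the well-known type.
span = duration_pb2.Duration(seconds=9223372036854, nanos=775000000)
print(span.seconds, span.nanos)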

To make sure people know the background of the discussion, I hope you
don't mind that I put the original conversation thread[1] here.

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10041#discussion_r344380809

Robert Bradshaw <ro...@google.com> 于2019年11月12日周二 下午4:09写道:

> I agree about it being a tagged union in the model (together with
> actual_time(...) - epsilon). It's not just a performance hack though,
> it's also (as discussed elsewhere) a question of being able to find an
> embedding into existing datetime libraries. The real question here is
> whether we should limit ourselves to just these 10000 years AD, or
> find value in being able to process events for the lifetime of the
> universe (or, at least, recorded human history). Artificially limiting
> in this way would seem surprising to me at least.
>
> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org> wrote:
> >
> > The max timestamp, min timestamp, and end of the global window are all
> performance hacks in my view. Timestamps in beam are really a tagged union:
> >
> >     timestamp ::= min | max | end_of_global | actual_time(... some
> quantitative timestamp ...)
> >
> > with the ordering
> >
> >     min < actual_time(...) < end_of_global < max
> >
> > We chose arbitrary numbers so that we could do simple numeric
> comparisons and arithmetic.
> >
> > Kenn
> >
> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
> >>
> >> While crites@ was investigating using protobuf to represent Apache
> Beam timestamps within the TestStreamEvents, he found out that the well
> known type google.protobuf.Timestamp doesn't support certain timestamps we
> were using in our tests (specifically the max timestamp that Apache Beam
> supports).
> >>
> >> This lead me to investigate and the well known type
> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
> range that Apache Beam currently supports -9223372036854775ms to
> 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to
> find a time range that represented this).
> >>
> >> Similarly the google.protobuf.Duration represents any time range over
> those ~10000 years. Google decided to limit their range to be compatible
> with the RFC 3339[2] standard to which does simplify many things since it
> guarantees that all RFC 3339 time parsing/manipulation libraries are
> supported.
> >>
> >> Should we:
> >> A) define our own timestamp/duration types to be able to represent the
> full time range that Apache Beam can express?
> >> B) limit the valid timestamps in Apache Beam to some standard such as
> RFC 3339?
> >>
> >> This discussion is somewhat related to the efforts to support nano
> timestamps[2].
> >>
> >> 1: https://tools.ietf.org/html/rfc3339
> >> 2:
> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>

Re: Date/Time Ranges & Protobuf

Posted by Robert Bradshaw <ro...@google.com>.
I agree about it being a tagged union in the model (together with
actual_time(...) - epsilon). It's not just a performance hack, though;
it's also (as discussed elsewhere) a question of being able to find an
embedding into existing datetime libraries. The real question here is
whether we should limit ourselves to just these 10000 years AD, or
find value in being able to process events for the lifetime of the
universe (or, at least, recorded human history). Artificially limiting
in this way would seem surprising to me at least.

On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles <ke...@apache.org> wrote:
>
> The max timestamp, min timestamp, and end of the global window are all performance hacks in my view. Timestamps in beam are really a tagged union:
>
>     timestamp ::= min | max | end_of_global | actual_time(... some quantitative timestamp ...)
>
> with the ordering
>
>     min < actual_time(...) < end_of_global < max
>
> We chose arbitrary numbers so that we could do simple numeric comparisons and arithmetic.
>
> Kenn
>
> On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:
>>
>> While crites@ was investigating using protobuf to represent Apache Beam timestamps within the TestStreamEvents, he found out that the well known type google.protobuf.Timestamp doesn't support certain timestamps we were using in our tests (specifically the max timestamp that Apache Beam supports).
>>
>> This lead me to investigate and the well known type google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp range that Apache Beam currently supports -9223372036854775ms to 9223372036854775ms which is about 292277BC to 294247AD (it was difficult to find a time range that represented this).
>>
>> Similarly the google.protobuf.Duration represents any time range over those ~10000 years. Google decided to limit their range to be compatible with the RFC 3339[2] standard to which does simplify many things since it guarantees that all RFC 3339 time parsing/manipulation libraries are supported.
>>
>> Should we:
>> A) define our own timestamp/duration types to be able to represent the full time range that Apache Beam can express?
>> B) limit the valid timestamps in Apache Beam to some standard such as RFC 3339?
>>
>> This discussion is somewhat related to the efforts to support nano timestamps[2].
>>
>> 1: https://tools.ietf.org/html/rfc3339
>> 2: https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E

Re: Date/Time Ranges & Protobuf

Posted by Kenneth Knowles <ke...@apache.org>.
The max timestamp, min timestamp, and end of the global window are all
performance hacks in my view. Timestamps in Beam are really a tagged union:

    timestamp ::= min | max | end_of_global | actual_time(... some
quantitative timestamp ...)

with the ordering

    min < actual_time(...) < end_of_global < max

We chose arbitrary numbers so that we could do simple numeric comparisons
and arithmetic.
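
For illustration only (this is not Beam's implementation, and the names below
are made up), that tagged union and its ordering could be sketched as:

from functools import total_ordering

@total_ordering
class ModelTimestamp:
    _RANK = {"min": 0, "actual": 1, "end_of_global": 2, "max": 3}

    def __init__(self, tag, millis=0):
        self.tag = tag        # "min" | "actual" | "end_of_global" | "max"
        self.millis = millis  # quantitative part, meaningful only for "actual"

    def _key(self):
        return (self._RANK[self.tag], self.millis if self.tag == "actual" else 0)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

MIN = ModelTimestamp("min")
END_OF_GLOBAL = ModelTimestamp("end_of_global")
MAX = ModelTimestamp("max")
print(MIN < ModelTimestamp("actual", 123) < END_OF_GLOBAL < MAX)  # True

Picking concrete int64 values for min/max/end_of_global collapses this union
into plain numeric comparisons, which is the performance hack described above.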

Kenn

On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik <lc...@google.com> wrote:

> While crites@ was investigating using protobuf to represent Apache Beam
> timestamps within the TestStreamEvents, he found out that the well known
> type google.protobuf.Timestamp doesn't support certain timestamps we were
> using in our tests (specifically the max timestamp that Apache Beam
> supports).
>
> This lead me to investigate and the well known type
> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
> 9999-12-31T23:59:59.999999999Z which is much smaller than the timestamp
> range that Apache Beam currently supports -9223372036854775ms
> to 9223372036854775ms which is about 292277BC to 294247AD (it was difficult
> to find a time range that represented this).
>
> Similarly the google.protobuf.Duration represents any time range over
> those ~10000 years. Google decided to limit their range to be compatible
> with the RFC 3339[2] standard to which does simplify many things since it
> guarantees that all RFC 3339 time parsing/manipulation libraries are
> supported.
>
> Should we:
> A) define our own timestamp/duration types to be able to represent the
> full time range that Apache Beam can express?
> B) limit the valid timestamps in Apache Beam to some standard such as RFC
> 3339?
>
> This discussion is somewhat related to the efforts to support nano
> timestamps[2].
>
> 1: https://tools.ietf.org/html/rfc3339
> 2:
> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>