You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2021/06/02 11:56:36 UTC

[Format] Timestamp timezone semantics?

Hello,

For the first time I notice this piece of information about the 
timestamp type:

>   /// * If the time zone is set to a valid value, values can be displayed as
>   ///   "localized" to that time zone, even though the underlying 64-bit
>   ///   integers are identical to the same data stored in UTC. Converting
>   ///   between time zones is a metadata-only operation and does not change the
>   ///   underlying values

(from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )

This seems rather weird to me: timestamps always convey a UTC timestamp 
value, optionally decorated with a local timezone?  What is the 
motivation for such a representation?  It is unlike other systems such 
as Python, where a timezone-aware timestamp really expresses a local 
time value, not a UTC time value.

Thank you,

Antoine.

Re: [Format] Timestamp timezone semantics?

Posted by Adam Hooper <ad...@adamhooper.com>.
On Wed, Jun 2, 2021 at 7:56 AM Antoine Pitrou <an...@python.org> wrote:

>
> This seems rather weird to me: timestamps always convey a UTC timestamp
> value, optionally decorated with a local timezone?  What is the
> motivation for such a representation?  It is unlike other systems such
> as Python


It's standard. I think the motivation is: "local timestamps are the worst
things in computing." (Oh no, here comes a rant!)

SQL gets timestamps completely wrong. MySQL, PostgreSQL, MSSQL and Oracle
all use similar words ("timestamp", "datetime", etc.) to mean different
things. Depending on the RDBMS, you need to think in five timezones --
server timezone, client timezone, database timezone, database-column
timezone and cell timezone. The syntax and semantics are different in all
database engines. (Personally, I always wince at Postgres' "TIMESTAMP WITH
TIMEZONE": it's the best practice because *it doesn't store a timezone*.
All RDBMSs are similarly absurd; props to MySQL for being slightly less
nonsensical than the rest.)

Python is based on C, and C has an obsession with "local time". What an
awful relic. Python `datetime` deals in wildly-inefficient 9-tuples, not
integers; and it happily stores and represents nonexistent times such as
`datetime.datetime(2018, 3, 11, 2, 30,
tzinfo=zoneinfo.ZoneInfo(key='US/Eastern'))`. Python's `time` module gets
you into C-land and integers; there, timezone-aware math only works in the
"local timezone", a global variable read from os.environ["TZ"] and cached
elsewhere in the module.

Local times are *hard to compare* (they jump around daylight savings);
they're *hard to validate* (some don't exist, others are ambiguous); and
they *cannot store future times* (future timezones are yet to be decreed by
politicians).

Don't follow in C or SQL's footsteps. Store timestamps as integers UTC
timestamps. Store timezone somewhere else; use it to convert to local
timezone when formatting and to convert to calendar for calendar math.

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format] Timestamp timezone semantics?

Posted by Antoine Pitrou <an...@python.org>.
Le 02/06/2021 à 14:58, Joris Van den Bossche a écrit :
> On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> Hello,
>>
>> For the first time I notice this piece of information about the
>> timestamp type:
>>
>>>    /// * If the time zone is set to a valid value, values can be
>> displayed as
>>>    ///   "localized" to that time zone, even though the underlying 64-bit
>>>    ///   integers are identical to the same data stored in UTC. Converting
>>>    ///   between time zones is a metadata-only operation and does not
>> change the
>>>    ///   underlying values
>>
>> (from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )
>>
>> This seems rather weird to me: timestamps always convey a UTC timestamp
>> value, optionally decorated with a local timezone?  What is the
>> motivation for such a representation?  It is unlike other systems such
>> as Python, where a timezone-aware timestamp really expresses a local
>> time value, not a UTC time value.
>>
> 
> Just as reference: pandas uses the same model of storing UTC timestamps for
> timezone-aware data (I think numpy also stored it as UTC, before they
> removed support for it). And for example, I think also databases like
> Postgresql store it as UTC internally, AFAIK.
> The Python standard library datetime.datetime indeed stores localized
> timestamps. But important difference is that Python actually stores the
> year/month/day/hour/etc as separate values, so directly representing an
> actual moment in time in a certain timezone. While I think what we store is
> considered as "unix time"? (epoch since January 1st, 1970 at UTC) I am not
> sure how you would store a timestamp in a certain timezone in this model.

Ah, my bad. I was under the (apparently mistaken) impression that Arrow 
was the exception here.

Regards

Antoine.

Re: [Format] Timestamp timezone semantics?

Posted by Wes McKinney <we...@gmail.com>.
Sorry, I definitely did NOT mean "Python functions treat a naive
timestamp as if it were a
UTC timestamp."

I am referring to the relationship between the behavior of attribute
accessors like "hour" or "day" and the representation of the data.
datetime.datetime.hour returns the same thing for the same timestamp
ordinal value when there is no time zone wherever you are in the
world. I do not mean that tz-naive data is intended to be casted
implicitly to UTC in operations with tz-aware data, that's crazy talk
=) Operations between tz-naive and tz-aware data are not permitted
without explicit casts / localization.

On Fri, Jun 4, 2021 at 2:18 PM Weston Pace <we...@gmail.com> wrote:
>
> > We are recommending that the behavior of
> > these functions should consistently have the UTC interpretation of the
> > value rather than using the system locale. This is what Python does
> > with "tz-naive" datetime.datetime objects
>
> This is not quite true, although perhaps my reading is incorrect.  I
> read that as "Python functions treat a naive timestamp as if it were a
> UTC timestamp."  Python does not treat a naive timestamp the same as a
> UTC timestamp.  And I think this is the heart of what Julilan's point
> is (which I agree with).  For example, consider this snippet:
>
> >>> import datetime
> >>> import pytz
> >>> x = datetime.datetime.now()
> >>> y = pytz.utc.localize(x)
> >>> x - y
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: can't subtract offset-naive and offset-aware datetimes
>
> x is not assumed to be UTC (If it were I would get
> datetime.timedelta(0) instead of an exception).  Another example:
>
> >>> x.isoformat()
> '2021-06-04T09:09:18.304640'
> >>> y.isoformat()
> '2021-06-04T09:09:18.304640+00:00'
>
> On Fri, Jun 4, 2021 at 7:46 AM Julian Hyde <jh...@gmail.com> wrote:
> >
> > The learning there is: library software shouldn’t use anything from its environment (time zone, locale, encoding, endianness). Functions that use time zone should always have a time zone parameter.
> >
> > Once you take that step, the functions that work with zoneless timestamps start to look different to functions that work with local timestamps, and you start to realize that they should be separate data types.
> >
> > > On Jun 3, 2021, at 12:26 PM, Wes McKinney <we...@gmail.com> wrote:
> > >
> > > Arrow's decision was not to permit storage of timestamps with
> > > "localized" representation (which is distinct from UTC internal
> > > representation with a different time zone set). The problem really
> > > comes down to the interpretation of "time zone naive" timestamps on
> > > different systems: operations in my opinion should not yield different
> > > results depending on the particular locale of the system where the
> > > operations are being run.
> > >
> > > date on my Linux system returns 1622748048, which is 19:21 UTC. If you
> > > encounter 1622748048 without any given time zone, and want to
> > > interpret 1622748048 as CDT (US/Central where I live), then Arrow is
> > > asking you to localize that timestamp to the UTC representation of
> > > 19:21 CDT, which is 7 hours later, so you need to add 7 hours of
> > > seconds to the timestamp to adjust it to UTC.
> > >
> > > In some systems, if you encounter 1622748048 without time zone
> > > indicated, the behavior of timestamp_day() or timestamp_hour() will
> > > depend on the system locale. We are recommending that the behavior of
> > > these functions should consistently have the UTC interpretation of the
> > > value rather than using the system locale. This is what Python does
> > > with "tz-naive" datetime.datetime objects — if you call access
> > > datetime.hour on a timezone-less datetime.datetime, it will return the
> > > same result no matter where in the world you are.
> > >
> > > On Thu, Jun 3, 2021 at 1:19 PM Julian Hyde <jh...@gmail.com> wrote:
> > >>
> > >> It seems that Arrow’s timestamp type can either have no time zone or be UTC. I think that is a flawed design, because doesn’t catch user errors.
> > >>
> > >> Suppose you want to find the number of milliseconds between two timestamps. If the first has a timezone and the second is implicitly UTC, then you can convert them both to instants and subtract. But if the first has a timezone and the second has no time zone, you must supply a time zone for the second. So, the subtraction function will have a different signature.
> > >>
> > >> There are many similar operations, where a time zone needs to be supplied, or where you cannot safely mix timestamps with different time zones.
> > >>
> > >> Julian
> > >>
> > >>
> > >>> On Jun 3, 2021, at 11:07 AM, Adam Hooper <ad...@adamhooper.com> wrote:
> > >>>
> > >>> On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:
> > >>>
> > >>>> I understand isAdjustedToUTC=true to mean "timestamp", and
> > >>>> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
> > >>>> docs because
> > >>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > >>>> lists a whole slew of potential meanings and without extra metadata I'll
> > >>>> never be able to figure out what this column means."
> > >>>>
> > >>>
> > >>> Correcting myself here: Parquet isAdjustedToUTC=false does have just one
> > >>> meaning. It means encoding a "(year, month, day, hour, minute, second,
> > >>> microsecond)" tuple as a single integer.
> > >>>
> > >>> Adam
> > >>>
> > >>> --
> > >>> Adam Hooper
> > >>> +1-514-882-9694
> > >>> http://adamhooper.com
> > >>
> >

Re: [Format] Timestamp timezone semantics?

Posted by Weston Pace <we...@gmail.com>.
> We are recommending that the behavior of
> these functions should consistently have the UTC interpretation of the
> value rather than using the system locale. This is what Python does
> with "tz-naive" datetime.datetime objects

This is not quite true, although perhaps my reading is incorrect.  I
read that as "Python functions treat a naive timestamp as if it were a
UTC timestamp."  Python does not treat a naive timestamp the same as a
UTC timestamp.  And I think this is the heart of what Julilan's point
is (which I agree with).  For example, consider this snippet:

>>> import datetime
>>> import pytz
>>> x = datetime.datetime.now()
>>> y = pytz.utc.localize(x)
>>> x - y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't subtract offset-naive and offset-aware datetimes

x is not assumed to be UTC (If it were I would get
datetime.timedelta(0) instead of an exception).  Another example:

>>> x.isoformat()
'2021-06-04T09:09:18.304640'
>>> y.isoformat()
'2021-06-04T09:09:18.304640+00:00'

On Fri, Jun 4, 2021 at 7:46 AM Julian Hyde <jh...@gmail.com> wrote:
>
> The learning there is: library software shouldn’t use anything from its environment (time zone, locale, encoding, endianness). Functions that use time zone should always have a time zone parameter.
>
> Once you take that step, the functions that work with zoneless timestamps start to look different to functions that work with local timestamps, and you start to realize that they should be separate data types.
>
> > On Jun 3, 2021, at 12:26 PM, Wes McKinney <we...@gmail.com> wrote:
> >
> > Arrow's decision was not to permit storage of timestamps with
> > "localized" representation (which is distinct from UTC internal
> > representation with a different time zone set). The problem really
> > comes down to the interpretation of "time zone naive" timestamps on
> > different systems: operations in my opinion should not yield different
> > results depending on the particular locale of the system where the
> > operations are being run.
> >
> > date on my Linux system returns 1622748048, which is 19:21 UTC. If you
> > encounter 1622748048 without any given time zone, and want to
> > interpret 1622748048 as CDT (US/Central where I live), then Arrow is
> > asking you to localize that timestamp to the UTC representation of
> > 19:21 CDT, which is 7 hours later, so you need to add 7 hours of
> > seconds to the timestamp to adjust it to UTC.
> >
> > In some systems, if you encounter 1622748048 without time zone
> > indicated, the behavior of timestamp_day() or timestamp_hour() will
> > depend on the system locale. We are recommending that the behavior of
> > these functions should consistently have the UTC interpretation of the
> > value rather than using the system locale. This is what Python does
> > with "tz-naive" datetime.datetime objects — if you call access
> > datetime.hour on a timezone-less datetime.datetime, it will return the
> > same result no matter where in the world you are.
> >
> > On Thu, Jun 3, 2021 at 1:19 PM Julian Hyde <jh...@gmail.com> wrote:
> >>
> >> It seems that Arrow’s timestamp type can either have no time zone or be UTC. I think that is a flawed design, because doesn’t catch user errors.
> >>
> >> Suppose you want to find the number of milliseconds between two timestamps. If the first has a timezone and the second is implicitly UTC, then you can convert them both to instants and subtract. But if the first has a timezone and the second has no time zone, you must supply a time zone for the second. So, the subtraction function will have a different signature.
> >>
> >> There are many similar operations, where a time zone needs to be supplied, or where you cannot safely mix timestamps with different time zones.
> >>
> >> Julian
> >>
> >>
> >>> On Jun 3, 2021, at 11:07 AM, Adam Hooper <ad...@adamhooper.com> wrote:
> >>>
> >>> On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:
> >>>
> >>>> I understand isAdjustedToUTC=true to mean "timestamp", and
> >>>> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
> >>>> docs because
> >>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> >>>> lists a whole slew of potential meanings and without extra metadata I'll
> >>>> never be able to figure out what this column means."
> >>>>
> >>>
> >>> Correcting myself here: Parquet isAdjustedToUTC=false does have just one
> >>> meaning. It means encoding a "(year, month, day, hour, minute, second,
> >>> microsecond)" tuple as a single integer.
> >>>
> >>> Adam
> >>>
> >>> --
> >>> Adam Hooper
> >>> +1-514-882-9694
> >>> http://adamhooper.com
> >>
>

Re: [Format] Timestamp timezone semantics?

Posted by Julian Hyde <jh...@gmail.com>.
The learning there is: library software shouldn’t use anything from its environment (time zone, locale, encoding, endianness). Functions that use time zone should always have a time zone parameter.

Once you take that step, the functions that work with zoneless timestamps start to look different to functions that work with local timestamps, and you start to realize that they should be separate data types.

> On Jun 3, 2021, at 12:26 PM, Wes McKinney <we...@gmail.com> wrote:
> 
> Arrow's decision was not to permit storage of timestamps with
> "localized" representation (which is distinct from UTC internal
> representation with a different time zone set). The problem really
> comes down to the interpretation of "time zone naive" timestamps on
> different systems: operations in my opinion should not yield different
> results depending on the particular locale of the system where the
> operations are being run.
> 
> date on my Linux system returns 1622748048, which is 19:21 UTC. If you
> encounter 1622748048 without any given time zone, and want to
> interpret 1622748048 as CDT (US/Central where I live), then Arrow is
> asking you to localize that timestamp to the UTC representation of
> 19:21 CDT, which is 7 hours later, so you need to add 7 hours of
> seconds to the timestamp to adjust it to UTC.
> 
> In some systems, if you encounter 1622748048 without time zone
> indicated, the behavior of timestamp_day() or timestamp_hour() will
> depend on the system locale. We are recommending that the behavior of
> these functions should consistently have the UTC interpretation of the
> value rather than using the system locale. This is what Python does
> with "tz-naive" datetime.datetime objects — if you call access
> datetime.hour on a timezone-less datetime.datetime, it will return the
> same result no matter where in the world you are.
> 
> On Thu, Jun 3, 2021 at 1:19 PM Julian Hyde <jh...@gmail.com> wrote:
>> 
>> It seems that Arrow’s timestamp type can either have no time zone or be UTC. I think that is a flawed design, because doesn’t catch user errors.
>> 
>> Suppose you want to find the number of milliseconds between two timestamps. If the first has a timezone and the second is implicitly UTC, then you can convert them both to instants and subtract. But if the first has a timezone and the second has no time zone, you must supply a time zone for the second. So, the subtraction function will have a different signature.
>> 
>> There are many similar operations, where a time zone needs to be supplied, or where you cannot safely mix timestamps with different time zones.
>> 
>> Julian
>> 
>> 
>>> On Jun 3, 2021, at 11:07 AM, Adam Hooper <ad...@adamhooper.com> wrote:
>>> 
>>> On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:
>>> 
>>>> I understand isAdjustedToUTC=true to mean "timestamp", and
>>>> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
>>>> docs because
>>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
>>>> lists a whole slew of potential meanings and without extra metadata I'll
>>>> never be able to figure out what this column means."
>>>> 
>>> 
>>> Correcting myself here: Parquet isAdjustedToUTC=false does have just one
>>> meaning. It means encoding a "(year, month, day, hour, minute, second,
>>> microsecond)" tuple as a single integer.
>>> 
>>> Adam
>>> 
>>> --
>>> Adam Hooper
>>> +1-514-882-9694
>>> http://adamhooper.com
>> 


Re: [Format] Timestamp timezone semantics?

Posted by Wes McKinney <we...@gmail.com>.
Arrow's decision was not to permit storage of timestamps with
"localized" representation (which is distinct from UTC internal
representation with a different time zone set). The problem really
comes down to the interpretation of "time zone naive" timestamps on
different systems: operations in my opinion should not yield different
results depending on the particular locale of the system where the
operations are being run.

date on my Linux system returns 1622748048, which is 19:21 UTC. If you
encounter 1622748048 without any given time zone, and want to
interpret 1622748048 as CDT (US/Central where I live), then Arrow is
asking you to localize that timestamp to the UTC representation of
19:21 CDT, which is 7 hours later, so you need to add 7 hours of
seconds to the timestamp to adjust it to UTC.

In some systems, if you encounter 1622748048 without time zone
indicated, the behavior of timestamp_day() or timestamp_hour() will
depend on the system locale. We are recommending that the behavior of
these functions should consistently have the UTC interpretation of the
value rather than using the system locale. This is what Python does
with "tz-naive" datetime.datetime objects — if you call access
datetime.hour on a timezone-less datetime.datetime, it will return the
same result no matter where in the world you are.

On Thu, Jun 3, 2021 at 1:19 PM Julian Hyde <jh...@gmail.com> wrote:
>
> It seems that Arrow’s timestamp type can either have no time zone or be UTC. I think that is a flawed design, because doesn’t catch user errors.
>
> Suppose you want to find the number of milliseconds between two timestamps. If the first has a timezone and the second is implicitly UTC, then you can convert them both to instants and subtract. But if the first has a timezone and the second has no time zone, you must supply a time zone for the second. So, the subtraction function will have a different signature.
>
> There are many similar operations, where a time zone needs to be supplied, or where you cannot safely mix timestamps with different time zones.
>
> Julian
>
>
> > On Jun 3, 2021, at 11:07 AM, Adam Hooper <ad...@adamhooper.com> wrote:
> >
> > On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:
> >
> >> I understand isAdjustedToUTC=true to mean "timestamp", and
> >> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
> >> docs because
> >> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> >> lists a whole slew of potential meanings and without extra metadata I'll
> >> never be able to figure out what this column means."
> >>
> >
> > Correcting myself here: Parquet isAdjustedToUTC=false does have just one
> > meaning. It means encoding a "(year, month, day, hour, minute, second,
> > microsecond)" tuple as a single integer.
> >
> > Adam
> >
> > --
> > Adam Hooper
> > +1-514-882-9694
> > http://adamhooper.com
>

Re: [Format] Timestamp timezone semantics?

Posted by Julian Hyde <jh...@gmail.com>.
It seems that Arrow’s timestamp type can either have no time zone or be UTC. I think that is a flawed design, because doesn’t catch user errors.

Suppose you want to find the number of milliseconds between two timestamps. If the first has a timezone and the second is implicitly UTC, then you can convert them both to instants and subtract. But if the first has a timezone and the second has no time zone, you must supply a time zone for the second. So, the subtraction function will have a different signature.

There are many similar operations, where a time zone needs to be supplied, or where you cannot safely mix timestamps with different time zones.

Julian


> On Jun 3, 2021, at 11:07 AM, Adam Hooper <ad...@adamhooper.com> wrote:
> 
> On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:
> 
>> I understand isAdjustedToUTC=true to mean "timestamp", and
>> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
>> docs because
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
>> lists a whole slew of potential meanings and without extra metadata I'll
>> never be able to figure out what this column means."
>> 
> 
> Correcting myself here: Parquet isAdjustedToUTC=false does have just one
> meaning. It means encoding a "(year, month, day, hour, minute, second,
> microsecond)" tuple as a single integer.
> 
> Adam
> 
> -- 
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com


Re: [Format] Timestamp timezone semantics?

Posted by Adam Hooper <ad...@adamhooper.com>.
On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper <ad...@adamhooper.com> wrote:

> I understand isAdjustedToUTC=true to mean "timestamp", and
> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
> docs because
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> lists a whole slew of potential meanings and without extra metadata I'll
> never be able to figure out what this column means."
>

Correcting myself here: Parquet isAdjustedToUTC=false does have just one
meaning. It means encoding a "(year, month, day, hour, minute, second,
microsecond)" tuple as a single integer.

Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format] Timestamp timezone semantics?

Posted by Adam Hooper <ad...@adamhooper.com>.
On Thu, Jun 3, 2021 at 1:17 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> That is my understanding as well, a timestamp either has a timezone or it
> has not. If it does not have a timezone, it should be presented as is and
> no assumptions can be made about its timezone. In particular, but given two
> fields X and Y, one with a timezone and another without, e.g. it is not
> meaningful to compute X - Y.
>

My understanding is, timestamp is always UTC. The full spec:

/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970,
excluding
/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
/// leap seconds.
///
/// The Timestamp metadata supports both "time zone naive" and "time zone
/// aware" timestamps. Read about the timezone attribute for more detail
table Timestamp {
  unit: TimeUnit;

  /// The time zone is a string indicating the name of a time zone, one of:
  ///
  /// * As used in the Olson time zone database (the "tz database" or
  ///   "tzdata"), such as "America/New_York"
  /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as
+07:30
  ///
  /// Whether a timezone string is present indicates different semantics
about
  /// the data:
  ///
  /// * If the time zone is null or equal to an empty string, the data is
"time
  ///   zone naive" and shall be displayed *as is* to the user, not
localized
  ///   to the locale of the user. This data can be though of as UTC but
  ///   without having "UTC" as the time zone, it is not considered to be
  ///   localized to any time zone
  ///
  /// * If the time zone is set to a valid value, values can be displayed as
  ///   "localized" to that time zone, even though the underlying 64-bit
  ///   integers are identical to the same data stored in UTC. Converting
  ///   between time zones is a metadata-only operation and does not change
the
  ///   underlying values
  timezone: string;
}

I *think*, from that description, that "naive timestamp" is UTC. I think
that because *the spec literally tells me I can think of it that way*. So
the way I see it, "X [timezone-naive] - Y [any timezone]" is a valid
computation.

I know that "can be thought of" isn't very strong wording. But there's only
one UNIX epoch, so I don't think the spec will let me think of it as
anything else :).

Good thing, too. Because storing a datetime is very different from storing
a timestamp. When you store a datetime (SQL-style), you have two timezones:

1. The timezone in which the datetime is *stored*
2. The timezone in which the datetime is *displayed*

It would be strange for the spec to imply, "displayed NOT NULL means stored
= UTC" and "displayed NULL means stored = NULL" because that would prevent
Arrow from handling the most important combination: "stored = UTC,
displayed = NULL".

TL;DR by my reading, "timezone" is merely for presentation, and its value
(or null-ness) doesn't alter the data at all.

Maybe a way of framing: given an uint64 X in parquet with
> "isAdjustedToUTC=true", which arrow's datatype and value Y should it
> correspond to? A timestamp with or without a timezone? I have so far
> understood that "isAdjustedToUTC=false" corresponds to no timezone in
> arrow, and "isAdjustedToUTC=true" corresponds to "+00:00". (But maybe this
> is incorrect?)
>

I understand isAdjustedToUTC=true to mean "timestamp", and
isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
docs because
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
lists a whole slew of potential meanings and without extra metadata I'll
never be able to figure out what this column means."

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format] Timestamp timezone semantics?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
That is my understanding as well, a timestamp either has a timezone or it
has not. If it does not have a timezone, it should be presented as is and
no assumptions can be made about its timezone. In particular, but given two
fields X and Y, one with a timezone and another without, e.g. it is not
meaningful to compute X - Y.

Maybe a way of framing: given an uint64 X in parquet with
"isAdjustedToUTC=true", which arrow's datatype and value Y should it
correspond to? A timestamp with or without a timezone? I have so far
understood that "isAdjustedToUTC=false" corresponds to no timezone in
arrow, and "isAdjustedToUTC=true" corresponds to "+00:00". (But maybe this
is incorrect?)

Best,
Jorge





On Thu, Jun 3, 2021 at 6:59 PM Julian Hyde <jh...@gmail.com> wrote:

> My answer to Antoine’s question would not be “kind of”, it would be “no”.
>
> In a system such as Joda-time, which I claim is the only system that Arrow
> should be considering, a timestamp-without-timezone does not have an
> implicit time zone of UTC. It has no time zone.
>
>
> > On Jun 3, 2021, at 8:52 AM, Micah Kornfield <em...@gmail.com>
> wrote:
> >
> >>
> >> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
> >> not local time)?
> >
> >
> > Kind of, the reason we went with this approach is this sentence from the
> > specification:
> >
> > "the data is "time zone naive" and shall be displayed *as is* to the
> user,
> > not localized to the locale of the user.",  which seemed like the closest
> > match for what Date times actually does.  Computationally this isn't
> really
> > ideal.
> >
> > -Micah
> >
> > On Thu, Jun 3, 2021 at 12:32 AM Antoine Pitrou <an...@python.org>
> wrote:
> >
> >>
> >> Le 02/06/2021 à 22:56, Micah Kornfield a écrit :
> >>>>
> >>>> Any SQL interface to Arrow should follow the SQL standard. So, for
> >>>> instance, if a column has TIMESTAMP type, it should behave as a
> >>>> date-time without a time-zone.
> >>>
> >>>
> >>> At least in bigquery we do the following mapping:
> >>> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> >>> SQL DATETIME -> Arrow Timestamp without a time-zone.
> >>
> >> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
> >> not local time)?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
>
>

Re: [Format] Timestamp timezone semantics?

Posted by Julian Hyde <jh...@gmail.com>.
My answer to Antoine’s question would not be “kind of”, it would be “no”.

In a system such as Joda-time, which I claim is the only system that Arrow should be considering, a timestamp-without-timezone does not have an implicit time zone of UTC. It has no time zone.


> On Jun 3, 2021, at 8:52 AM, Micah Kornfield <em...@gmail.com> wrote:
> 
>> 
>> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
>> not local time)?
> 
> 
> Kind of, the reason we went with this approach is this sentence from the
> specification:
> 
> "the data is "time zone naive" and shall be displayed *as is* to the user,
> not localized to the locale of the user.",  which seemed like the closest
> match for what Date times actually does.  Computationally this isn't really
> ideal.
> 
> -Micah
> 
> On Thu, Jun 3, 2021 at 12:32 AM Antoine Pitrou <an...@python.org> wrote:
> 
>> 
>> Le 02/06/2021 à 22:56, Micah Kornfield a écrit :
>>>> 
>>>> Any SQL interface to Arrow should follow the SQL standard. So, for
>>>> instance, if a column has TIMESTAMP type, it should behave as a
>>>> date-time without a time-zone.
>>> 
>>> 
>>> At least in bigquery we do the following mapping:
>>> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
>>> SQL DATETIME -> Arrow Timestamp without a time-zone.
>> 
>> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
>> not local time)?
>> 
>> Regards
>> 
>> Antoine.
>> 


Re: [Format] Timestamp timezone semantics?

Posted by Micah Kornfield <em...@gmail.com>.
>
> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
> not local time)?


Kind of, the reason we went with this approach is this sentence from the
specification:

"the data is "time zone naive" and shall be displayed *as is* to the user,
not localized to the locale of the user.",  which seemed like the closest
match for what Date times actually does.  Computationally this isn't really
ideal.

-Micah

On Thu, Jun 3, 2021 at 12:32 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 02/06/2021 à 22:56, Micah Kornfield a écrit :
> >>
> >> Any SQL interface to Arrow should follow the SQL standard. So, for
> >> instance, if a column has TIMESTAMP type, it should behave as a
> >> date-time without a time-zone.
> >
> >
> > At least in bigquery we do the following mapping:
> > SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> > SQL DATETIME -> Arrow Timestamp without a time-zone.
>
> Aren't those exactly the same (i.e. no timezone implicitly means UTC,
> not local time)?
>
> Regards
>
> Antoine.
>

Re: [Format] Timestamp timezone semantics?

Posted by Antoine Pitrou <an...@python.org>.
Le 02/06/2021 à 22:56, Micah Kornfield a écrit :
>>
>> Any SQL interface to Arrow should follow the SQL standard. So, for
>> instance, if a column has TIMESTAMP type, it should behave as a
>> date-time without a time-zone.
> 
> 
> At least in bigquery we do the following mapping:
> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> SQL DATETIME -> Arrow Timestamp without a time-zone.

Aren't those exactly the same (i.e. no timezone implicitly means UTC, 
not local time)?

Regards

Antoine.

Re: [Format] Timestamp timezone semantics?

Posted by Julian Hyde <jh...@gmail.com>.
> On Jun 2, 2021, at 1:56 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> 
> At least in bigquery we do the following mapping:
> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> SQL DATETIME -> Arrow Timestamp without a time-zone.

BigQuery was one of the systems I had in mind when I said "naming is a little inconsistent". BigQuery does have a type consistent with SQL-standard TIMESTAMP type but it’s called DATETIME. The TIMESTAMP type is something else.

I can literally count the number of hours and dollars that have been wasted because my colleagues assumed that BigQuery’s TIMESTAMP type would have the same semantics as TIMESTAMP in other databases.

Julian

Re: [Format] Timestamp timezone semantics?

Posted by Hongze Zhang <no...@126.com>.
On Wed, 2021-06-02 at 13:56 -0700, Micah Kornfield wrote:
> > 
> > Any SQL interface to Arrow should follow the SQL standard. So, for
> > instance, if a column has TIMESTAMP type, it should behave as a
> > date-time without a time-zone.
> 
> 
> At least in bigquery we do the following mapping:
> SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
> SQL DATETIME -> Arrow Timestamp without a time-zone.

It's likely that the Arrow Date32/Date64 type has a more interesting
definition that it represents "elapsed time since UNIX epoch". Does
that mean it's recommended to a user/developer to use Arrow Date64
rather than Arrow Timestamp to indicate an absolute point in time? If
so, based on which consideration Bigquery didn't choose Arrow Date64
for its TIMESTAMP type?

As represented in the SQL standard a "DATE" type should be as trivial
as a "TIMESTAMP WITHOUT TIMEZONE" despite the time-in-day part. If
"UNIX epoch" implies "UTC"/"GMT" undebatably, for example, Arrow Date32
will not be able to perfectly align with the SQL standard "DATE" which
is supposed to be localized. However if I am not wrong some of the
projects using Arrow already made the Date<->Date32 equivalence such as
Spark SQL[1][2]. I am not sure if it's a recommended practice but in
Arrow using the term "Date" for "UNIX epoch" does seem to be a bit of
counterintuitive to me.

Best,
Hongze

[1]
https://github.com/apache/spark/blob/2658bc590fec51e2266d03121c85b47f553022ec/sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L48
[2]
https://github.com/apache/spark/blob/2658bc590fec51e2266d03121c85b47f553022ec/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L417-L430


Re: [Format] Timestamp timezone semantics?

Posted by Micah Kornfield <em...@gmail.com>.
>
> Any SQL interface to Arrow should follow the SQL standard. So, for
> instance, if a column has TIMESTAMP type, it should behave as a
> date-time without a time-zone.


At least in bigquery we do the following mapping:
SQL TIMESTAMP -> Arrow Timestamp with "UTC" timezone
SQL DATETIME -> Arrow Timestamp without a time-zone.

On Wed, Jun 2, 2021 at 12:39 PM Julian Hyde <jh...@apache.org> wrote:

> Good time libraries support all. E.g. Jodatime [1] has
>
> * Instant - an instantaneous point on the time-line
> * DateTime - full date and time with time-zone
> * LocalDateTime - date-time without a time-zone
>
> The SQL world isn't quite as much of a mess as Adam makes it out to
> be. The SQL standard defines TIMESTAMP, DATE and TIME as zoneless
> (like Joda's LocalDateTime) and most DBs have types that behave in
> that way. Often those DBs also have types that behave like Instant and
> DateTime (but naming is a little inconsistent).
>
> I recommend that Arrow supports all three. Choose clear, distinct
> names for all three, consistent with names used elsewhere in the
> industry.
>
> Any SQL interface to Arrow should follow the SQL standard. So, for
> instance, if a column has TIMESTAMP type, it should behave as a
> date-time without a time-zone.
>
> Julian
>
> [1] https://www.joda.org/joda-time/
>
> On Wed, Jun 2, 2021 at 10:43 AM Rok Mihevc <ro...@gmail.com> wrote:
> >
> > On Wed, Jun 2, 2021 at 3:23 PM Joris Peeters <joris.mg.peeters@gmail.com
> >
> > wrote:
> >
> > > You could store epoch offsets, but interpret them in the local
> timezone.
> > > E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New
> York
> > > timezone.
> > > At least one nasty problem with that is ambiguous times, i.e. when the
> > > clock turns back on going from DST to ST, as well as invalid times
> (when
> > > the clock moves forwards, meaning some epoch offsets never occur).
> > >
> >
> > Another problem is calendars change (see Adam's points) so the offset
> would
> > not be constant.
>

Re: [Format] Timestamp timezone semantics?

Posted by Julian Hyde <jh...@apache.org>.
Good time libraries support all. E.g. Jodatime [1] has

* Instant - an instantaneous point on the time-line
* DateTime - full date and time with time-zone
* LocalDateTime - date-time without a time-zone

The SQL world isn't quite as much of a mess as Adam makes it out to
be. The SQL standard defines TIMESTAMP, DATE and TIME as zoneless
(like Joda's LocalDateTime) and most DBs have types that behave in
that way. Often those DBs also have types that behave like Instant and
DateTime (but naming is a little inconsistent).

I recommend that Arrow supports all three. Choose clear, distinct
names for all three, consistent with names used elsewhere in the
industry.

Any SQL interface to Arrow should follow the SQL standard. So, for
instance, if a column has TIMESTAMP type, it should behave as a
date-time without a time-zone.

Julian

[1] https://www.joda.org/joda-time/

On Wed, Jun 2, 2021 at 10:43 AM Rok Mihevc <ro...@gmail.com> wrote:
>
> On Wed, Jun 2, 2021 at 3:23 PM Joris Peeters <jo...@gmail.com>
> wrote:
>
> > You could store epoch offsets, but interpret them in the local timezone.
> > E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New York
> > timezone.
> > At least one nasty problem with that is ambiguous times, i.e. when the
> > clock turns back on going from DST to ST, as well as invalid times (when
> > the clock moves forwards, meaning some epoch offsets never occur).
> >
>
> Another problem is calendars change (see Adam's points) so the offset would
> not be constant.

Re: [Format] Timestamp timezone semantics?

Posted by Rok Mihevc <ro...@gmail.com>.
On Wed, Jun 2, 2021 at 3:23 PM Joris Peeters <jo...@gmail.com>
wrote:

> You could store epoch offsets, but interpret them in the local timezone.
> E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New York
> timezone.
> At least one nasty problem with that is ambiguous times, i.e. when the
> clock turns back on going from DST to ST, as well as invalid times (when
> the clock moves forwards, meaning some epoch offsets never occur).
>

Another problem is calendars change (see Adam's points) so the offset would
not be constant.

Re: [Format] Timestamp timezone semantics?

Posted by Joris Peeters <jo...@gmail.com>.
You could store epoch offsets, but interpret them in the local timezone.
E.g. (0, "America/New_York") could mean 1970-01-01 00:00:00 in the New York
timezone.
At least one nasty problem with that is ambiguous times, i.e. when the
clock turns back on going from DST to ST, as well as invalid times (when
the clock moves forwards, meaning some epoch offsets never occur).

On Wed, Jun 2, 2021 at 1:58 PM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Hello,
> >
> > For the first time I notice this piece of information about the
> > timestamp type:
> >
> > >   /// * If the time zone is set to a valid value, values can be
> > displayed as
> > >   ///   "localized" to that time zone, even though the underlying
> 64-bit
> > >   ///   integers are identical to the same data stored in UTC.
> Converting
> > >   ///   between time zones is a metadata-only operation and does not
> > change the
> > >   ///   underlying values
> >
> > (from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223
> )
> >
> > This seems rather weird to me: timestamps always convey a UTC timestamp
> > value, optionally decorated with a local timezone?  What is the
> > motivation for such a representation?  It is unlike other systems such
> > as Python, where a timezone-aware timestamp really expresses a local
> > time value, not a UTC time value.
> >
>
> Just as reference: pandas uses the same model of storing UTC timestamps for
> timezone-aware data (I think numpy also stored it as UTC, before they
> removed support for it). And for example, I think also databases like
> Postgresql store it as UTC internally, AFAIK.
> The Python standard library datetime.datetime indeed stores localized
> timestamps. But important difference is that Python actually stores the
> year/month/day/hour/etc as separate values, so directly representing an
> actual moment in time in a certain timezone. While I think what we store is
> considered as "unix time"? (epoch since January 1st, 1970 at UTC) I am not
> sure how you would store a timestamp in a certain timezone in this model.
>
> Some advantages of storing UTC that come to mind: it makes converting from
> one timezone to another a trivial (metadata-only) operation, makes easier
> to do timestamp comparisons across timezones, and it makes
> timedelta-arithmetic easier.
>
> Joris
>
>
> > Thank you,
> >
> > Antoine.
> >
>

Re: [Format] Timestamp timezone semantics?

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou <an...@python.org> wrote:

>
> Hello,
>
> For the first time I notice this piece of information about the
> timestamp type:
>
> >   /// * If the time zone is set to a valid value, values can be
> displayed as
> >   ///   "localized" to that time zone, even though the underlying 64-bit
> >   ///   integers are identical to the same data stored in UTC. Converting
> >   ///   between time zones is a metadata-only operation and does not
> change the
> >   ///   underlying values
>
> (from https://github.com/apache/arrow/blob/master/format/Schema.fbs#L223 )
>
> This seems rather weird to me: timestamps always convey a UTC timestamp
> value, optionally decorated with a local timezone?  What is the
> motivation for such a representation?  It is unlike other systems such
> as Python, where a timezone-aware timestamp really expresses a local
> time value, not a UTC time value.
>

Just as reference: pandas uses the same model of storing UTC timestamps for
timezone-aware data (I think numpy also stored it as UTC, before they
removed support for it). And for example, I think also databases like
Postgresql store it as UTC internally, AFAIK.
The Python standard library datetime.datetime indeed stores localized
timestamps. But important difference is that Python actually stores the
year/month/day/hour/etc as separate values, so directly representing an
actual moment in time in a certain timezone. While I think what we store is
considered as "unix time"? (epoch since January 1st, 1970 at UTC) I am not
sure how you would store a timestamp in a certain timezone in this model.

Some advantages of storing UTC that come to mind: it makes converting from
one timezone to another a trivial (metadata-only) operation, makes easier
to do timestamp comparisons across timezones, and it makes
timedelta-arithmetic easier.

Joris


> Thank you,
>
> Antoine.
>