Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2021/06/14 15:57:00 UTC

[Format][Important] Needed clarification of timezone-less timestamps

Hello,

In ARROW-13033, there was a disagreement as to how the specification 
about timezone-less timestamps should be interpreted.

Here is the wording in the Schema specification:

>   /// * If the time zone is null or equal to an empty string, the data is "time
>   ///   zone naive" and shall be displayed *as is* to the user, not localized
>   ///   to the locale of the user. This data can be thought of as UTC but
>   ///   without having "UTC" as the time zone, it is not considered to be
>   ///   localized to any time zone

My interpretation is that timestamp *values* are always expressed in 
UTC.  The timezone is an optional piece of metadata that describes the 
context in which they were obtained, but it does not affect how the 
*values* should be interpreted.

Joris' interpretation is that timestamp *values* are expressed in an 
arbitrary "local time" that is unknown and unspecified. It is therefore 
difficult to exactly interpret them, since the timezone information is 
unavailable.

(I'll let Joris express his thoughts more accurately, but the gist of 
his opinion is that "can be thought of as UTC" is only an indication, 
not a prescription.)
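The difference between the two readings can be sketched with Python's 
stdlib datetime (a minimal illustration; either way, the stored value is 
the same raw integer):

```python
from datetime import datetime, timedelta, timezone

raw = 136_800_000  # milliseconds stored in a timezone-less timestamp column

# Reading 1: the value is an offset from the UTC epoch; it identifies a
# well-defined physical instant.
as_utc = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)

# Reading 2: the value is a wall-clock ("local") time in an unknown zone;
# its fields can be decomposed, but no physical instant is identified.
as_naive = datetime(1970, 1, 1) + timedelta(milliseconds=raw)

print(as_utc.isoformat())    # 1970-01-02T14:00:00+00:00
print(as_naive.isoformat())  # 1970-01-02T14:00:00 (no offset: ambiguous)
```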


To me, the problem with the "unknown local timezone" interpretation is 
that it renders the data essentially ambiguous and useless.  The problem 
is very similar to that of string data without a known encoding, 
familiar to Python users as the Python 2 encoding hell (to the point 
that it motivated the heavy and disruptive Python 3 transition).

(Note that the problem is even worse for timestamps. You can detect, 
with a high degree of probability, that an arbitrary binary string is 
*not* UTF-8-encoded. You cannot do so with timestamp values: any 64-bit 
timestamp may or may not be a UTC timestamp. Once that information is 
lost, it cannot be regained.)
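The asymmetry with strings can be shown directly (a sketch; the byte 
string and the millisecond value are arbitrary stand-ins):

```python
from datetime import datetime, timezone

# Invalid UTF-8 is detectable: decoding raises.
def is_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# But every reasonable 64-bit value decodes to *some* timestamp; nothing
# in the value itself says whether it was UTC-normalized.
def decodes_as_timestamp(ms: int) -> bool:
    datetime.fromtimestamp(ms / 1000, tz=timezone.utc)  # no error for sane inputs
    return True

print(is_utf8(b'\xff\xfe legacy'))        # False -> detectable
print(decodes_as_timestamp(136_800_000))  # True  -> indistinguishable
```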

In any case, I think this must be clarified, first on this mailing-list, 
then by making the spec wording stronger and more prescriptive.

Regards

Antoine.

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
On 15/06/2021 at 19:18, Weston Pace wrote:
> Thanks for the excellent summary everyone.  I agree with these
> summaries that have been pointed out.  It seems like things are moving
> towards consensus.

Hmm, are we so sure? I don't think I've seen widespread agreement about 
how the spec should be interpreted (see Wes' and Joris' messages).

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
On Thu, Jun 17, 2021 at 5:56 PM Micah Kornfield <em...@gmail.com>
wrote:

>
> Could others on the thread confirm this is the issue up for debate?  Are
> there subtleties/operations we need to consider?
>

Thrilled to take this to the doc
<https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit#>
you set up!

But to answer your question here: my understanding is we're debating how to
store an Instant in Arrow. Or conversely, how to interpret a timestamp that
has no timezone field.

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
Given that multiple external systems distinguish a UTC ZonedDateTime
from an Instant, I'd be open to considering this as a first-class type
(in the Type union), even though in Arrow-land the data representation
is the same. As far as what computing operations we would support on
instants: arithmetic among instants and durations, along with casting
to ZonedDateTime, would be about it. We would need to consider the
implications for existing interoperability with other systems (for
example, reading a UTC-normalized timestamp from Parquet should still
yield a UTC timestamp in Arrow, not an instant).
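The operation set described above can be sketched with stdlib datetime 
(illustrative only; there is no Arrow instant type today, so an aware UTC 
datetime stands in for one):

```python
from datetime import datetime, timedelta, timezone

# An instant: a point on the UTC timeline, stored as an epoch offset.
instant = datetime(1970, 1, 2, 14, 0, tzinfo=timezone.utc)

# Arithmetic among instants and durations is well-defined...
later = instant + timedelta(hours=3)
assert later - instant == timedelta(hours=3)

# ...as is casting to a ZonedDateTime: the physical moment is unchanged;
# only field extraction moves to the target zone (fixed offset used here).
zoned = instant.astimezone(timezone(timedelta(hours=-8), 'PST'))
print(zoned.hour)  # 6
```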

On Tue, Jun 22, 2021 at 3:23 PM Julian Hyde <jh...@gmail.com> wrote:
>
> Thanks, Weston. I guess I’m in the camp that wants Instant to be a first-class type, on a par with ZonedDateTime and LocalDateTime. I have added my comments to the “Arguments for” section in the 2nd doc.
>
> > On Jun 22, 2021, at 10:38 AM, Weston Pace <we...@gmail.com> wrote:
> >
> > There are two (related) proposals being discussed, I think.  I
> > created streamlined documents for each.  Feel free to add arguments
> > for/against or to clarify.  If no one has any better ideas I'll start
> > a vote topic for each one in ~48 hours.
> >
> > # Proposal: Clarify meaning of timestamp without time zone
> > * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing
> >
> > # Proposal: Arrow should define how an “Instant” is stored
> > * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing
> >
> >
> > On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jh...@gmail.com> wrote:
> >>
> >> My proposal is that Arrow should support three different kinds of date-times: zoneless, zoned, and instant. (Not necessarily with those names.)
> >>
> >> All three kinds occur frequently in the industry.
> >>
> >> Many systems only have two, and users of those systems have figured out how to make do. (For example, you can implement an instant using a zoneless, and vice versa, if you are careful.) But let’s suppose that Arrow has two of the three, and needs to interoperate with a system that has a different two of the three. Chaos ensues.
> >>
> >> Let’s just implement all three.
> >>
> >> Julian
> >>
> >>
> >>
> >>
> >>> On Jun 22, 2021, at 8:46 AM, Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> Let's see a streamlined document about what we are voting on — I have
> >>> limited bandwidth to read through and synthesize the discussion myself
> >>> and I am probably not the only one. It has always been my
> >>> understanding to represent time elapsed from the UNIX epoch
> >>> (1970-01-01 00:00:00 UTC)  as a timestamp with tz=UTC (which I believe
> >>> is what is being called an "instant").
> >>>
> >>> On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <ad...@adamhooper.com> wrote:
> >>>>
> >>>> Maybe Arrow should add a new type, "INSTANT"?
> >>>>
> >>>> Instant (seconds since the epoch) is the gold standard in storing moments
> >>>> in time. All programming languages; all RDBMSs; Parquet ... everybody uses
> >>>> this.
> >>>>
> >>>> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean
> >>>> Instant because I read the docs *assuming* Arrow stores Instant.
> >>>>
> >>>> I know, I know, no vote can stop me from interpreting "timestamp without
> >>>> timezone" however the heck I want. But it's painful for me to transition
> >>>> from happy user to heretic.
> >>>>
> >>>> Voting to clarify that Arrow doesn't store Instants is voting to clarify
> >>>> that Arrow *doesn't* do something extremely useful. It's voting for a
> >>>> negative. That sounds painful! What if there were positives to vote for? An
> >>>> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)?
> >>>> A fiat that timezone=UTC means Instant, not ZonedDateTime?
> >>>>
> >>>> Enjoy life,
> >>>> Adam
> >>>>
> >>>> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <we...@gmail.com> wrote:
> >>>>
> >>>>> I agree that a vote would be a good idea.  Do you want to start a
> >>>>> dedicated vote thread?  I can write one up too if you'd rather.
> >>>>>
> >>>>> -Weston
> >>>>>
> >>>>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <em...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> I think comments on the doc are tailing off.  Jorge's test cases I think
> >>>>>> still need some more careful analysis but Weston has provided an
> >>>>>> initial pass.
> >>>>>>
> >>>>>> The matter not resolved on the document is whether Timestamp without
> >>>>> timezone
> >>>>>> logically represents multi-field date and time (that does not represent a
> >>>>>> specific instant) or whether it logically represents an instant (some
> >>>>>> measurable offset from an epoch).   Based on comments on the
> >>>>> documentation
> >>>>>> both the C++/Python and Java implementations (those that have
> >>>>>> generally been considered "reference") have evidence that the
> >>>>>> former representation is what is intended (some links are in the
> >>>>> document).
> >>>>>>
> >>>>>> We can probably continue to debate what is useful but it seems ultimately
> >>>>>> we need to pick one or the other and clarify the specification.  Given
> >>>>> how
> >>>>>> the reference implementations currently work I think we should err on
> >>>>> the
> >>>>>> side of interpreting these values as date times.  Ultimately, given the
> >>>>>> contention here we will likely need to vote on this.
> >>>>>>
> >>>>>> More comments on the document or here are still useful in case we've
> >>>>> missed
> >>>>>> an interpretation or there are other facts to consider.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Micah
> >>>>>>
> >>>>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> >>>>>> jorgecarleitao@gmail.com> wrote:
> >>>>>>
> >>>>>>> Thank you everyone for participating so far; really important and
> >>>>>>> useful discussion.
> >>>>>>>
> >>>>>>> I think of this discussion as a set of test cases over behavior:
> >>>>>>>
> >>>>>>> parameterization:
> >>>>>>> * Timestamp(ms, None)
> >>>>>>> * Timestamp(ms, "00:00")
> >>>>>>> * Timestamp(ms, "01:00")
> >>>>>>>
> >>>>>>> Cases:
> >>>>>>> * its string representation equals to
> >>>>>>> * add a duration equals to
> >>>>>>> * add an interval equals to
> >>>>>>> * subtract a Timestamp(ms, None) equals to
> >>>>>>> * subtract a Timestamp(ms, "01:00") equals to
> >>>>>>> * subtract a Date32 equals to
> >>>>>>> * subtract a Time32(ms) equals to
> >>>>>>> * extract the day equals to
> >>>>>>> * extract the timezone equals to
> >>>>>>> * cast to Timestamp(ms, None) equals to
> >>>>>>> * cast to Timestamp(ms, "01:00") equals to
> >>>>>>> * write to parquet v2 equals to (physical value and logical type)
> >>>>>>>
> >>>>>>> In all cases, the result may either be valid or invalid. If valid, we
> >>>>>>> would need a datatype and an actual value.
> >>>>>>> I was hoping to be able to answer each of the above at the end of this
> >>>>>>> discussion.
> >>>>>>>
> >>>>>>> I've suggested adding these in the google docs.
> >>>>>>>
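Jorge's matrix can be written out as data: 3 parameterizations × 12 
operations, each cell needing either a (result datatype, result value) 
pair or an explicit "invalid" (a sketch; names abbreviated from the list 
above):

```python
parameterizations = [
    'Timestamp(ms, None)',
    'Timestamp(ms, "00:00")',
    'Timestamp(ms, "01:00")',
]
operations = [
    'string representation',
    'add duration',
    'add interval',
    'subtract Timestamp(ms, None)',
    'subtract Timestamp(ms, "01:00")',
    'subtract Date32',
    'subtract Time32(ms)',
    'extract day',
    'extract timezone',
    'cast to Timestamp(ms, None)',
    'cast to Timestamp(ms, "01:00")',
    'write to parquet v2',
]

# Every cell starts undecided; the discussion's goal is to fill them all in.
undecided = {(p, op): None for p in parameterizations for op in operations}
print(len(undecided))  # 36 cells to decide
```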
> >>>>>>> Best,
> >>>>>>> Jorge
> >>>>>>>
> >>>>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <
> >>>>> emkornfield@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> I've posted the examples above in
> >>>>>>>>
> >>>>>>>
> >>>>> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> >>>>>>>> because I think it would be better to collaborate there instead of
> >>>>> linear
> >>>>>>>> e-mail history and then bring the consensus back to the list.
> >>>>>>>>
> >>>>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <
> >>>>> emkornfield@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I feel like we might still be talking past each other here or at
> >>>>> least
> >>>>>>> I
> >>>>>>>>> don't understand the two sides of this.  I'll try to expand
> >>>>> Weston's
> >>>>>>>>> example because I think it provides the best clarification.
> >>>>>>>>>
> >>>>>>>>> (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> >>>>>>> assuming
> >>>>>>>>> ms) for a timestamp column without timezone (always).   This
> >>>>>>> represents an
> >>>>>>>>> offset from the unix epoch.  This interpretation should not change
> >>>>>>> based on
> >>>>>>>>> the local system timezone.  Extracting the hour field always
> >>>>> yields 14
> >>>>>>>>> (extraction is done relative to UTC).
> >>>>>>>>>
> >>>>>>>>> The alternative here seems to be that we can encode (1970, 1, 2,
> >>>>> 14,
> >>>>>>> 0) in
> >>>>>>>>> multiple different ways depending on what the current local system
> >>>>> time
> >>>>>>>>> is.  As a note, I think ORC and Spark do this, and it leads to
> >>>>>>>>> confusion/misinterpretation when trying to transfer data.
> >>>>>>>>>
> >>>>>>>>> If we then convert this column to a timestamp with timezone "UTC",
> >>>>>>>>> extracting the hour field still yields 14.  If the column is
> >>>>>>>>> converted to timestamp with timezone PST, extracting an hour would
> >>>>>>>>> yield 6 (assuming PST = GMT-8).  Through all of these changes the data
> >>>>> bits do
> >>>>>>> not
> >>>>>>>>> change.
> >>>>>>>>>
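This walk-through can be sketched with stdlib datetime (136,800,000 ms is 
1970-01-02 14:00; the integer stands in for the Arrow buffer):

```python
from datetime import datetime, timedelta, timezone

ms = 136_800_000  # the stored bits; these never change below

# Timestamp without timezone: field extraction is done relative to UTC.
naive = datetime(1970, 1, 1) + timedelta(milliseconds=ms)
assert naive.hour == 14

# Converted to timestamp with timezone "UTC": the hour is still 14.
utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
assert utc.hour == 14

# Converted to timestamp with timezone PST (GMT-8): extraction yields 6,
# yet the underlying epoch offset is identical.
pst = utc.astimezone(timezone(timedelta(hours=-8)))
assert pst.hour == 6
assert pst.timestamp() == utc.timestamp() == ms / 1000
```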
> >>>>>>>>> Display is not mentioned because I think the earlier points about
> >>>>>>>>> how a time is displayed are correct. Applications can choose what
> >>>>>>>>> they feel makes sense to them (as long as they don't start
> >>>>>>>>> automatically tacking on timezones to naive timestamps).  My
> >>>>>>>>> interpretation of the specification has been that "display" was
> >>>>>>>>> kind of shorthand for field extraction.
> >>>>>>>>>
> >>>>>>>>> Could others on the thread confirm this is the issue up for debate?
> >>>>>>> Are
> >>>>>>>>> there subtleties/operations we need to consider?
> >>>>>>>>>
> >>>>>>>>> I also agree that we should document recommended conversion
> >>>>> practices
> >>>>>>> from
> >>>>>>>>> other systems.
> >>>>>>>>>
> >>>>>>>>> -Micah
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> So let's invent a third way.  I could use
> >>>>>>>>>> the first 16 bits for the year, the next 8 bits for the month, the
> >>>>>>>>>> next 8 bits for the day of month, the next 8 bits for the hour,
> >>>>> the
> >>>>>>>>>> next 8 bits for the minute, and the remaining bits for the
> >>>>> seconds.
> >>>>>>>>>> Using this method I would store (1970, 1, 2, 14, 0) as
> >>>>>>>>>> 0x07B201020E000000.
> >>>>>>>>>
> >>>>>>>>> As an aside, with some small variation, this is what ZetaSQL uses [2].
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>>
> >>>>>>>
> >>>>> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> >>>>>>>>> [2]
> >>>>>>>>>
> >>>>>>>
> >>>>> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
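The "third way" layout quoted above packs civil fields positionally; a 
sketch of that packing:

```python
# Pack (year, month, day, hour, minute, second) into the described layout:
# 16-bit year | 8-bit month | 8-bit day | 8-bit hour | 8-bit minute | 16-bit seconds
def pack_civil(year, month, day, hour, minute, second=0):
    return ((year << 48) | (month << 40) | (day << 32)
            | (hour << 24) | (minute << 16) | second)

packed = pack_civil(1970, 1, 2, 14, 0)
print(f'0x{packed:016X}')  # 0x07B201020E000000
```

Unlike an epoch offset, such a value cannot be compared or subtracted 
directly, which is part of why the representations are not interchangeable.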
> >>>>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmckinn@gmail.com
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> >>>>>>>>>>> with/without time zone — in some SQL implementations each slot
> >>>>> can
> >>>>>>>>>>> have a different time zone
> >>>>>>>>>>> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >>>>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be
> >>>>>>> taken
> >>>>>>>>>>> or given as timezone local time"
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> RDBMSs conflict (universally) with ANSI.
> >>>>>>>>>>
> >>>>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since
> >>>>> the
> >>>>>>> epoch.
> >>>>>>>>>> It has no timezone.
> >>>>>>>>>>
> >>>>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> >>>>>>> Instant
> >>>>>>>>>> since the epoch. It has no timezone.
> >>>>>>>>>>
> >>>>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> >>>>>>> datetime" in
> >>>>>>>>>> *function*, but not in implementation:
> >>>>>>>>>>
> >>>>>>>>>>  - MySQL DATETIME
> >>>>>>>>>>  <
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> >>>>>>>>>>>
> >>>>>>>>>>  is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> >>>>>>>>>>  - MSSQL
> >>>>>>>>>>  <
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> >>>>>>>>>>>
> >>>>>>>>>>  uses 6, 7 or 8 bytes
> >>>>>>>>>>  - PostgreSQL stores an integer, but I think its epoch is still
> >>>>>>>>>> different
> >>>>>>>>>>  <
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> >>>>>>>>>>>
> >>>>>>>>>> (it
> >>>>>>>>>>  used to store doubles since 2000-01-01)
> >>>>>>>>>>
> >>>>>>>>>> ... so in general, moving datetimes from these systems into 64-bit
> >>>>>>>>>> integers
> >>>>>>>>>> is nontrivial and lossy.
> >>>>>>>>>>
> >>>>>>>>>> Spark / Databricks discusses how Spark handles this
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> >>>>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> >>>>>>> zone,
> >>>>>>>>>>> and are wall clock timestamps." — not UTC-normalized
> >>>>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time
> >>>>> that
> >>>>>>> the
> >>>>>>>>>>> timestamp represents, as that is fully represented by the UTC
> >>>>> time
> >>>>>>>>>>> instant given by the other timestamp components"
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I don't use Spark, but I read that page twice. First reading, I
> >>>>> got
> >>>>>>> the
> >>>>>>>>>> same thing out of it. But the second time I read it, I read the
> >>>>>>> opposite!
> >>>>>>>>>>
> >>>>>>>>>> The key part is: "*Spark SQL defines the timestamp type as
> >>>>> TIMESTAMP
> >>>>>>> WITH
> >>>>>>>>>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a
> >>>>> TIMESTAMP
> >>>>>>>>>> WITH
> >>>>>>>>>> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one
> >>>>> Timestamp
> >>>>>>>>>> type:
> >>>>>>>>>> a 64-bit Instant since the epoch. (It also has a Date type.)
> >>>>>>>>>>
> >>>>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL
> >>>>>>> TIMESTAMP
> >>>>>>>>>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
> >>>>> transmit
> >>>>>>>>>> timestamps as bare 64-bit integers since the epoch -- without
> >>>>>>> timezone.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >>>>>>>>>>> interpreted as UTC-normalized, that would force all of these
> >>>>> other
> >>>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Are those systems' 64-bit integers interoperable in the first
> >>>>> place?
> >>>>>>>>>>
> >>>>>>>>>> As I understand it, there's a ton of variance out there when
> >>>>> encoding
> >>>>>>>>>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> >>>>>>> encoding
> >>>>>>>>>> is one of many. As I mentioned in another thread, programming
> >>>>>>> languages
> >>>>>>>>>> all
> >>>>>>>>>> use structs.
> >>>>>>>>>>
> >>>>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> >>>>> "naive"
> >>>>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE.
> >>>>>>>>>>>
> >>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >>>>>>>>>>> interpreted as UTC-normalized, that would force all of these
> >>>>> other
> >>>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
> >>>>>>> (i.e.
> >>>>>>>>>>> calling the equivalent of pandas's tz_localize function) when
> >>>>> they
> >>>>>>>>>>> convert to Arrow.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Alternatives:
> >>>>>>>>>>
> >>>>>>>>>>  - int64
> >>>>>>>>>>  - date32+time64
> >>>>>>>>>>  - date32+time32
> >>>>>>>>>>
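The date32+time64 alternative above can be sketched as a lossless 
decomposition of a naive datetime (an illustration only):

```python
from datetime import date, datetime, timedelta

dt = datetime(1970, 1, 2, 14, 0)

# date32: days since the epoch; time64: microseconds since midnight.
date32 = (dt.date() - date(1970, 1, 1)).days
midnight = datetime.combine(dt.date(), datetime.min.time())
time64 = (dt - midnight) // timedelta(microseconds=1)

# Recombining loses nothing.
roundtrip = datetime(1970, 1, 1) + timedelta(days=date32, microseconds=time64)
assert roundtrip == dt
print(date32, time64)  # 1 50400000000
```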
> >>>>>>>>>> This seems very harmful to me, and will make data
> >>>>>>>>>>> from these systems not accurately representable in Arrow and
> >>>>> unable
> >>>>>>> to
> >>>>>>>>>>> be round-tripped.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Certainly nobody wants to go backwards.
> >>>>>>>>>>
> >>>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL
> >>>>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
> >>>>> -- in
> >>>>>>>>>> Arrow?
> >>>>>>>>>>
> >>>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes*
> >>>>> in
> >>>>>>>>>> Arrow?
> >>>>>>>>>> (I'd expect this to be messier, since every system/language uses a
> >>>>>>>>>> different byte structure.)
> >>>>>>>>>>
> >>>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how
> >>>>> many
> >>>>>>>>>>> use cases would be disenfranchised by requiring UTC
> >>>>> normalization
> >>>>>>>>>>> always.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hear, hear!
> >>>>>>>>>>
> >>>>>>>>>> Can we also poll people to find out how they're storing Instants
> >>>>>>> today?
> >>>>>>>>>>
> >>>>>>>>>> Enjoy life,
> >>>>>>>>>> Adam
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Adam Hooper
> >>>>>>>>>> +1-514-882-9694
> >>>>>>>>>> http://adamhooper.com
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Adam Hooper
> >>>> +1-514-882-9694
> >>>> http://adamhooper.com
> >>
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Julian Hyde <jh...@gmail.com>.
Thanks, Weston. I guess I’m in the camp that wants Instant to be a first-class type, on a par with ZonedDateTime and LocalDateTime. I have added my comments to the “Arguments for” section in the 2nd doc.

>>>>>>>>>> type:
>>>>>>>>>> a 64-bit Instant since the epoch. (It also has a Date type.)
>>>>>>>>>> 
>>>>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL
>>>>>>> TIMESTAMP
>>>>>>>>>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
>>>>> transmit
>>>>>>>>>> timestamps as bare 64-bit integers since the epoch -- without
>>>>>>> timezone.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>>>>>>>>>>> interpreted as UTC-normalized, that would force all of these
>>>>> other
>>>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Are those systems' 64-bit integers interoperable in the first
>>>>> place?
>>>>>>>>>> 
>>>>>>>>>> As I understand it, there's a ton of variance out there when
>>>>> encoding
>>>>>>>>>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
>>>>>>> encoding
>>>>>>>>>> is one of many. As I mentioned in another thread, programming
>>>>>>> languages
>>>>>>>>>> all
>>>>>>>>>> use structs.
>>>>>>>>>> 
>>>>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
>>>>> "naive"
>>>>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE.
>>>>>>>>>>> 
>>>>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>>>>>>>>>>> interpreted as UTC-normalized, that would force all of these
>>>>> other
>>>>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
>>>>>>> (i.e.
>>>>>>>>>>> calling the equivalent of pandas's tz_localize function) when
>>>>> they
>>>>>>>>>>> convert to Arrow.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Alternatives:
>>>>>>>>>> 
>>>>>>>>>>  - int64
>>>>>>>>>>  - date32+time64
>>>>>>>>>>  - date32+time32
>>>>>>>>>> 
>>>>>>>>>> This seems very harmful to me, and will make data
>>>>>>>>>>> from these systems not accurately representable in Arrow and
>>>>> unable
>>>>>>> to
>>>>>>>>>>> be round-tripped.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Certainly nobody wants to go backwards.
>>>>>>>>>> 
>>>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL
>>>>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
>>>>> -- in
>>>>>>>>>> Arrow?
>>>>>>>>>> 
>>>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes*
>>>>> in
>>>>>>>>>> Arrow?
>>>>>>>>>> (I'd expect this to be messier, since every system/language uses a
>>>>>>>>>> different byte structure.)
>>>>>>>>>> 
>>>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how
>>>>> many
>>>>>>>>>>> use cases would be disenfranchised by requiring UTC
>>>>> normalization
>>>>>>>>>>> always.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hear, hear!
>>>>>>>>>> 
>>>>>>>>>> Can we also poll people to find out how they're storing Instants
>>>>>>> today?
>>>>>>>>>> 
>>>>>>>>>> Enjoy life,
>>>>>>>>>> Adam
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Adam Hooper
>>>>>>>>>> +1-514-882-9694
>>>>>>>>>> http://adamhooper.com
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Adam Hooper
>>>> +1-514-882-9694
>>>> http://adamhooper.com
>> 


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
There are two (related) proposals being discussed, I think.  I
created streamlined documents for each.  Feel free to add arguments
for/against or to clarify.  If no one has any better ideas I'll start
a vote topic for each one in ~48 hours.

# Proposal: Clarify meaning of timestamp without time zone
 * https://docs.google.com/document/d/1wDAuxEDVo3YxZx20fGUGqQxi3aoss7TJ-TzOUjaoZk8/edit?usp=sharing

# Proposal: Arrow should define how an “Instant” is stored
 * https://docs.google.com/document/d/1xEKRhs-GUSMwjMhgmQdnCNMXwZrA10226AcXRoP8g9E/edit?usp=sharing
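For readers skimming the thread, the disagreement behind the first proposal can be made concrete with a short stdlib-only sketch (variable names are mine, not from either proposal): both readings render the same wall-clock fields, but only one of them pins the value to a physical instant.

```python
from datetime import datetime, timezone, timedelta

millis = 136_800_000  # some stored Timestamp(ms, None) value

# Reading 1: the int64 is a UTC-normalized instant.
as_instant = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

# Reading 2: the int64 encodes wall-clock fields in an unknown local
# zone; "displayed as is" means rendering those fields with no zone.
as_wall_clock = datetime(1970, 1, 1) + timedelta(milliseconds=millis)

print(as_instant.isoformat())     # 1970-01-02T14:00:00+00:00
print(as_wall_clock.isoformat())  # 1970-01-02T14:00:00
```

The rendered fields agree; the dispute is whether the first, instant-bearing interpretation is normative.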


On Tue, Jun 22, 2021 at 6:01 AM Julian Hyde <jh...@gmail.com> wrote:
>
> My proposal is that Arrow should support three different kinds of date-times: zoneless, zoned, and instant. (Not necessarily with those names.)
>
> All three kinds occur frequently in the industry.
>
> Many systems only have two, and users of those systems have figured out how to make do. (For example, you can implement an instant using a zoneless, and vice versa, if you are careful.) But let’s suppose that Arrow has two of the three, and needs to interoperate with a system that has a different two of the three. Chaos ensues.
>
> Let’s just implement all three.
>
> Julian
>
>
>
>
> > On Jun 22, 2021, at 8:46 AM, Wes McKinney <we...@gmail.com> wrote:
> >
> > Let's see a streamlined document about what we are voting on — I have
> > limited bandwidth to read through and synthesize the discussion myself
> > and I am probably not the only one. It has always been my
> > understanding to represent time elapsed from the UNIX epoch
> > (1970-01-01 00:00:00 UTC)  as a timestamp with tz=UTC (which I believe
> > is what is being called an "instant").
> >
> > On Tue, Jun 22, 2021 at 9:01 AM Adam Hooper <ad...@adamhooper.com> wrote:
> >>
> >> Maybe Arrow should add a new type, "INSTANT"?
> >>
> >> Instant (seconds since the epoch) is the gold standard in storing moments
> >> in time. All programming languages; all RDBMSs; Parquet ... everybody uses
> >> this.
> >>
> >> I use Instants, too. I interpreted TIMESTAMP with no metadata to mean
> >> Instant because I read the docs *assuming* Arrow stores Instant.
> >>
> >> I know, I know, no vote can stop me from interpreting "timestamp without
> >> timezone" however the heck I want. But it's painful for me to transition
> >> from happy user to heretic.
> >>
> >> Voting to clarify that Arrow doesn't store Instants is voting to clarify
> >> that Arrow *doesn't* do something extremely useful. It's voting for a
> >> negative. That sounds painful! What if there were positives to vote for? An
> >> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)?
> >> A fiat that timezone=UTC means Instant, not ZonedDateTime?
> >>
> >> Enjoy life,
> >> Adam
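Adam's point that an Instant survives storage as a bare integer can be sketched as follows (ms resolution assumed; the helper names are illustrative, not any library's API):

```python
from datetime import datetime, timezone

def store(dt_aware: datetime) -> int:
    # An aware datetime maps to a single int64, independent of any zone.
    return int(dt_aware.timestamp() * 1000)

def load(millis: int) -> datetime:
    # Reading it back relative to UTC recovers the same physical moment.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

moment = datetime(2021, 6, 22, 9, 1, tzinfo=timezone.utc)
assert load(store(moment)) == moment  # lossless round trip
```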
> >>
> >> On Mon, Jun 21, 2021 at 11:39 PM Weston Pace <we...@gmail.com> wrote:
> >>
> >>> I agree that a vote would be a good idea.  Do you want to start a
> >>> dedicated vote thread?  I can write one up too if you'd rather.
> >>>
> >>> -Weston
> >>>
> >>> On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <em...@gmail.com>
> >>> wrote:
> >>>>
> >>>> I think comments on the doc are tailing off.  Jorge's test cases I think
> >>>> still need some more careful analysis but Weston has provided an
> >>>> initial pass.
> >>>>
> >>>> The matter not resolved on the document is whether Timestamp without
> >>> timezone
> >>>> logically represents multi-field date and time (that does not represent a
> >>>> specific instant) or whether it logically represents an instant (some
> >>>> measurable offset from an epoch).   Based on comments on the
> >>> documentation
> >>>> both C++/Python implementations and the Java implementations (those that
> >>>> have generally been considered "reference") have evidence that the
> >>>> former representation is what is intended (some links are in the
> >>> document).
> >>>>
> >>>> We can probably continue to debate what is useful but it seems ultimately
> >>>> we need to pick one or the other and clarify the specification.  Given
> >>> how
> >>>> the reference implementations currently work I think we should err on
> >>> the
> >>>> side of interpreting these values as date times.  Ultimately, given the
> >>>> contention here we will likely need to vote on this.
> >>>>
> >>>> More comments on the document or here are still useful in case we've
> >>> missed
> >>>> an interpretation or there are other facts to consider.
> >>>>
> >>>> Cheers,
> >>>> Micah
> >>>>
> >>>> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> >>>> jorgecarleitao@gmail.com> wrote:
> >>>>
> >>>>> Thank you everyone for participating so far; really important and
> >>>>> useful discussion.
> >>>>>
> >>>>> I think of this discussion as a set of test cases over behavior:
> >>>>>
> >>>>> parameterization:
> >>>>> * Timestamp(ms, None)
> >>>>> * Timestamp(ms, "00:00")
> >>>>> * Timestamp(ms, "01:00")
> >>>>>
> >>>>> Cases:
> >>>>> * its string representation equals to
> >>>>> * add a duration equals to
> >>>>> * add an interval equals to
> >>>>> * subtract a Timestamp(ms, None) equals to
> >>>>> * subtract a Timestamp(ms, "01:00") equals to
> >>>>> * subtract a Date32 equals to
> >>>>> * subtract a Time32(ms) equals to
> >>>>> * extract the day equals to
> >>>>> * extract the timezone equals to
> >>>>> * cast to Timestamp(ms, None) equals to
> >>>>> * cast to Timestamp(ms, "01:00") equals to
> >>>>> * write to parquet v2 equals to (physical value and logical type)
> >>>>>
> >>>>> In all cases, the result may either be valid or invalid. If valid, we
> >>>>> would need a datatype and an actual value.
> >>>>> I was hoping to be able to answer each of the above at the end of this
> >>>>> discussion.
> >>>>>
> >>>>> I've suggested adding these in the google docs.
> >>>>>
> >>>>> Best,
> >>>>> Jorge
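Two of Jorge's cases, worked through under the "values are UTC" reading (a sketch with invented helpers, not Arrow API):

```python
from datetime import datetime, timezone

# Model a timestamp as (millis, tz); tz=None is the timezone-less case.
def extract_day(ts):
    millis, tz = ts
    # Field extraction for tz=None is done relative to UTC.
    return datetime.fromtimestamp(millis / 1000, tz=tz or timezone.utc).day

def add_duration(ts, duration_ms):
    millis, tz = ts
    return (millis + duration_ms, tz)  # duration arithmetic ignores the zone

ts_naive = (172_800_000, None)  # Timestamp(ms, None), i.e. 1970-01-03T00:00
print(extract_day(ts_naive))              # 3
print(add_duration(ts_naive, 3_600_000))  # (176400000, None)
```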
> >>>>>
> >>>>> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> I've posted the examples above in
> >>>>>>
> >>>>>
> >>> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> >>>>>> because I think it would be better to collaborate there instead of
> >>> linear
> >>>>>> e-mail history and then bring the consensus back to the list.
> >>>>>>
> >>>>>> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <
> >>> emkornfield@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I feel like we might still be talking past each other here or at
> >>> least
> >>>>> I
> >>>>>>> don't understand the two sides of this.  I'll try to expand
> >>> Weston's
> >>>>>>> example because I think it provides the best clarification.
> >>>>>>>
> >>>>>>> (1970, 1, 2, 14, 0) is stored as 0x000000000A4CB800 (172800000,
> >>>>> assuming
> >>>>>>> ms) for a timestamp column without timezone (always).   This
> >>>>> represents an
> >>>>>>> offset from the unix epoch.  This interpretation should not change
> >>>>> based on
> >>>>>>> the local system timezone.  Extracting the hour field always
> >>> yields 14
> >>>>>>> (extraction is done relative to UTC).
> >>>>>>>
> >>>>>>> The alternative here seems to be that we can encode (1970, 1, 2,
> >>> 14,
> >>>>> 0) in
> >>>>>>> multiple different ways depending on what the current local system
> >>> time
> >>>>>>> is.  As a note, I think ORC and Spark do this, and it leads to
> >>>>>>> confusion/misinterpretation when trying to transfer data.
> >>>>>>>
> >>>>>>> If we then convert this column to a timestamp with a timezone in
> >>> "UTC"
> >>>>>>> timezone extracting the hour field still yields 14.  If the column
> >>> is
> >>>>>>> converted to timestamp with timezone PST, extracting an hour would
> >>>>> yield 6
> >>>>>>> (assume PST = GMT-8).    Through all of these changes the data
> >>> bits do
> >>>>> not
> >>>>>>> change.
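The "extraction is done relative to UTC" rule Micah describes can be checked with the stdlib (a sketch; the stored value here is computed from the fields rather than taken from the mail):

```python
from datetime import datetime, timezone

fields = (1970, 1, 2, 14, 0)

# Encode: interpret the wall-clock fields relative to UTC.
stored = int(datetime(*fields, tzinfo=timezone.utc).timestamp() * 1000)
print(stored)  # 136800000

# Decode: extracting the hour relative to UTC recovers 14 on any machine,
# regardless of the local system timezone.
hour = datetime.fromtimestamp(stored / 1000, tz=timezone.utc).hour
print(hour)  # 14
```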
> >>>>>>>
> >>>>>>> Display is not mentioned because I think the points about how a
> >>> time
> >>>>>>> is displayed are correct. Applications can choose what they feel makes
> >>> sense
> >>>>> to
> >>>>>>> them (as long as they don't start automatically tacking on
> >>> timezones to
> >>>>>>> naive timestamps).  My interpretation of the specification has been that
> >>>>> display
> >>>>>>> was kind of shorthand for field extraction.
> >>>>>>>
> >>>>>>> Could others on the thread confirm this is the issue up for debate?
> >>>>> Are
> >>>>>>> there subtleties/operations we need to consider?
> >>>>>>>
> >>>>>>> I also agree that we should document recommended conversion
> >>> practices
> >>>>> from
> >>>>>>> other systems.
> >>>>>>>
> >>>>>>> -Micah
> >>>>>>>
> >>>>>>>
> >>>>>>> So let's invent a third way.  I could use
> >>>>>>>> the first 16 bits for the year, the next 8 bits for the month, the
> >>>>>>>> next 8 bits for the day of month, the next 8 bits for the hour,
> >>> the
> >>>>>>>> next 8 bits for the minute, and the remaining bits for the
> >>> seconds.
> >>>>>>>> Using this method I would store (1970, 1, 2, 14, 0) as
> >>>>>>>> 0x07B201020E000000.
> >>>>>>>
> >>>>>>> Aside, With some small variation this is what ZetaSql uses [2]
> >>>>>>>
> >>>>>>> [1]
> >>>>>>>
> >>>>>
> >>> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> >>>>>>> [2]
> >>>>>>>
> >>>>>
> >>> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
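The hypothetical bit layout quoted above is easy to verify (a sketch; field widths as described, with the trailing 16 bits left for the seconds field):

```python
def pack_civil(year, month, day, hour, minute, seconds=0):
    # 16 bits year | 8 month | 8 day | 8 hour | 8 minute | 16 seconds
    return ((year << 48) | (month << 40) | (day << 32)
            | (hour << 24) | (minute << 16) | seconds)

packed = pack_civil(1970, 1, 2, 14, 0)
assert packed == 0x07B201020E000000  # matches the value in the mail
```

Integer comparison on such packed values preserves chronological order, but duration arithmetic does not work on them, which is one reason epoch offsets are the more common encoding.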
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmckinn@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> >>>>>>>>> with/without time zone — in some SQL implementations each slot
> >>> can
> >>>>>>>>> have a different time zone
> >>>>>>>>> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >>>>>>>>> WITHOUT TIME ZONE: "timestamp without time zone value should be
> >>>>> taken
> >>>>>>>>> or given as timezone local time"
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> RDBMSs conflict (universally) with ANSI.
> >>>>>>>>
> >>>>>>>> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since
> >>> the
> >>>>> epoch.
> >>>>>>>> It has no timezone.
> >>>>>>>>
> >>>>>>>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> >>>>> Instant
> >>>>>>>> since the epoch. It has no timezone.
> >>>>>>>>
> >>>>>>>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> >>>>> datetime" in
> >>>>>>>> *function*, but not in implementation:
> >>>>>>>>
> >>>>>>>>   - MySQL DATETIME
> >>>>>>>>   <
> >>>>>>>>
> >>>>>
> >>> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> >>>>>>>>>
> >>>>>>>>   is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> >>>>>>>>   - MSSQL
> >>>>>>>>   <
> >>>>>>>>
> >>>>>
> >>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> >>>>>>>>>
> >>>>>>>>   uses 6, 7 or 8 bytes
> >>>>>>>>   - PostgreSQL stores an integer, but I think its epoch is still
> >>>>>>>> different
> >>>>>>>>   <
> >>>>>>>>
> >>>>>
> >>> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> >>>>>>>>>
> >>>>>>>> (it
> >>>>>>>>   used to store doubles since 2000-01-01)
> >>>>>>>>
> >>>>>>>> ... so in general, moving datetimes from these systems into 64-bit
> >>>>>>>> integers
> >>>>>>>> is nontrivial and lossy.
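As one concrete example of the rebasing Adam mentions: PostgreSQL's on-disk timestamps count microseconds from 2000-01-01 UTC, so converting to a Unix-epoch int64 is a fixed shift (a sketch; range/overflow handling omitted):

```python
from datetime import datetime, timezone

# Offset between the PostgreSQL epoch (2000-01-01 UTC) and the Unix epoch.
PG_EPOCH_OFFSET_US = int(
    datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp() * 1_000_000
)

def pg_to_unix_micros(pg_micros: int) -> int:
    return pg_micros + PG_EPOCH_OFFSET_US

print(pg_to_unix_micros(0))  # 946684800000000, i.e. 2000-01-01T00:00:00Z
```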
> >>>>>>>>
> >>>>>>>> Spark / Databricks discusses how Spark handles this
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> >>>>>>>>> * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> >>>>> zone,
> >>>>>>>>> and are wall clock timestamps." — not UTC-normalized
> >>>>>>>>> * WITH TIME ZONE: "does not affect the physical point in time
> >>> that
> >>>>> the
> >>>>>>>>> timestamp represents, as that is fully represented by the UTC
> >>> time
> >>>>>>>>> instant given by the other timestamp components"
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't use Spark, but I read that page twice. First reading, I
> >>> got
> >>>>> the
> >>>>>>>> same thing out of it. But the second time I read it, I read the
> >>>>> opposite!
> >>>>>>>>
> >>>>>>>> The key part is: "*Spark SQL defines the timestamp type as
> >>> TIMESTAMP
> >>>>> WITH
> >>>>>>>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a
> >>> TIMESTAMP
> >>>>>>>> WITH
> >>>>>>>> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one
> >>> Timestamp
> >>>>>>>> type:
> >>>>>>>> a 64-bit Instant since the epoch. (It also has a Date type.)
> >>>>>>>>
> >>>>>>>> If I'm reading correctly, this is exactly the same as PostgreSQL
> >>>>> TIMESTAMP
> >>>>>>>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
> >>> transmit
> >>>>>>>> timestamps as bare 64-bit integers since the epoch -- without
> >>>>> timezone.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >>>>>>>>> interpreted as UTC-normalized, that would force all of these
> >>> other
> >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Are those systems' 64-bit integers interoperable in the first
> >>> place?
> >>>>>>>>
> >>>>>>>> As I understand it, there's a ton of variance out there when
> >>> encoding
> >>>>>>>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> >>>>> encoding
> >>>>>>>> is one of many. As I mentioned in another thread, programming
> >>>>> languages
> >>>>>>>> all
> >>>>>>>> use structs.
> >>>>>>>>
> >>>>>>>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> >>> "naive"
> >>>>>>>>> timestamps and UTC-normalized WITH TIME ZONE.
> >>>>>>>>>
> >>>>>>>>> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >>>>>>>>> interpreted as UTC-normalized, that would force all of these
> >>> other
> >>>>>>>>> systems (and more) to serialize their data to be UTC-normalized
> >>>>> (i.e.
> >>>>>>>>> calling the equivalent of pandas's tz_localize function) when
> >>> they
> >>>>>>>>> convert to Arrow.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Alternatives:
> >>>>>>>>
> >>>>>>>>   - int64
> >>>>>>>>   - date32+time64
> >>>>>>>>   - date32+time32
> >>>>>>>>
> >>>>>>>> This seems very harmful to me, and will make data
> >>>>>>>>> from these systems not accurately representable in Arrow and
> >>> unable
> >>>>> to
> >>>>>>>>> be round-tripped.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Certainly nobody wants to go backwards.
> >>>>>>>>
> >>>>>>>> We need to clarify: how do we store these *common* types -- MySQL
> >>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
> >>> -- in
> >>>>>>>> Arrow?
> >>>>>>>>
> >>>>>>>> Secondarily, I think: how do we recommend users store *datetimes*
> >>> in
> >>>>>>>> Arrow?
> >>>>>>>> (I'd expect this to be messier, since every system/language uses a
> >>>>>>>> different byte structure.)
> >>>>>>>>
> >>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how
> >>> many
> >>>>>>>>> use cases would be disenfranchised by requiring UTC
> >>> normalization
> >>>>>>>>> always.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hear, hear!
> >>>>>>>>
> >>>>>>>> Can we also poll people to find out how they're storing Instants
> >>>>> today?
> >>>>>>>>
> >>>>>>>> Enjoy life,
> >>>>>>>> Adam
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Adam Hooper
> >>>>>>>> +1-514-882-9694
> >>>>>>>> http://adamhooper.com
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>
> >>
> >>
> >> --
> >> Adam Hooper
> >> +1-514-882-9694
> >> http://adamhooper.com
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Julian Hyde <jh...@gmail.com>.
My proposal is that Arrow should support three different kinds of date-times: zoneless, zoned, and instant. (Not necessarily with those names.)

All three kinds occur frequently in the industry.

Many systems only have two, and users of those systems have figured out how to make do. (For example, you can implement an instant using a zoneless, and vice versa, if you are careful.) But let’s suppose that Arrow has two of the three, and needs to interoperate with a system that has a different two of the three. Chaos ensues.

Let’s just implement all three.

Julian
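Julian's three kinds can be modeled directly (names and representations here are illustrative, not a proposal for the Arrow schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Zoneless:
    dt: datetime           # naive wall-clock fields; dt.tzinfo must be None

@dataclass
class Instant:
    epoch_ms: int          # a physical point in time, no zone at all

@dataclass
class Zoned:
    epoch_ms: int          # an instant plus a zone for display/extraction
    utc_offset: timedelta

def instant_to_zoned(inst: Instant, offset_hours: int) -> Zoned:
    # Attaching a zone never changes the physical point in time.
    return Zoned(inst.epoch_ms, timedelta(hours=offset_hours))

def zoned_hour(z: Zoned) -> int:
    return datetime.fromtimestamp(z.epoch_ms / 1000,
                                  tz=timezone(z.utc_offset)).hour

# Zoneless -> Instant is the lossy direction: it needs an assumed zone.
print(zoned_hour(instant_to_zoned(Instant(0), 1)))  # 1
```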




>>>>> to
>>>>>>>>> be round-tripped.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Certainly nobody wants to go backwards.
>>>>>>>> 
>>>>>>>> We need to clarify: how do we store these *common* types -- MySQL
>>>>>>>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
>>> -- in
>>>>>>>> Arrow?
>>>>>>>> 
>>>>>>>> Secondarily, I think: how do we recommend users store *datetimes*
>>> in
>>>>>>>> Arrow?
>>>>>>>> (I'd expect this to be messier, since every system/language uses a
>>>>>>>> different byte structure.)
>>>>>>>> 
>>>>>>>> Perhaps we can make a spreadsheet and look comprehensively at how
>>> many
>>>>>>>>> use cases would be disenfranchised by requiring UTC
>>> normalization
>>>>>>>>> always.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hear, hear!
>>>>>>>> 
>>>>>>>> Can we also poll people to find out how they're storing Instants
>>>>> today?
>>>>>>>> 
>>>>>>>> Enjoy life,
>>>>>>>> Adam
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Adam Hooper
>>>>>>>> +1-514-882-9694
>>>>>>>> http://adamhooper.com
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>> 
>> --
>> Adam Hooper
>> +1-514-882-9694
>> http://adamhooper.com


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
Let's see a streamlined document about what we are voting on — I have
limited bandwidth to read through and synthesize the discussion myself
and I am probably not the only one. It has always been my
understanding that time elapsed from the UNIX epoch
(1970-01-01 00:00:00 UTC) is represented as a timestamp with tz=UTC
(which I believe is what is being called an "instant").


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
Maybe Arrow should add a new type, "INSTANT"?

Instant (seconds since the epoch) is the gold standard in storing moments
in time. All programming languages; all RDBMSs; Parquet ... everybody uses
this.

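That definition of an Instant can be sketched with only the Python standard library (the values are illustrative):

```python
from datetime import datetime, timezone

# An "Instant": an integer count of seconds since 1970-01-01 00:00:00 UTC,
# with no timezone attached to the stored value itself.
epoch_seconds = 86400 + 14 * 3600  # 1970-01-02 14:00:00 UTC

# Interpreting the count as UTC recovers the moment in time.
instant = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
```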
I use Instants, too. I interpreted TIMESTAMP with no metadata to mean
Instant because I read the docs *assuming* Arrow stores Instant.

I know, I know, no vote can stop me from interpreting "timestamp without
timezone" however the heck I want. But it's painful for me to transition
from happy user to heretic.

Voting to clarify that Arrow doesn't store Instants is voting to clarify
that Arrow *doesn't* do something extremely useful. It's voting for a
negative. That sounds painful! What if there were positives to vote for? An
"INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)?
A fiat that timezone=UTC means Instant, not ZonedDateTime?

Enjoy life,
Adam

> > >
> > > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <
> emkornfield@gmail.com>
> > > wrote:
> > > >
> > > > I've posted the examples above in
> > > >
> > >
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > > because I think it would be better to collaborate there instead of
> linear
> > > > e-mail history and then bring the consensus back to the list.
> > > >
> > > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <
> emkornfield@gmail.com>
> > > > wrote:
> > > >
> > > > > I feel like we might still be talking past each other here or at
> least
> > > I
> > > > > don't understand the two sides of this.  I'll try to expand
> Weston's
> > > > > example because I think it provides the best clarification.
> > > > >
> > > > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > assuming
> > > > > ms) for a timestamp column without timezone (always).   This
> > > represents an
> > > > > offset from the unix epoch.  This interpretation should not change
> > > based on
> > > > > the local system timezone.  Extracting the hour field always
> yields 14
> > > > > (extraction is done relative to UTC).
> > > > >
> > > > > The alternative here seems to be that we can encode (1970, 1, 2,
> 14,
> > > 0) in
> > > > > multiple different ways depending on what the current local system
> time
> > > > > is.  As a note, I think ORC and Spark do this, and it leads to
> > > > > confusion/misinterpretation when trying to transfer data.
> > > > >
> > > > > If we then convert this column to a timestamp with timezone "UTC",
> > > > > extracting the hour field still yields 14.  If the column is
> > > > > converted to a timestamp with timezone PST, extracting the hour
> > > > > yields 6 (assuming PST = GMT-8).  Through all of these changes the
> > > > > data bits do not change.
> > > > >
> > > > > Display is not mentioned because I think the points made about how
> > > > > a time is displayed are correct. Applications can choose what they
> > > > > feel makes sense to them (as long as they don't start automatically
> > > > > tacking on timezones to naive timestamps).  My interpretation of the
> > > > > specification has been that display was kind of shorthand for field
> > > > > extraction.
> > > > >
> > > > > Could others on the thread confirm this is the issue up for debate?
> > > Are
> > > > > there subtleties/operations we need to consider?
> > > > >
> > > > > I also agree that we should document recommended conversion
> practices
> > > from
> > > > > other systems.
> > > > >
> > > > > -Micah
> > > > >
> > > > >
> > > > >  So let's invent a third way.  I could use
> > > > >> the first 16 bits for the year, the next 8 bits for the month, the
> > > > >> next 8 bits for the day of month, the next 8 bits for the hour,
> the
> > > > >> next 8 bits for the minute, and the remaining bits for the
> seconds.
> > > > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > > > >> 0x07B201020E000000.
> > > > >
> > > > > As an aside, with some small variation this is what ZetaSql uses [2]
> > > > >
> > > > > [1]
> > > > >
> > >
> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > > > [2]
> > > > >
> > >
> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > > > >
> > > > >
> > > > >
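The bit-packing scheme quoted in Micah's aside can be sketched directly; the layout and the expected constant are taken from the email, everything else is illustrative:

```python
def pack_civil(year, month, day, hour, minute, second=0):
    # 16-bit year | 8-bit month | 8-bit day | 8-bit hour |
    # 8-bit minute | 16 remaining bits for the seconds.
    return (
        (year << 48) | (month << 40) | (day << 32)
        | (hour << 24) | (minute << 16) | second
    )

packed = pack_civil(1970, 1, 2, 14, 0)
```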
> > > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com>
> > > wrote:
> > > > >
> > > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <wesmckinn@gmail.com
> >
> > > wrote:
> > > > >>
> > > > >> >
> > > > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > > > >> > with/without time zone — in some SQL implementations each slot
> can
> > > > >> > have a different time zone
> > > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> > > taken
> > > > >> > or given as timezone local time"
> > > > >> >
> > > > >>
> > > > >> RDBMSs conflict (universally) with ANSI.
> > > > >>
> > > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since
> the
> > > epoch.
> > > > >> It has no timezone.
> > > > >>
> > > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> > > Instant
> > > > >> since the epoch. It has no timezone.
> > > > >>
> > > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> > > datetime" in
> > > > >> *function*, but not in implementation:
> > > > >>
> > > > >>    - MySQL DATETIME
> > > > >>    <
> > > > >>
> > >
> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> > > > >> >
> > > > >>    is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> > > > >>    - MSSQL
> > > > >>    <
> > > > >>
> > >
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> > > > >> >
> > > > >>    uses 6, 7 or 8 bytes
> > > > >>    - PostgreSQL stores an integer, but I think its epoch is still
> > > > >> different
> > > > >>    <
> > > > >>
> > >
> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> > > > >> >
> > > > >> (it
> > > > >>    used to store doubles since 2000-01-01)
> > > > >>
> > > > >> ... so in general, moving datetimes from these systems into 64-bit
> > > > >> integers
> > > > >> is nontrivial and lossy.
> > > > >>
> > > > >> Spark / Databricks discusses how Spark handles this
> > > > >> >
> > > > >> >
> > > > >>
> > >
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> > > zone,
> > > > >> > and are wall clock timestamps." — not UTC-normalized
> > > > >> > * WITH TIME ZONE: "does not affect the physical point in time
> that
> > > the
> > > > >> > timestamp represents, as that is fully represented by the UTC
> time
> > > > >> > instant given by the other timestamp components"
> > > > >> >
> > > > >>
> > > > >> I don't use Spark, but I read that page twice. First reading, I
> got
> > > the
> > > > >> same thing out of it. But the second time I read it, I read the
> > > opposite!
> > > > >>
> > > > >> The key part is: "*Spark SQL defines the timestamp type as
> TIMESTAMP
> > > WITH
> > > > >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a
> TIMESTAMP
> > > > >> WITH
> > > > >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one
> Timestamp
> > > > >> type:
> > > > >> a 64-bit Instant since the epoch. (It also has a Date type.)
> > > > >>
> > > > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> > > TIMESTAMP
> > > > >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and
> transmit
> > > > >> timestamps as bare 64-bit integers since the epoch -- without
> > > timezone.
> > > > >>
> > > > >>
> > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > > >> > interpreted as UTC-normalized, that would force all of these
> other
> > > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > > >>
> > > > >>
> > > > >> Are those systems' 64-bit integers interoperable in the first
> place?
> > > > >>
> > > > >> As I understand it, there's a ton of variance out there when
> encoding
> > > > >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> > > encoding
> > > > >> is one of many. As I mentioned in another thread, programming
> > > languages
> > > > >> all
> > > > >> use structs.
> > > > >>
> > > > >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE
> "naive"
> > > > >> > timestamps and UTC-normalized WITH TIME ZONE.
> > > > >> >
> > > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > > >> > interpreted as UTC-normalized, that would force all of these
> other
> > > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > (i.e.
> > > > >> > calling the equivalent of pandas's tz_localize function) when
> they
> > > > >> > convert to Arrow.
> > > > >>
> > > > >>
> > > > >> Alternatives:
> > > > >>
> > > > >>    - int64
> > > > >>    - date32+time64
> > > > >>    - date32+time32
> > > > >>
> > > > >> This seems very harmful to me, and will make data
> > > > >> > from these systems not accurately representable in Arrow and
> unable
> > > to
> > > > >> > be round-tripped.
> > > > >> >
> > > > >>
> > > > >> Certainly nobody wants to go backwards.
> > > > >>
> > > > >> We need to clarify: how do we store these *common* types -- MySQL
> > > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE
> -- in
> > > > >> Arrow?
> > > > >>
> > > > >> Secondarily, I think: how do we recommend users store *datetimes*
> in
> > > > >> Arrow?
> > > > >> (I'd expect this to be messier, since every system/language uses a
> > > > >> different byte structure.)
> > > > >>
> > > > >> Perhaps we can make a spreadsheet and look comprehensively at how
> many
> > > > >> > use cases would be disenfranchised by requiring UTC
> normalization
> > > > >> > always.
> > > > >>
> > > > >>
> > > > >> Hear, hear!
> > > > >>
> > > > >> Can we also poll people to find out how they're storing Instants
> > > today?
> > > > >>
> > > > >> Enjoy life,
> > > > >> Adam
> > > > >>
> > > > >> --
> > > > >> Adam Hooper
> > > > >> +1-514-882-9694
> > > > >> http://adamhooper.com
> > > > >>
> > > > >
> > >
>


-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
I agree that a vote would be a good idea.  Do you want to start a
dedicated vote thread?  I can write one up too if you'd rather.

-Weston

On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I think comments on the doc are tailing off.  Jorge's test cases I think
> still need some more careful analysis but Weston has provided an
> initial pass.
>
> > The matter not resolved in the document is whether Timestamp without
> > timezone logically represents a multi-field date and time (one that does
> > not represent a specific instant) or whether it logically represents an
> > instant (some measurable offset from an epoch).  Based on comments on the
> > documentation, both the C++/Python and Java implementations (those that
> > have generally been considered "reference") show evidence that the
> > former representation is what is intended (some links are in the document).
> >
> > We can probably continue to debate what is useful, but it seems ultimately
> > we need to pick one or the other and clarify the specification.  Given how
> > the reference implementations currently work, I think we should err on the
> > side of interpreting these values as date times.  Ultimately, given the
> > contention here, we will likely need to vote on this.
>
> More comments on the document or here are still useful in case we've missed
> an interpretation or there are other facts to consider.
>
> Cheers,
> Micah
>
> On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
> jorgecarleitao@gmail.com> wrote:
>
> > Thank you everyone for participating so far; really important and
> > useful discussion.
> >
> > I think of this discussion as a set of test cases over behavior:
> >
> > parameterization:
> > * Timestamp(ms, None)
> > * Timestamp(ms, "00:00")
> > * Timestamp(ms, "01:00")
> >
> > Cases:
> > * its string representation equals to
> > * add a duration equals to
> > * add an interval equals to
> > * subtract a Timestamp(ms, None) equals to
> > * subtract a Timestamp(ms, "01:00") equals to
> > * subtract a Date32 equals to
> > * subtract a Time32(ms) equals to
> > * extract the day equals to
> > * extract the timezone equals to
> > * cast to Timestamp(ms, None) equals to
> > * cast to Timestamp(ms, "01:00") equals to
> > * write to parquet v2 equals to (physical value and logical type)
> >
> > In all cases, the result may either be valid or invalid. If valid, we
> > would need a datatype and an actual value.
> > I was hoping to be able to answer each of the above at the end of this
> > discussion.
> >
> > I've suggested adding these in the google docs.
> >
> > Best,
> > Jorge
> >
> > On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > I've posted the examples above in
> > >
> > https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > > because I think it would be better to collaborate there instead of linear
> > > e-mail history and then bring the consensus back to the list.
> > >
> > > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > >
> > > > I feel like we might still be talking past each other here or at least
> > I
> > > > don't understand the two sides of this.  I'll try to expand Weston's
> > > > example because I think it provides the best clarification.
> > > >
> > > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > > assuming ms) for a timestamp column without timezone (always).  This
> > > > represents an offset from the unix epoch.  This interpretation should
> > > > not change based on the local system timezone.  Extracting the hour
> > > > field always yields 14 (extraction is done relative to UTC).
> > > >
> > > > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0)
> > > > in multiple different ways depending on what the current local system
> > > > timezone is.  As a note, I think ORC and Spark do this, and it leads to
> > > > confusion/misinterpretation when trying to transfer data.
> > > >
> > > > If we then convert this column to a timestamp with timezone "UTC",
> > > > extracting the hour field still yields 14.  If the column is converted
> > > > to a timestamp with timezone PST, extracting the hour would yield 6
> > > > (assuming PST = GMT-8).  Through all of these changes the data bits do
> > > > not change.
> > > >
> > > > Display is not mentioned because I think the earlier points about how a
> > > > time is displayed are correct.  Applications can choose what they feel
> > > > makes sense to them (as long as they don't start automatically tacking
> > > > on timezones to naive timestamps).  My interpretation of the
> > > > specification has been that "display" was a kind of shorthand for field
> > > > extraction.
> > > >
> > > > Could others on the thread confirm this is the issue up for debate?
> > Are
> > > > there subtleties/operations we need to consider?
> > > >
> > > > I also agree that we should document recommended conversion practices
> > from
> > > > other systems.
> > > >
> > > > -Micah
> > > >
> > > >
> > > >  So let's invent a third way.  I could use
> > > >> the first 16 bits for the year, the next 8 bits for the month, the
> > > >> next 8 bits for the day of month, the next 8 bits for the hour, the
> > > >> next 8 bits for the minute, and the remaining bits for the seconds.
> > > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > > >> 0x07B201020E000000.
> > > >
> > > > As an aside, with some small variation this is what ZetaSql uses [2]
> > > >
> > > > [1]
> > > >
> > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > > [2]
> > > >
> > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > > >
> > > >
> > > >
> > > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com>
> > wrote:
> > > >
> > > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >>
> > > >> >
> > > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > > >> > with/without time zone — in some SQL implementations each slot can
> > > >> > have a different time zone
> > > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> > taken
> > > >> > or given as timezone local time"
> > > >> >
> > > >>
> > > >> RDBMSs conflict (universally) with ANSI.
> > > >>
> > > >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the
> > epoch.
> > > >> It has no timezone.
> > > >>
> > > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> > Instant
> > > >> since the epoch. It has no timezone.
> > > >>
> > > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> > datetime" in
> > > >> *function*, but not in implementation:
> > > >>
> > > >>    - MySQL DATETIME
> > > >>    <
> > > >>
> > https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> > > >> >
> > > >>    is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> > > >>    - MSSQL
> > > >>    <
> > > >>
> > https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> > > >> >
> > > >>    uses 6, 7 or 8 bytes
> > > >>    - PostgreSQL stores an integer, but I think its epoch is still
> > > >> different
> > > >>    <
> > > >>
> > https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> > > >> >
> > > >> (it
> > > >>    used to store doubles since 2000-01-01)
> > > >>
> > > >> ... so in general, moving datetimes from these systems into 64-bit
> > > >> integers
> > > >> is nontrivial and lossy.
> > > >>
> > > >> Spark / Databricks discusses how Spark handles this
> > > >> >
> > > >> >
> > > >>
> > https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> > zone,
> > > >> > and are wall clock timestamps." — not UTC-normalized
> > > >> > * WITH TIME ZONE: "does not affect the physical point in time that
> > the
> > > >> > timestamp represents, as that is fully represented by the UTC time
> > > >> > instant given by the other timestamp components"
> > > >> >
> > > >>
> > > >> I don't use Spark, but I read that page twice. First reading, I got
> > the
> > > >> same thing out of it. But the second time I read it, I read the
> > opposite!
> > > >>
> > > >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP
> > WITH
> > > >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP
> > > >> WITH
> > > >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
> > > >> type:
> > > >> a 64-bit Instant since the epoch. (It also has a Date type.)
> > > >>
> > > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> > TIMESTAMP
> > > >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> > > >> timestamps as bare 64-bit integers since the epoch -- without
> > timezone.
> > > >>
> > > >>
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > > >>
> > > >>
> > > >> Are those systems' 64-bit integers interoperable in the first place?
> > > >>
> > > >> As I understand it, there's a ton of variance out there when encoding
> > > >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> > encoding
> > > >> is one of many. As I mentioned in another thread, programming
> > languages
> > > >> all
> > > >> use structs.
> > > >>
> > > >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > > >> > timestamps and UTC-normalized WITH TIME ZONE.
> > > >> >
> > > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > > >> > interpreted as UTC-normalized, that would force all of these other
> > > >> > systems (and more) to serialize their data to be UTC-normalized
> > (i.e.
> > > >> > calling the equivalent of pandas's tz_localize function) when they
> > > >> > convert to Arrow.
> > > >>
> > > >>
> > > >> Alternatives:
> > > >>
> > > >>    - int64
> > > >>    - date32+time64
> > > >>    - date32+time32
> > > >>
> > > >> This seems very harmful to me, and will make data
> > > >> > from these systems not accurately representable in Arrow and unable
> > to
> > > >> > be round-tripped.
> > > >> >
> > > >>
> > > >> Certainly nobody wants to go backwards.
> > > >>
> > > >> We need to clarify: how do we store these *common* types -- MySQL
> > > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> > > >> Arrow?
> > > >>
> > > >> Secondarily, I think: how do we recommend users store *datetimes* in
> > > >> Arrow?
> > > >> (I'd expect this to be messier, since every system/language uses a
> > > >> different byte structure.)
> > > >>
> > > >> Perhaps we can make a spreadsheet and look comprehensively at how many
> > > >> > use cases would be disenfranchised by requiring UTC normalization
> > > >> > always.
> > > >>
> > > >>
> > > >> Hear, hear!
> > > >>
> > > >> Can we also poll people to find out how they're storing Instants
> > today?
> > > >>
> > > >> Enjoy life,
> > > >> Adam
> > > >>
> > > >> --
> > > >> Adam Hooper
> > > >> +1-514-882-9694
> > > >> http://adamhooper.com
> > > >>
> > > >
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
I think comments on the doc are tailing off.  Jorge's test cases I think
still need some more careful analysis but Weston has provided an
initial pass.

The matter not resolved in the document is whether Timestamp without
timezone logically represents a multi-field date and time (which does not
denote a specific instant) or whether it logically represents an instant
(some measurable offset from an epoch).  Based on comments on the document,
both the C++/Python implementations and the Java implementations (those
that have generally been considered "reference") show evidence that the
former representation is what was intended (some links are in the document).
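To make the two candidate interpretations concrete, here is a small sketch
using only Python's standard library (my own illustration of the two
readings, not a statement of what any Arrow implementation does): the same
int64 payload read as an instant versus as a zone-less date-time.

```python
from datetime import datetime, timedelta, timezone

raw_ms = 136_800_000  # int64 payload of a Timestamp(ms, tz=None) slot

# Reading 1: an instant -- a measurable offset from the Unix epoch.
instant = datetime.fromtimestamp(raw_ms / 1000, tz=timezone.utc)
assert instant == datetime(1970, 1, 2, 14, 0, tzinfo=timezone.utc)

# Reading 2: a multi-field date and time with no attached zone.  Field
# extraction decodes the payload relative to UTC, but the value does not
# claim to identify any specific moment on the global timeline.
wall = datetime(1970, 1, 1) + timedelta(milliseconds=raw_ms)
assert wall.tzinfo is None
assert (wall.day, wall.hour) == (2, 14)
```

Under either reading the stored bits are identical; the disagreement is only
about what field extraction and conversion are allowed to assume.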

We can probably continue to debate what is useful, but it seems ultimately
we need to pick one or the other and clarify the specification.  Given how
the reference implementations currently work, I think we should err on the
side of interpreting these values as date-times.  Ultimately, given the
contention here, we will likely need to vote on this.

More comments on the document or here are still useful in case we've missed
an interpretation or there are other facts to consider.

Cheers,
Micah

On Thu, Jun 17, 2021 at 9:59 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Thank you everyone for participating so far; really important and
> useful discussion.
>
> I think of this discussion as a set of test cases over behavior:
>
> parameterization:
> * Timestamp(ms, None)
> * Timestamp(ms, "00:00")
> * Timestamp(ms, "01:00")
>
> Cases:
> * its string representation equals to
> * add a duration equals to
> * add an interval equals to
> * subtract a Timestamp(ms, None) equals to
> * subtract a Timestamp(ms, "01:00") equals to
> * subtract a Date32 equals to
> * subtract a Time32(ms) equals to
> * extract the day equals to
> * extract the timezone equals to
> * cast to Timestamp(ms, None) equals to
> * cast to Timestamp(ms, "01:00") equals to
> * write to parquet v2 equals to (physical value and logical type)
>
> In all cases, the result may either be valid or invalid. If valid, we
> would need a datatype and an actual value.
> I was hoping to be able to answer each of the above at the end of this
> discussion.
>
> I've suggested adding these in the google docs.
>
> Best,
> Jorge
>
> On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > I've posted the examples above in
> >
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> > because I think it would be better to collaborate there instead of linear
> > e-mail history and then bring the consensus back to the list.
> >
> > On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > I feel like we might still be talking past each other here or at least
> I
> > > don't understand the two sides of this.  I'll try to expand Weston's
> > > example because I think it provides the best clarification.
> > >
> > > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000,
> > > assuming ms) for a timestamp column without timezone (always).  This
> > > represents an offset from the unix epoch.  This interpretation should
> > > not change based on the local system timezone.  Extracting the hour
> > > field always yields 14 (extraction is done relative to UTC).
> > >
> > > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0)
> > > in multiple different ways depending on what the current local system
> > > timezone is.  As a note, I think ORC and Spark do this, and it leads to
> > > confusion/misinterpretation when trying to transfer data.
> > >
> > > If we then convert this column to a timestamp with timezone "UTC",
> > > extracting the hour field still yields 14.  If the column is converted
> > > to a timestamp with timezone PST, extracting the hour would yield 6
> > > (assuming PST = GMT-8).  Through all of these changes the data bits do
> > > not change.
> > >
> > > Display is not mentioned because I think the earlier points about how a
> > > time is displayed are correct.  Applications can choose what they feel
> > > makes sense to them (as long as they don't start automatically tacking
> > > on timezones to naive timestamps).  My interpretation of the
> > > specification has been that "display" was a kind of shorthand for field
> > > extraction.
> > >
> > > Could others on the thread confirm this is the issue up for debate?
> Are
> > > there subtleties/operations we need to consider?
> > >
> > > I also agree that we should document recommended conversion practices
> from
> > > other systems.
> > >
> > > -Micah
> > >
> > >
> > >  So let's invent a third way.  I could use
> > >> the first 16 bits for the year, the next 8 bits for the month, the
> > >> next 8 bits for the day of month, the next 8 bits for the hour, the
> > >> next 8 bits for the minute, and the remaining bits for the seconds.
> > >> Using this method I would store (1970, 1, 2, 14, 0) as
> > >> 0x07B201020E000000.
> > >
> > > As an aside, with some small variation this is what ZetaSql uses [2]
> > >
> > > [1]
> > >
> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > > [2]
> > >
> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> > >
> > >
> > >
> > > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com>
> wrote:
> > >
> > >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >>
> > >> >
> > >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > >> > with/without time zone — in some SQL implementations each slot can
> > >> > have a different time zone
> > >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > >> > WITHOUT TIME ZONE: "timestamp without time zone value should be
> taken
> > >> > or given as timezone local time"
> > >> >
> > >>
> > >> RDBMSs conflict (universally) with ANSI.
> > >>
> > >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the
> epoch.
> > >> It has no timezone.
> > >>
> > >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int
> Instant
> > >> since the epoch. It has no timezone.
> > >>
> > >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive
> datetime" in
> > >> *function*, but not in implementation:
> > >>
> > >>    - MySQL DATETIME
> > >>    <
> > >>
> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> > >> >
> > >>    is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> > >>    - MSSQL
> > >>    <
> > >>
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> > >> >
> > >>    uses 6, 7 or 8 bytes
> > >>    - PostgreSQL stores an integer, but I think its epoch is still
> > >> different
> > >>    <
> > >>
> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> > >> >
> > >> (it
> > >>    used to store doubles since 2000-01-01)
> > >>
> > >> ... so in general, moving datetimes from these systems into 64-bit
> > >> integers
> > >> is nontrivial and lossy.
> > >>
> > >> Spark / Databricks discusses how Spark handles this
> > >> >
> > >> >
> > >>
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time
> zone,
> > >> > and are wall clock timestamps." — not UTC-normalized
> > >> > * WITH TIME ZONE: "does not affect the physical point in time that
> the
> > >> > timestamp represents, as that is fully represented by the UTC time
> > >> > instant given by the other timestamp components"
> > >> >
> > >>
> > >> I don't use Spark, but I read that page twice. First reading, I got
> the
> > >> same thing out of it. But the second time I read it, I read the
> opposite!
> > >>
> > >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP
> WITH
> > >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP
> > >> WITH
> > >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
> > >> type:
> > >> a 64-bit Instant since the epoch. (It also has a Date type.)
> > >>
> > >> If I'm reading correctly, this is exactly the same as PostgreSQL
> TIMESTAMP
> > >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> > >> timestamps as bare 64-bit integers since the epoch -- without
> timezone.
> > >>
> > >>
> > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > >> > interpreted as UTC-normalized, that would force all of these other
> > >> > systems (and more) to serialize their data to be UTC-normalized
> > >>
> > >>
> > >> Are those systems' 64-bit integers interoperable in the first place?
> > >>
> > >> As I understand it, there's a ton of variance out there when encoding
> > >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres
> encoding
> > >> is one of many. As I mentioned in another thread, programming
> languages
> > >> all
> > >> use structs.
> > >>
> > >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > >> > timestamps and UTC-normalized WITH TIME ZONE.
> > >> >
> > >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > >> > interpreted as UTC-normalized, that would force all of these other
> > >> > systems (and more) to serialize their data to be UTC-normalized
> (i.e.
> > >> > calling the equivalent of pandas's tz_localize function) when they
> > >> > convert to Arrow.
> > >>
> > >>
> > >> Alternatives:
> > >>
> > >>    - int64
> > >>    - date32+time64
> > >>    - date32+time32
> > >>
> > >> This seems very harmful to me, and will make data
> > >> > from these systems not accurately representable in Arrow and unable
> to
> > >> > be round-tripped.
> > >> >
> > >>
> > >> Certainly nobody wants to go backwards.
> > >>
> > >> We need to clarify: how do we store these *common* types -- MySQL
> > >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> > >> Arrow?
> > >>
> > >> Secondarily, I think: how do we recommend users store *datetimes* in
> > >> Arrow?
> > >> (I'd expect this to be messier, since every system/language uses a
> > >> different byte structure.)
> > >>
> > >> Perhaps we can make a spreadsheet and look comprehensively at how many
> > >> > use cases would be disenfranchised by requiring UTC normalization
> > >> > always.
> > >>
> > >>
> > >> Hear, hear!
> > >>
> > >> Can we also poll people to find out how they're storing Instants
> today?
> > >>
> > >> Enjoy life,
> > >> Adam
> > >>
> > >> --
> > >> Adam Hooper
> > >> +1-514-882-9694
> > >> http://adamhooper.com
> > >>
> > >
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thank you everyone for participating so far; really important and
useful discussion.

I think of this discussion as a set of test cases over behavior:

parameterization:
* Timestamp(ms, None)
* Timestamp(ms, "00:00")
* Timestamp(ms, "01:00")

Cases:
* its string representation equals to
* add a duration equals to
* add an interval equals to
* subtract a Timestamp(ms, None) equals to
* subtract a Timestamp(ms, "01:00") equals to
* subtract a Date32 equals to
* subtract a Time32(ms) equals to
* extract the day equals to
* extract the timezone equals to
* cast to Timestamp(ms, None) equals to
* cast to Timestamp(ms, "01:00") equals to
* write to parquet v2 equals to (physical value and logical type)

In all cases, the result may either be valid or invalid. If valid, we
would need a datatype and an actual value.
I was hoping to be able to answer each of the above at the end of this
discussion.
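One way to make the parameterization concrete is with Python's standard
library (my own illustration; mapping Timestamp(ms, tz) onto datetime
objects this way is an assumption for the sketch, not the Arrow spec):

```python
from datetime import datetime, timedelta, timezone

ms = 136_800_000  # the same 64-bit payload under each parameterization

naive = datetime(1970, 1, 1) + timedelta(milliseconds=ms)  # Timestamp(ms, None)
utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)   # Timestamp(ms, "00:00")
plus1 = utc.astimezone(timezone(timedelta(hours=1)))       # Timestamp(ms, "01:00")

# "extract the day" / "extract the hour" depend only on the timezone:
assert (naive.day, naive.hour) == (2, 14)
assert (utc.day, utc.hour) == (2, 14)
assert (plus1.day, plus1.hour) == (2, 15)

# Subtracting two zone-aware values is well defined; they name the same
# instant even though they display differently:
assert plus1 - utc == timedelta(0)
```

Writing down the expected datatype and value for each case in this style
would pin down exactly the semantics the specification needs to choose.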

I've suggested adding these in the google docs.

Best,
Jorge

On Fri, Jun 18, 2021 at 12:15 AM Micah Kornfield <em...@gmail.com> wrote:
>
> I've posted the examples above in
> https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
> because I think it would be better to collaborate there instead of linear
> e-mail history and then bring the consensus back to the list.
>
> On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > I feel like we might still be talking past each other here or at least I
> > don't understand the two sides of this.  I'll try to expand Weston's
> > example because I think it provides the best clarification.
> >
> > (1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming
> > ms) for a timestamp column without timezone (always).  This represents an
> > offset from the unix epoch.  This interpretation should not change based
> > on the local system timezone.  Extracting the hour field always yields 14
> > (extraction is done relative to UTC).
> >
> > The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> > multiple different ways depending on what the current local system
> > timezone is.  As a note, I think ORC and Spark do this, and it leads to
> > confusion/misinterpretation when trying to transfer data.
> >
> > If we then convert this column to a timestamp with timezone "UTC",
> > extracting the hour field still yields 14.  If the column is converted to
> > a timestamp with timezone PST, extracting the hour would yield 6 (assuming
> > PST = GMT-8).  Through all of these changes the data bits do not change.
> >
> > Display is not mentioned because I think the earlier points about how a
> > time is displayed are correct.  Applications can choose what they feel
> > makes sense to them (as long as they don't start automatically tacking on
> > timezones to naive timestamps).  My interpretation of the specification
> > has been that "display" was a kind of shorthand for field extraction.
> >
> > Could others on the thread confirm this is the issue up for debate?  Are
> > there subtleties/operations we need to consider?
> >
> > I also agree that we should document recommended conversion practices from
> > other systems.
> >
> > -Micah
> >
> >
> >  So let's invent a third way.  I could use
> >> the first 16 bits for the year, the next 8 bits for the month, the
> >> next 8 bits for the day of month, the next 8 bits for the hour, the
> >> next 8 bits for the minute, and the remaining bits for the seconds.
> >> Using this method I would store (1970, 1, 2, 14, 0) as
> >> 0x07B201020E000000.
> >
> > As an aside, with some small variation this is what ZetaSql uses [2]
> >
> > [1]
> > https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> > [2]
> > https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
> >
> >
> >
> > On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com> wrote:
> >
> >> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> >
> >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> >> > with/without time zone — in some SQL implementations each slot can
> >> > have a different time zone
> >> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> >> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> >> > or given as timezone local time"
> >> >
> >>
> >> RDBMSs conflict (universally) with ANSI.
> >>
> >> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the epoch.
> >> It has no timezone.
> >>
> >> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> >> since the epoch. It has no timezone.
> >>
> >> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> >> *function*, but not in implementation:
> >>
> >>    - MySQL DATETIME
> >>    <
> >> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> >> >
> >>    is weird: 1-bit sign, 17-bit month, 5-bit day, ....
> >>    - MSSQL
> >>    <
> >> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> >> >
> >>    uses 6, 7 or 8 bytes
> >>    - PostgreSQL stores an integer, but I think its epoch is still
> >> different
> >>    <
> >> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> >> >
> >> (it
> >>    used to store doubles since 2000-01-01)
> >>
> >> ... so in general, moving datetimes from these systems into 64-bit
> >> integers
> >> is nontrivial and lossy.
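As one example of the epoch mismatch: PostgreSQL stores its timestamps as
int64 microseconds since 2000-01-01, so even the all-integer case needs an
epoch shift on the way into Arrow (a minimal sketch; the offset is the
1970-to-2000 gap of 946,684,800 seconds):

```python
# PostgreSQL's timestamp epoch is 2000-01-01; Unix's is 1970-01-01.
# The gap is 30 years = 10,957 days = 946,684,800 seconds.
PG_EPOCH_SHIFT_US = 946_684_800 * 1_000_000

def pg_timestamp_to_unix_us(pg_us: int) -> int:
    """Convert a raw PostgreSQL timestamp (microseconds since
    2000-01-01) to microseconds since the Unix epoch."""
    return pg_us + PG_EPOCH_SHIFT_US

# 0 in PostgreSQL's encoding is 2000-01-01T00:00:00.
assert pg_timestamp_to_unix_us(0) == 946_684_800_000_000
```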
> >>
> >> Spark / Databricks discusses how Spark handles this
> >> >
> >> >
> >> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> >> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> >> > and are wall clock timestamps." — not UTC-normalized
> >> > * WITH TIME ZONE: "does not affect the physical point in time that the
> >> > timestamp represents, as that is fully represented by the UTC time
> >> > instant given by the other timestamp components"
> >> >
> >>
> >> I don't use Spark, but I read that page twice. First reading, I got the
> >> same thing out of it. But the second time I read it, I read the opposite!
> >>
> >> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
> >> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP
> >> WITH
> >> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
> >> type:
> >> a 64-bit Instant since the epoch. (It also has a Date type.)
> >>
> >> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
> >> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> >> timestamps as bare 64-bit integers since the epoch -- without timezone.
> >>
> >>
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized
> >>
> >>
> >> Are those systems' 64-bit integers interoperable in the first place?
> >>
> >> As I understand it, there's a ton of variance out there when encoding
> >> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
> >> is one of many. As I mentioned in another thread, programming languages
> >> all
> >> use structs.
> >>
> >> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> >> > timestamps and UTC-normalized WITH TIME ZONE.
> >> >
> >> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> >> > interpreted as UTC-normalized, that would force all of these other
> >> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> >> > calling the equivalent of pandas's tz_localize function) when they
> >> > convert to Arrow.
> >>
> >>
> >> Alternatives:
> >>
> >>    - int64
> >>    - date32+time64
> >>    - date32+time32
> >>
> >> This seems very harmful to me, and will make data
> >> > from these systems not accurately representable in Arrow and unable to
> >> > be round-tripped.
> >> >
> >>
> >> Certainly nobody wants to go backwards.
> >>
> >> We need to clarify: how do we store these *common* types -- MySQL
> >> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> >> Arrow?
> >>
> >> Secondarily, I think: how do we recommend users store *datetimes* in
> >> Arrow?
> >> (I'd expect this to be messier, since every system/language uses a
> >> different byte structure.)
> >>
> >> Perhaps we can make a spreadsheet and look comprehensively at how many
> >> > use cases would be disenfranchised by requiring UTC normalization
> >> > always.
> >>
> >>
> >> Hear, hear!
> >>
> >> Can we also poll people to find out how they're storing Instants today?
> >>
> >> Enjoy life,
> >> Adam
> >>
> >> --
> >> Adam Hooper
> >> +1-514-882-9694
> >> http://adamhooper.com
> >>
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
I've posted the examples above in
https://docs.google.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing
because I think it would be better to collaborate there instead of linear
e-mail history and then bring the consensus back to the list.

On Thu, Jun 17, 2021 at 2:56 PM Micah Kornfield <em...@gmail.com>
wrote:

> I feel like we might still be talking past each other here or at least I
> don't understand the two sides of this.  I'll try to expand Weston's
> example because I think it provides the best clarification.
>
> (1970, 1, 2, 14, 0) is stored as 0x000000000A4CB800 (172800000, assuming
> ms) for a timestamp column without timezone (always).   This represents an
> offset from the unix epoch.  This interpretation should not change based on
> the local system timezone.  Extracting the hour field always yields 14
> (extraction is done relative to UTC).
>
> The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
> multiple different ways depending on what the current local system time
> is.  As a note, I think ORC and Spark do this, and it leads to
> confusion/misinterpretation when trying to transfer data.
>
> If we then convert this column to a timestamp with timezone "UTC",
> extracting the hour field still yields 14.  If the column is instead
> converted to a timestamp with timezone "PST", extracting the hour would
> yield 6 (assuming PST = GMT-8).  Through all of these changes the data
> bits do not change.
>
> Display is not mentioned because I think the earlier points about display
> are correct: applications can choose what they feel makes sense to them
> (as long as they don't start automatically tacking timezones onto naive
> timestamps).  My interpretation of the specification has been that
> "display" was kind of shorthand for field extraction.
>
> Could others on the thread confirm this is the issue up for debate?  Are
> there subtleties/operations we need to consider?
>
> I also agree that we should document recommended conversion practices from
> other systems.
>
> -Micah
>
>
>  So let's invent a third way.  I could use
>> the first 16 bits for the year, the next 8 bits for the month, the
>> next 8 bits for the day of month, the next 8 bits for the hour, the
>> next 8 bits for the minute, and the remaining bits for the seconds.
>> Using this method I would store (1970, 1, 2, 14, 0) as
>> 0x07B201020E000000.
>
> As an aside, with some small variation this is what ZetaSQL uses [2].
>
> [1]
> https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
> [2]
> https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62
>
>
>
> On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com> wrote:
>
>> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> >
> >> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
>> > with/without time zone — in some SQL implementations each slot can
>> > have a different time zone
>> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
>> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
>> > or given as timezone local time"
>> >
>>
>> RDBMSs conflict (universally) with ANSI.
>>
>> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the epoch.
>> It has no timezone.
>>
>> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
>> since the epoch. It has no timezone.
>>
>> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
>> *function*, but not in implementation:
>>
>>    - MySQL DATETIME
>>    <
>> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
>> >
>>    is weird: 1-bit sign, 17-bit year-month, 5-bit day, ....
>>    - MSSQL
>>    <
>> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
>> >
>>    uses 6, 7 or 8 bytes
>>    - PostgreSQL stores an integer, but I think its epoch is still
>> different
>>    <
>> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
>> >
>> (it
>>    used to store doubles since 2000-01-01)
>>
>> ... so in general, moving datetimes from these systems into 64-bit
>> integers
>> is nontrivial and lossy.
>>
>> Spark / Databricks discusses how Spark handles this
>> >
>> >
>> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
>> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
>> > and are wall clock timestamps." — not UTC-normalized
>> > * WITH TIME ZONE: "does not affect the physical point in time that the
>> > timestamp represents, as that is fully represented by the UTC time
>> > instant given by the other timestamp components"
>> >
>>
>> I don't use Spark, but I read that page twice. First reading, I got the
>> same thing out of it. But the second time I read it, I read the opposite!
>>
>> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
>> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP
>> WITH
>> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp
>> type:
>> a 64-bit Instant since the epoch. (It also has a Date type.)
>>
>> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
>> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
>> timestamps as bare 64-bit integers since the epoch -- without timezone.
>>
>>
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>> > interpreted as UTC-normalized, that would force all of these other
>> > systems (and more) to serialize their data to be UTC-normalized
>>
>>
>> Are those systems' 64-bit integers interoperable in the first place?
>>
>> As I understand it, there's a ton of variance out there when encoding
>> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
>> is one of many. As I mentioned in another thread, programming languages
>> all
>> use structs.
>>
>> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
>> > timestamps and UTC-normalized WITH TIME ZONE.
>> >
>> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
>> > interpreted as UTC-normalized, that would force all of these other
>> > systems (and more) to serialize their data to be UTC-normalized (i.e.
>> > calling the equivalent of pandas's tz_localize function) when they
>> > convert to Arrow.
>>
>>
>> Alternatives:
>>
>>    - int64
>>    - date32+time64
>>    - date32+time32
>>
>> This seems very harmful to me, and will make data
>> > from these systems not accurately representable in Arrow and unable to
>> > be round-tripped.
>> >
>>
>> Certainly nobody wants to go backwards.
>>
>> We need to clarify: how do we store these *common* types -- MySQL
>> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
>> Arrow?
>>
>> Secondarily, I think: how do we recommend users store *datetimes* in
>> Arrow?
>> (I'd expect this to be messier, since every system/language uses a
>> different byte structure.)
>>
>> Perhaps we can make a spreadsheet and look comprehensively at how many
>> > use cases would be disenfranchised by requiring UTC normalization
>> > always.
>>
>>
>> Hear, hear!
>>
>> Can we also poll people to find out how they're storing Instants today?
>>
>> Enjoy life,
>> Adam
>>
>> --
>> Adam Hooper
>> +1-514-882-9694
>> http://adamhooper.com
>>
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
I feel like we might still be talking past each other here or at least I
don't understand the two sides of this.  I'll try to expand Weston's
example because I think it provides the best clarification.

(1970, 1, 2, 14, 0) is stored as 0x0000000008276700 (136800000, assuming
ms) for a timestamp column without timezone (always).  This represents an
offset from the unix epoch.  This interpretation should not change based on
the local system timezone.  Extracting the hour field always yields 14
(extraction is done relative to UTC).

The alternative here seems to be that we can encode (1970, 1, 2, 14, 0) in
multiple different ways depending on what the current local system
timezone is.  As a note, I think ORC and Spark do this, and it leads to
confusion/misinterpretation when trying to transfer data.

If we then convert this column to a timestamp with timezone "UTC",
extracting the hour field still yields 14.  If the column is instead
converted to a timestamp with timezone "PST", extracting the hour would
yield 6 (assuming PST = GMT-8).  Through all of these changes the data
bits do not change.
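A minimal Python sketch of these semantics (an illustration, not Arrow code; it assumes millisecond resolution and a fixed -8 hour offset for PST, and computes the epoch value directly from the wall-clock tuple):

```python
from datetime import datetime, timedelta, timezone

# The wall clock (1970, 1, 2, 14, 0), interpreted as an offset from the
# unix epoch (i.e. relative to UTC), in milliseconds.
ms = int(datetime(1970, 1, 2, 14, 0, tzinfo=timezone.utc).timestamp() * 1000)

# Field extraction relative to UTC: the hour is 14 on any machine.
assert datetime.fromtimestamp(ms / 1000, tz=timezone.utc).hour == 14

# Reinterpreting the same stored value with a PST (-8) timezone changes
# the extracted field, but the underlying bits never change.
pst = timezone(timedelta(hours=-8))
assert datetime.fromtimestamp(ms / 1000, tz=pst).hour == 6
```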

Display is not mentioned because I think the earlier points about display
are correct: applications can choose what they feel makes sense to them
(as long as they don't start automatically tacking timezones onto naive
timestamps).  My interpretation of the specification has been that
"display" was kind of shorthand for field extraction.

Could others on the thread confirm this is the issue up for debate?  Are
there subtleties/operations we need to consider?

I also agree that we should document recommended conversion practices from
other systems.

-Micah


 So let's invent a third way.  I could use
> the first 16 bits for the year, the next 8 bits for the month, the
> next 8 bits for the day of month, the next 8 bits for the hour, the
> next 8 bits for the minute, and the remaining bits for the seconds.
> Using this method I would store (1970, 1, 2, 14, 0) as
> 0x07B201020E000000.

As an aside, with some small variation this is what ZetaSQL uses [2].

[1] https://arrow.apache.org/docs/python/timestamps.html#pandas-arrow-spark
[2]
https://github.com/google/zetasql/blob/master/zetasql/public/civil_time.h#L62



On Thu, Jun 17, 2021 at 1:58 PM Adam Hooper <ad...@adamhooper.com> wrote:

> On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com> wrote:
>
> >
> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > with/without time zone — in some SQL implementations each slot can
> > have a different time zone
> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> > or given as timezone local time"
> >
>
> RDBMSs conflict (universally) with ANSI.
>
> PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the epoch.
> It has no timezone.
>
> MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
> since the epoch. It has no timezone.
>
> TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
> *function*, but not in implementation:
>
>    - MySQL DATETIME
>    <
> https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html
> >
>    is weird: 1-bit sign, 17-bit year-month, 5-bit day, ....
>    - MSSQL
>    <
> https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15
> >
>    uses 6, 7 or 8 bytes
>    - PostgreSQL stores an integer, but I think its epoch is still different
>    <
> https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h
> >
> (it
>    used to store doubles since 2000-01-01)
>
> ... so in general, moving datetimes from these systems into 64-bit integers
> is nontrivial and lossy.
>
> Spark / Databricks discusses how Spark handles this
> >
> >
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> > and are wall clock timestamps." — not UTC-normalized
> > * WITH TIME ZONE: "does not affect the physical point in time that the
> > timestamp represents, as that is fully represented by the UTC time
> > instant given by the other timestamp components"
> >
>
> I don't use Spark, but I read that page twice. First reading, I got the
> same thing out of it. But the second time I read it, I read the opposite!
>
> The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
> SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP WITH
> TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type:
> a 64-bit Instant since the epoch. (It also has a Date type.)
>
> If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
> WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
> timestamps as bare 64-bit integers since the epoch -- without timezone.
>
>
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized
>
>
> Are those systems' 64-bit integers interoperable in the first place?
>
> As I understand it, there's a ton of variance out there when encoding
> datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
> is one of many. As I mentioned in another thread, programming languages all
> use structs.
>
> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > timestamps and UTC-normalized WITH TIME ZONE.
> >
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> > calling the equivalent of pandas's tz_localize function) when they
> > convert to Arrow.
>
>
> Alternatives:
>
>    - int64
>    - date32+time64
>    - date32+time32
>
> This seems very harmful to me, and will make data
> > from these systems not accurately representable in Arrow and unable to
> > be round-tripped.
> >
>
> Certainly nobody wants to go backwards.
>
> We need to clarify: how do we store these *common* types -- MySQL
> TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in
> Arrow?
>
> Secondarily, I think: how do we recommend users store *datetimes* in Arrow?
> (I'd expect this to be messier, since every system/language uses a
> different byte structure.)
>
> Perhaps we can make a spreadsheet and look comprehensively at how many
> > use cases would be disenfranchised by requiring UTC normalization
> > always.
>
>
> Hear, hear!
>
> Can we also poll people to find out how they're storing Instants today?
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
On Thu, Jun 17, 2021 at 2:59 PM Wes McKinney <we...@gmail.com> wrote:

>
> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> with/without time zone — in some SQL implementations each slot can
> have a different time zone
> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> or given as timezone local time"
>

RDBMSs conflict (universally) with ANSI.

PostgreSQL TIMESTAMP WITH TIME ZONE is 64-bit int Instant since the epoch.
It has no timezone.

MySQL/MariaDB/BigTable/[your fork here] TIMESTAMP is also an int Instant
since the epoch. It has no timezone.

TIMESTAMP *WITHOUT* TIME ZONE is indeed akin to Numpy "naive datetime" in
*function*, but not in implementation:

   - MySQL DATETIME
   <https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html>
   is weird: 1-bit sign, 17-bit year-month, 5-bit day, ....
   - MSSQL
   <https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-ver15>
   uses 6, 7 or 8 bytes
   - PostgreSQL stores an integer, but I think its epoch is still different
   <https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/include/datatype/timestamp.h>
(it
   used to store doubles since 2000-01-01)

... so in general, moving datetimes from these systems into 64-bit integers
is nontrivial and lossy.
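As a hedged illustration of why such moves are nontrivial, here is a sketch of unpacking MySQL's packed DATETIME layout into epoch milliseconds (field widths taken from the linked MySQL internals page: 1-bit sign, 17-bit year*13+month, 5-bit day, 5-bit hour, 6-bit minute, 6-bit second; fractional seconds omitted, and the wall clock is treated as UTC -- which is exactly the lossy assumption under discussion):

```python
import calendar

def mysql_datetime_to_epoch_ms(packed: int) -> int:
    """Unpack MySQL's 40-bit DATETIME value and treat it as a UTC wall
    clock, returning milliseconds since the unix epoch."""
    second = packed & 0x3F           # bits 0-5
    minute = (packed >> 6) & 0x3F    # bits 6-11
    hour = (packed >> 12) & 0x1F     # bits 12-16
    day = (packed >> 17) & 0x1F      # bits 17-21
    year_month = (packed >> 22) & 0x1FFFF  # bits 22-38: year*13 + month
    year, month = divmod(year_month, 13)
    # Lossy assumption: interpret the wall clock as UTC.
    return calendar.timegm((year, month, day, hour, minute, second)) * 1000

# 1970-01-02 14:00:00 packed per the layout above (sign bit set)
packed = (1 << 39) | ((1970 * 13 + 1) << 22) | (2 << 17) | (14 << 12)
```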

Spark / Databricks discusses how Spark handles this
>
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> and are wall clock timestamps." — not UTC-normalized
> * WITH TIME ZONE: "does not affect the physical point in time that the
> timestamp represents, as that is fully represented by the UTC time
> instant given by the other timestamp components"
>

I don't use Spark, but I read that page twice. First reading, I got the
same thing out of it. But the second time I read it, I read the opposite!

The key part is: "*Spark SQL defines the timestamp type as TIMESTAMP WITH
SESSION TIME ZONE*," -- in other words, Spark doesn't have a TIMESTAMP WITH
TIME ZONE *or* TIMESTAMP WITHOUT TIME ZONE. It has just one Timestamp type:
a 64-bit Instant since the epoch. (It also has a Date type.)

If I'm reading correctly, this is exactly the same as PostgreSQL TIMESTAMP
WITH TIME ZONE and MySQL TIMESTAMP: servers always store and transmit
timestamps as bare 64-bit integers since the epoch -- without timezone.


> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized


Are those systems' 64-bit integers interoperable in the first place?

As I understand it, there's a ton of variance out there when encoding
datetimes as 64-bit integers. The Numpy+Arrow+Parquet+Postgres encoding
is one of many. As I mentioned in another thread, programming languages all
use structs.

pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> timestamps and UTC-normalized WITH TIME ZONE.
>
> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized (i.e.
> calling the equivalent of pandas's tz_localize function) when they
> convert to Arrow.


Alternatives:

   - int64
   - date32+time64
   - date32+time32
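For instance, splitting a UTC-normalized microsecond timestamp into a date32/time64 pair is a single divmod (a sketch, ignoring leap seconds):

```python
US_PER_DAY = 86_400_000_000  # microseconds per day

def split_timestamp_us(us: int) -> tuple[int, int]:
    """Split epoch microseconds into (date32 days since epoch,
    time64 microseconds since midnight)."""
    return divmod(us, US_PER_DAY)

# 1970-01-02 14:00:00 UTC in microseconds since the epoch
date32, time64 = split_timestamp_us(136_800_000_000)
```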

This seems very harmful to me, and will make data
> from these systems not accurately representable in Arrow and unable to
> be round-tripped.
>

Certainly nobody wants to go backwards.

We need to clarify: how do we store these *common* types -- MySQL
TIMESTAMP, Spark TIMESTAMP, PostgreSQL TIMESTAMP WITH TIME ZONE -- in Arrow?

Secondarily, I think: how do we recommend users store *datetimes* in Arrow?
(I'd expect this to be messier, since every system/language uses a
different byte structure.)

Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.


Hear, hear!

Can we also poll people to find out how they're storing Instants today?

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Curt Hagenlocher <cu...@hagenlocher.org>.
At some point, you just have to trust that a user is doing
semantically-meaningful operations. After all, they could also choose to
subtract a temperature from an elevation, or add feet to meters. It's
important to define the precise semantics of an operation, including the
assumptions it makes about the input(s). After that, it's up to a user to
ensure proper use.

On Thu, Jun 17, 2021 at 1:39 PM Weston Pace <we...@gmail.com> wrote:

> If a system does not store a local datetime using the UTC-normalized
> representation and they put it in an Arrow timestamp column without
> timezone, then how should an Arrow compute function extract a field?
>
> For a concrete example, let's assume I have the number 172800000 in a
> timestamp(ms) column with no time zone and the user has asked to
> extract the day of month.  I use 172800000 because it is in the
> parquet docs example[1].  I thought I could assume that the source
> system had normalized this value to UTC and so I could run something
> like `datetime.fromtimestamp(172800, tz=timezone.utc).day` and find out
> that it is 3.
>
> Perhaps, more concretely:
>
> There are many ways that one could store a datetime into a single
> number.  The parquet docs mention two different ways but they are
> really the same thing: figure out the epoch timestamp for that
> datetime in the UTC timezone (the instant at which a wall clock in UTC
> would show the desired wall clock time).  With this method the
> datetime (1970, 1, 2, 14, 0) is stored as 0x0000000008276700
> (136800000, assuming ms). So let's invent a third way.  I could use
> the first 16 bits for the year, the next 8 bits for the month, the
> next 8 bits for the day of month, the next 8 bits for the hour, the
> next 8 bits for the minute, and the remaining bits for the seconds.
> Using this method I would store (1970, 1, 2, 14, 0) as
> 0x07B201020E000000.
>
> If I understand your argument correctly it is that Arrow is not going
> to govern how these other systems encode a local datetime into an 8
> byte value and so both of those are valid representations of (1970, 1,
> 2, 14, 0).  As a result, there would be no possible way to write a
> uniform kernel for field extraction that would work in Arrow.
>
> Am I understanding you correctly?  Or have I misinterpreted things again,
> as I've already done that several times on this thread alone :)
>
> [1]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
>
> On Thu, Jun 17, 2021 at 8:59 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > To take a step back to focus on some concrete issues:
> >
> > Parquet has two timestamp types: with (UTC-normalized)/without time
> > zone (non-UTC-normalized)
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
> >
> > The SQL standard (e.g. PostgreSQL) has two timestamp types:
> > with/without time zone — in some SQL implementations each slot can
> > have a different time zone
> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> > or given as timezone local time"
> >
> > Spark / Databricks discusses how Spark handles this
> >
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> > and are wall clock timestamps." — not UTC-normalized
> > * WITH TIME ZONE: "does not affect the physical point in time that the
> > timestamp represents, as that is fully represented by the UTC time
> > instant given by the other timestamp components"
> >
> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > timestamps and UTC-normalized WITH TIME ZONE.
> >
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> > calling the equivalent of pandas's tz_localize function) when they
> > convert to Arrow. This seems very harmful to me, and will make data
> > from these systems not accurately representable in Arrow and unable to
> > be round-tripped.
> >
> > Perhaps we can make a spreadsheet and look comprehensively at how many
> > use cases would be disenfranchised by requiring UTC normalization
> > always.
> >
> > On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <ad...@adamhooper.com> wrote:
> > >
> > > On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <we...@gmail.com>
> wrote:
> > >
> > > > Arrow's "Timestamp with Timezone" can have fields extracted
> > > > from it.
> > > >
> > >
> > > Sure, one *can* extract fields from timestamp+tz. But I don't feel
> > > timestamp+tz is *designed* for extracting fields:
> > >
> > >    - Extracting fields from int64+tz is inefficient, because it
> bundles two
> > >    steps: 1) convert to datetime struct; and 2) return one field from
> the
> > >    datetime struct. (If I want to extract Year, Month, Day, is that
> three
> > >    function calls that *each* convert to datetime struct?)
> > >    - Extracting fields from int64+tz is awkward, because it's not
> obvious
> > >    which timezone is being used. (To extract fields in a custom
> timezone, must
> > >    I 1) clone the column with a new timezone; and 2) call the
> function?)
> > >
> > > My understanding of "best practice" for extracting multiple fields
> using
> > > Arrow's timestamp columns is:
> > >
> > > 1. Convert from timestamp column to date32 and/or time32/time64
> columns in
> > > one pass (one of three operations, perhaps: timestamp=>date32,
> > > timestamp=>time64, or timestamp=>struct{date32,time64})
> > > 2. Extract fields from those date32 and time64 columns.
> > >
> > > Only step 1 needs a timezone. In C, the analogue is localtime().
> > >
> > > We do step 1 at Workbench -- see converttimestamptodate
> > > <
> https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py
> >
> > > for
> > > our implementation. We haven't had much demand for step 2, so we'll
> get to
> > > it later.
> > >
> > > I think of this "best practice" as a compromise:
> > >
> > >    - date32+time64 aren't as time-efficient as C's struct tm, but
> together
> > >    they use 12 bytes whereas the C struct costs 50-100 bytes.
> > >    - date32+time64 are 50% less space-efficient than int64, but they're
> > >    intuitive and they save time.
> > >
> > > A small benchmark to prove that "save time" assertion in Python:
> > >
> > > >>> import datetime, os, time, timeit
> > > >>> os.environ['TZ'] = 'America/Montreal'
> > > >>> time.tzset()
> > > >>> timestamp = time.time()
> > > >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> > > 0.2955563920113491
> > > >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
> > > timeit overhead + tuple construction
> > > 0.2509278700017603
> > >
> > > Most of the test is overhead; but certainly the timestamp=>date
> conversion
> > > takes time, and it's sane to try and minimize that overhead.
> > >
> > > Enjoy life,
> > > Adam
> > >
> > > --
> > > Adam Hooper
> > > +1-514-882-9694
> > > http://adamhooper.com
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
If a system does not store a local datetime using the UTC-normalized
representation and they put it in an Arrow timestamp column without
timezone, then how should an Arrow compute function extract a field?

For a concrete example, let's assume I have the number 172800000 in a
timestamp(ms) column with no time zone and the user has asked to
extract the day of month.  I use 172800000 because it is in the
parquet docs example[1].  I thought I could assume that the source
system had normalized this value to UTC and so I could run something
like `datetime.fromtimestamp(172800, tz=timezone.utc).day` and find out
that it is 3.

Perhaps, more concretely:

There are many ways that one could store a datetime into a single
number.  The parquet docs mention two different ways but they are
really the same thing: figure out the epoch timestamp for that
datetime in the UTC timezone (the instant at which a wall clock in UTC
would show the desired wall clock time).  With this method the
datetime (1970, 1, 2, 14, 0) is stored as 0x0000000008276700
(136800000, assuming ms). So let's invent a third way.  I could use
the first 16 bits for the year, the next 8 bits for the month, the
next 8 bits for the day of month, the next 8 bits for the hour, the
next 8 bits for the minute, and the remaining bits for the seconds.
Using this method I would store (1970, 1, 2, 14, 0) as
0x07B201020E000000.
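That invented packing can be sketched directly (the function name is mine, not from any library):

```python
def pack_civil(year: int, month: int, day: int, hour: int,
               minute: int, seconds: int = 0) -> int:
    """Pack a wall-clock tuple into the 16/8/8/8/8/16-bit layout above."""
    return ((year << 48) | (month << 40) | (day << 32)
            | (hour << 24) | (minute << 16) | seconds)

assert pack_civil(1970, 1, 2, 14, 0) == 0x07B201020E000000
```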

If I understand your argument correctly it is that Arrow is not going
to govern how these other systems encode a local datetime into an 8
byte value and so both of those are valid representations of (1970, 1,
2, 14, 0).  As a result, there would be no possible way to write a
uniform kernel for field extraction that would work in Arrow.

Am I understanding you correctly?  Or have I misinterpreted things again,
as I've already done that several times on this thread alone :)

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc

On Thu, Jun 17, 2021 at 8:59 AM Wes McKinney <we...@gmail.com> wrote:
>
> To take a step back to focus on some concrete issues:
>
> Parquet has two timestamp types: with (UTC-normalized)/without time
> zone (non-UTC-normalized)
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
>
> The SQL standard (e.g. PostgreSQL) has two timestamp types:
> with/without time zone — in some SQL implementations each slot can
> have a different time zone
> https://www.postgresql.org/docs/9.1/datatype-datetime.html
> WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> or given as timezone local time"
>
> Spark / Databricks discusses how Spark handles this
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> and are wall clock timestamps." — not UTC-normalized
> * WITH TIME ZONE: "does not affect the physical point in time that the
> timestamp represents, as that is fully represented by the UTC time
> instant given by the other timestamp components"
>
> pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> timestamps and UTC-normalized WITH TIME ZONE.
>
> If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> interpreted as UTC-normalized, that would force all of these other
> systems (and more) to serialize their data to be UTC-normalized (i.e.
> calling the equivalent of pandas's tz_localize function) when they
> convert to Arrow. This seems very harmful to me, and will make data
> from these systems not accurately representable in Arrow and unable to
> be round-tripped.
>
> Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.
>
> On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <ad...@adamhooper.com> wrote:
> >
> > On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <we...@gmail.com> wrote:
> >
> > > Arrow's "Timestamp with Timezone" can have fields extracted
> > > from it.
> > >
> >
> > Sure, one *can* extract fields from timestamp+tz. But I don't feel
> > timestamp+tz is *designed* for extracting fields:
> >
> >    - Extracting fields from int64+tz is inefficient, because it bundles two
> >    steps: 1) convert to datetime struct; and 2) return one field from the
> >    datetime struct. (If I want to extract Year, Month, Day, is that three
> >    function calls that *each* convert to datetime struct?)
> >    - Extracting fields from int64+tz is awkward, because it's not obvious
> >    which timezone is being used. (To extract fields in a custom timezone, must
> >    I 1) clone the column with a new timezone; and 2) call the function?)
> >
> > My understanding of "best practice" for extracting multiple fields using
> > Arrow's timestamp columns is:
> >
> > 1. Convert from timestamp column to date32 and/or time32/time64 columns in
> > one pass (one of three operations, perhaps: timestamp=>date32,
> > timestamp=>time64, or timestamp=>struct{date32,time64})
> > 2. Extract fields from those date32 and time64 columns.
> >
> > Only step 1 needs a timezone. In C, the analogue is localtime().
> >
> > We do step 1 at Workbench -- see converttimestamptodate
> > <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
> > for
> > our implementation. We haven't had much demand for step 2, so we'll get to
> > it later.
> >
> > I think of this "best practice" as a compromise:
> >
> >    - date32+time64 aren't as time-efficient as C's struct tm, but together
> >    they use 12 bytes whereas the C struct costs 50-100 bytes.
> >    - date32+time64 are 50% less space-efficient than int64, but they're
> >    intuitive and they save time.
> >
> > A small benchmark to prove that "save time" assertion in Python:
> >
> > >>> import datetime, os, time, timeit
> > >>> os.environ['TZ'] = 'America/Montreal'
> > >>> time.tzset()
> > >>> timestamp = time.time()
> > >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> > 0.2955563920113491
> > >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
> > timeit overhead + tuple construction
> > 0.2509278700017603
> >
> > Most of the test is overhead; but certainly the timestamp=>date conversion
> > takes time, and it's sane to try and minimize that overhead.
> >
> > Enjoy life,
> > Adam
> >
> > --
> > Adam Hooper
> > +1-514-882-9694
> > http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
To take a step back to focus on some concrete issues:

Parquet has two timestamp types: with (UTC-normalized)/without time
zone (non-UTC-normalized)
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268

The SQL standard (e.g. PostgreSQL) has two timestamp types:
with/without time zone — in some SQL implementations each slot can
have a different time zone
https://www.postgresql.org/docs/9.1/datatype-datetime.html
WITHOUT TIME ZONE: "timestamp without time zone value should be taken
or given as timezone local time"

Spark / Databricks discusses how Spark handles this
https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
* WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
and are wall clock timestamps." — not UTC-normalized
* WITH TIME ZONE: "does not affect the physical point in time that the
timestamp represents, as that is fully represented by the UTC time
instant given by the other timestamp components"

pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
timestamps and UTC-normalized WITH TIME ZONE.

If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
interpreted as UTC-normalized, that would force all of these other
systems (and more) to serialize their data to be UTC-normalized (i.e.
calling the equivalent of pandas's tz_localize function) when they
convert to Arrow. This seems very harmful to me, and will make data
from these systems not accurately representable in Arrow and unable to
be round-tripped.

Perhaps we can make a spreadsheet and look comprehensively at how many
use cases would be disenfranchised by requiring UTC normalization
always.

On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper <ad...@adamhooper.com> wrote:
>
> On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <we...@gmail.com> wrote:
>
> > Arrow's "Timestamp with Timezone" can have fields extracted
> > from it.
> >
>
> Sure, one *can* extract fields from timestamp+tz. But I don't feel
> timestamp+tz is *designed* for extracting fields:
>
>    - Extracting fields from int64+tz is inefficient, because it bundles two
>    steps: 1) convert to datetime struct; and 2) return one field from the
>    datetime struct. (If I want to extract Year, Month, Day, is that three
>    function calls that *each* convert to datetime struct?)
>    - Extracting fields from int64+tz is awkward, because it's not obvious
>    which timezone is being used. (To extract fields in a custom timezone, must
>    I 1) clone the column with a new timezone; and 2) call the function?)
>
> My understanding of "best practice" for extracting multiple fields using
> Arrow's timestamp columns is:
>
> 1. Convert from timestamp column to date32 and/or time32/time64 columns in
> one pass (one of three operations, perhaps: timestamp=>date32,
> timestamp=>time64, or timestamp=>struct{date32,time64})
> 2. Extract fields from those date32 and time64 columns.
>
> Only step 1 needs a timezone. In C, the analogue is localtime().
>
> We do step 1 at Workbench -- see converttimestamptodate
> <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
> for
> our implementation. We haven't had much demand for step 2, so we'll get to
> it later.
>
> I think of this "best practice" as a compromise:
>
>    - date32+time64 aren't as time-efficient as C's struct tm, but together
>    they use 12 bytes whereas the C struct costs 50-100 bytes.
>    - date32+time64 are 50% less space-efficient than int64, but they're
>    intuitive and they save time.
>
> A small benchmark to prove that "save time" assertion in Python:
>
> >>> import datetime, os, time, timeit
> >>> os.environ['TZ'] = 'America/Montreal'
> >>> time.tzset()
> >>> timestamp = time.time()
> >>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
> 0.2955563920113491
> >>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
> timeit overhead + tuple construction
> 0.2509278700017603
>
> Most of the test is overhead; but certainly the timestamp=>date conversion
> takes time, and it's sane to try and minimize that overhead.
>
> Enjoy life,
> Adam
>
> --
> Adam Hooper
> +1-514-882-9694
> http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <we...@gmail.com> wrote:

> Arrow's "Timestamp with Timezone" can have fields extracted
> from it.
>

Sure, one *can* extract fields from timestamp+tz. But I don't feel
timestamp+tz is *designed* for extracting fields:

   - Extracting fields from int64+tz is inefficient, because it bundles two
   steps: 1) convert to datetime struct; and 2) return one field from the
   datetime struct. (If I want to extract Year, Month, Day, is that three
   function calls that *each* convert to datetime struct?)
   - Extracting fields from int64+tz is awkward, because it's not obvious
   which timezone is being used. (To extract fields in a custom timezone, must
   I 1) clone the column with a new timezone; and 2) call the function?)

My understanding of "best practice" for extracting multiple fields using
Arrow's timestamp columns is:

1. Convert from timestamp column to date32 and/or time32/time64 columns in
one pass (one of three operations, perhaps: timestamp=>date32,
timestamp=>time64, or timestamp=>struct{date32,time64})
2. Extract fields from those date32 and time64 columns.

Only step 1 needs a timezone. In C, the analogue is localtime().
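(A stdlib-Python sketch of those two steps; the helper name is made up
for illustration and the epoch value is only an example:)

```python
from datetime import datetime, timezone

# Step 1 (the only step that needs a timezone): timestamp => (date, time).
# This plays the role of C's localtime().
def split_timestamp(epoch_seconds, tz):
    dt = datetime.fromtimestamp(epoch_seconds, tz)
    return dt.date(), dt.time()

# Step 2 (no timezone needed): extract fields from the date/time parts.
d, t = split_timestamp(1623772800, timezone.utc)  # 2021-06-15 16:00:00 UTC
assert (d.year, d.month, d.day) == (2021, 6, 15)
assert t.hour == 16
```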

We do step 1 at Workbench -- see converttimestamptodate
<https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
for
our implementation. We haven't had much demand for step 2, so we'll get to
it later.

I think of this "best practice" as a compromise:

   - date32+time64 aren't as time-efficient as C's struct tm, but together
   they use 12 bytes whereas the C struct costs 50-100 bytes.
   - date32+time64 are 50% less space-efficient than int64, but they're
   intuitive and they save time.

A small benchmark to prove that "save time" assertion in Python:

>>> import datetime, os, time, timeit
>>> os.environ['TZ'] = 'America/Montreal'
>>> time.tzset()
>>> timestamp = time.time()
>>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
0.2955563920113491
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline:
timeit overhead + tuple construction
0.2509278700017603

Most of the test is overhead; but certainly the timestamp=>date conversion
takes time, and it's sane to try and minimize that overhead.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
Thanks for the excellent summary everyone.  I agree with the
summaries that have been given.  It seems like things are moving
towards consensus.

> I think Instant is what is represented as Arrow's Timestamp with Timezone.
> I don't think Arrow has a type for DateTime because we don't have any type
> that allows for different time zones per slot in an Array (which I think
> would be the expectation for DateTime).

I'm not sure it's fair to call Arrow's Timestamp with Timezone an
Instant.  An Instant is limited: it cannot have fields extracted from
it, etc.  Arrow's "Timestamp with Timezone" can have fields extracted
from it.  Technically the types Instant / DateTime (ZonedDateTime) /
LocalDateTime are all scalar monikers so it's unclear that requiring
the timestamp be consistent across a column would disqualify you from
calling it DateTime / ZonedDateTime.  Maybe instead something like
FixedZonedDateTime and VariableZonedDateTime in the same way "list"
becomes FixedList or VariableList when you store a column of them.

As it seems we are approaching consensus, I think we should clarify the
docs.  First, schema.fbs should probably be changed so that "This data
can be though of as UTC" reads something like "This data will be
treated as UTC", and it should state that libraries need to convert a
datetime to UTC before storing it in this field.  For example, if I
read a list of integers from some JSON file in Python and I knew they
were Unix timestamps in seconds, my naive assumption would be to do
pa.array(timestamp, type=pa.timestamp('s')), and that would be a bad
idea.
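(A minimal stdlib sketch of that pitfall; the epoch value is only
illustrative.  Raw integers carry no reference frame by themselves, so
making the UTC interpretation explicit before storage removes the
ambiguity this thread is debating:)

```python
from datetime import datetime, timezone

raw = [1623772800]  # known from context to be Unix epoch seconds (UTC)

# Convert through a tz-aware datetime so the UTC reference is explicit,
# instead of storing bare integers whose meaning is ambiguous.
explicit = [datetime.fromtimestamp(s, timezone.utc) for s in raw]
assert explicit[0].isoformat() == '2021-06-15T16:00:00+00:00'
```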

However, it seems like there is a lot of information in these two
email chains that is not going to easily fit into schema.fbs.  I think
it'd be worthwhile to create some kind of standalone FAQ page for
working with temporal data.  I'll take a stab at this today before I
forget everything.

On Tue, Jun 15, 2021 at 6:38 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Actually after some thought
>
> I think Instant is what is represented as Arrow's Timestamp with Timezone.
> I don't think Arrow has a type for DateTime because we don't have any type
> that allows for different time zones per slot in an Array (which I think
> would be the expectation for DateTime).
>
> On Tue, Jun 15, 2021 at 9:23 AM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > My interpretation has always been:
> >
> > > * Instant - an instantaneous point on the time-line
> > > * DateTime - full date and time with time-zone
> >
> > These two do not have distinct types and are both handled via timestamp
> > with a timezone.
> >
> > > * LocalDateTime - date-time without a time-zone
> >
> > Arrow Timestamp without timezone (although as noted above the
> > representation we've used makes this the cause for debate).
> >
> > I guess the alternative would be to have a first class LocalDateTime type.
> >
> >
> >
> > On Tue, Jun 15, 2021 at 9:14 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >>
> >> Le 15/06/2021 à 16:53, Adam Hooper a écrit :
> >> >     - *"Datetime"* lets you extract fields, parse strings, format to
> >> string.
> >> >     You can't sort (because clocks sometimes go backwards). You can't
> >> convert
> >> >     between timestamps and future datetimes (because timezones change).
> >>
> >> Not true if the timezone is UTC, though.
> >> (which is a strong argument, IMHO, for representing all values in the
> >> UTC reference, regardless of any optional "timezone" information)
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
Actually after some thought

I think Instant is what is represented as Arrow's Timestamp with Timezone.
I don't think Arrow has a type for DateTime because we don't have any type
that allows for different time zones per slot in an Array (which I think
would be the expectation for DateTime).

On Tue, Jun 15, 2021 at 9:23 AM Micah Kornfield <em...@gmail.com>
wrote:

> My interpretation has always been:
>
> > * Instant - an instantaneous point on the time-line
> > * DateTime - full date and time with time-zone
>
> These two do not have distinct types and are both handled via timestamp
> with a timezone.
>
> > * LocalDateTime - date-time without a time-zone
>
> Arrow Timestamp without timezone (although as noted above the
> representation we've used makes this the cause for debate).
>
> I guess the alternative would be to have a first class LocalDateTime type.
>
>
>
> On Tue, Jun 15, 2021 at 9:14 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Le 15/06/2021 à 16:53, Adam Hooper a écrit :
>> >     - *"Datetime"* lets you extract fields, parse strings, format to
>> string.
>> >     You can't sort (because clocks sometimes go backwards). You can't
>> convert
>> >     between timestamps and future datetimes (because timezones change).
>>
>> Not true if the timezone is UTC, though.
>> (which is a strong argument, IMHO, for representing all values in the
>> UTC reference, regardless of any optional "timezone" information)
>>
>> Regards
>>
>> Antoine.
>>
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
My interpretation has always been:

> * Instant - an instantaneous point on the time-line
> * DateTime - full date and time with time-zone

These two do not have distinct types and are both handled via timestamp
with a timezone.

> * LocalDateTime - date-time without a time-zone

Arrow Timestamp without timezone (although as noted above the
representation we've used makes this the cause for debate).

I guess the alternative would be to have a first class LocalDateTime type.



On Tue, Jun 15, 2021 at 9:14 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 15/06/2021 à 16:53, Adam Hooper a écrit :
> >     - *"Datetime"* lets you extract fields, parse strings, format to
> string.
> >     You can't sort (because clocks sometimes go backwards). You can't
> convert
> >     between timestamps and future datetimes (because timezones change).
>
> Not true if the timezone is UTC, though.
> (which is a strong argument, IMHO, for representing all values in the
> UTC reference, regardless of any optional "timezone" information)
>
> Regards
>
> Antoine.
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Le 15/06/2021 à 16:53, Adam Hooper a écrit :
>     - *"Datetime"* lets you extract fields, parse strings, format to string.
>     You can't sort (because clocks sometimes go backwards). You can't convert
>     between timestamps and future datetimes (because timezones change).

Not true if the timezone is UTC, though.
(which is a strong argument, IMHO, for representing all values in the 
UTC reference, regardless of any optional "timezone" information)

Regards

Antoine.

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
A lot of terms in this thread confuse me. Here are my interpretations:

*"Timestamp" vs "datetime"*: most languages/libraries have two types (with
various names). "Timestamp" is a moment on the timeline; "datetime" is a
date plus a time, as seen on a calendar and clock.

The only outliers that spring to mind are SQL (more on this later) and
Arrow's own ecosystem. Arrow mentions "naive timestamps" (more on this
later); Parquet calls its datetimes "timestamp not adjusted for UTC";
Pandas calls its datetimes "Timestamp".

The names aren't always "timestamp" and "datetime". (C calls them "time_t"
and "struct tm"; Python calls them "time" and "datetime"; Java calls them
"Instant" and "LocalDateTime"; and so on.) But the *operations* are
universal:

   - *"Timestamp"* lets you sort times and compute differences along the
   timeline. You can't extract fields.
   - *"Datetime"* lets you extract fields, parse strings, format to string.
   You can't sort (because clocks sometimes go backwards). You can't convert
   between timestamps and future datetimes (because timezones change).

*"Naive timestamp"*: I googled this, and I think it's unique to Arrow and
Pandas (not even Parquet or Numpy or Python).

Python and other libraries have "naive *datetime*". That seems common.

But by the above definitions, "naive timestamp" is an oxymoron.

I propose that whatever the outcome of this discussion, all Arrow-related
documentation should avoid the term "naive timestamp". (Parquet spec
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc>
allows
int64-encoded datetimes, and it avoids the word "naive".)

*"TIMESTAMP WITH TIME ZONE" vs "TIMESTAMP WITHOUT TIME ZONE"*: I've seen
these terms in SQL. They mean different things on different RDBMSs. I don't
know who uses which RDBMS on this mailing list, so I'm not sure what people
mean.

Oracle
<https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch4datetime.htm>
"TIMESTAMP WITH TIME ZONE" stores (timestamp, tz) tuple; "TIMESTAMP WITH
LOCAL TIME ZONE" means (datetime stored as server timezone); "TIMESTAMP"
means (datetime). I suppose the de-facto way to store UTC timestamps
without timezone information is to set server timezone to UTC and store
"TIMESTAMP WITH LOCAL TIME ZONE"?

PostgreSQL <https://www.postgresql.org/docs/13/datatype-datetime.html>
"TIMESTAMP WITHOUT TIME ZONE" means (datetime) and "TIMESTAMP WITH TIME
ZONE" means (timestamp). The way to store UTC timestamps *without timezone
information* is "TIMESTAMP WITH TIME ZONE".

MySQL <https://dev.mysql.com/doc/refman/8.0/en/datetime.html> "TIMESTAMP"
means (timestamp) and "DATETIME" means (datetime). The way to store UTC
timestamps without timezone information is "TIMESTAMP".

I think all RDBMSs encode datetimes as int64, though not always as epoch
offsets.

None of these RDBMSs embed timezone information on timestamp columns.

*Why we use time zones*

Time zones are for converting timestamp=>datetime and datetime=>timestamp.

In my experience, the easiest approach is to specify the timezone during
conversion.

I propose Arrow's documentation should encourage explicit timezone
parameters during conversions.
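(A stdlib sketch of what explicit timezone parameters look like; the
fixed UTC-4 offset stands in for America/Montreal purely for
illustration — real code would use a named zone database:)

```python
from datetime import datetime, timezone, timedelta

montreal = timezone(timedelta(hours=-4))  # illustrative fixed offset only

# timestamp => datetime: the timezone is a parameter of the conversion.
dt = datetime.fromtimestamp(1623772800, montreal)
assert (dt.hour, dt.minute) == (12, 0)  # 16:00 UTC is 12:00 at UTC-4

# datetime => timestamp: the aware datetime carries its offset along.
assert dt.timestamp() == 1623772800.0
```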

*Timestamps without timezone information*

My selfish aim here is to figure out: how do I store a timestamp without
timezone information?

I'm +1 Antoine's suggestion of using naive timestamps, because A) my
pedantic reading of the existing spec supports this; B) it's a
near-universal best practice; and C) I'm already doing it ;).

I dislike storing timezone=UTC because timezone=UTC suggests the values are
simultaneously "timestamp" and "datetime". In Java, for instance: would a
timezone=UTC column convert to ZonedDateTime or to Instant?

I hope we all agree that UTC timestamps without timezones are best practice
in many domains. I hope we also agree that Arrow should heavily encourage
UTC timestamps without timezones.

*Parquet, Pandas*

Another important piece of the puzzle is, "how to convert between
Parquet/Pandas and Arrow?"

Perhaps another metadata field could be added to TIMESTAMP: a Parquet-alike
"isAdjustedToUTC". And users could be advised to avoid the "timestamp"
field.

Or datetimes could be stored as int64 (without metadata, or with
Pandas-specific metadata).

Or Arrow could introduce a new type, datetime64 (without metadata).

I think the most *confusing* outcome would be for timezone=UTC to mean
"Instant" and timezone=null to mean "ZonedDateTime". That would require
very, very clear documentation because the documentation would contradict
the actual terms: "Timestamp is only timestamp if you set timezone=UTC";
"Timestamp with timezone=null means datetime, not timestamp"; etc. Arrow
would be just another voice in SQL's sea of confusion.

Enjoy life,
Adam



On Tue, Jun 15, 2021 at 8:33 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 15/06/2021 à 12:57, Joris Van den Bossche a écrit :
> > A general observation: it might be useful to get back to the message
> > of Julian Hyde in the previous email thread about this 2 weeks ago
> > (
> https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev.arrow.apache.org%3E
> ).
> > Quoting part of that email:
> >
> > On Wed, 2 Jun 2021 at 21:39, Julian Hyde <jh...@apache.org> wrote:
> >>
> >> Good time libraries support all. E.g. Jodatime [1] has
> >>
> >> * Instant - an instantaneous point on the time-line
> >> * DateTime - full date and time with time-zone
> >> * LocalDateTime - date-time without a time-zone
> >>
> >> ...
> >>
> >> I recommend that Arrow supports all three. Choose clear, distinct
> >> names for all three, consistent with names used elsewhere in the
> >> industry.
> >
> > It seems to me that we are discussing whether our "timestamp without
> > timezone" should be interpreted as a LocalDateTime or as an Instant
> > (since interpreting it as UTC makes it an Instant, I think). Is that a
> > correct / helpful framing?
>
> That is correct, IMHO.
>
>

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Le 15/06/2021 à 12:57, Joris Van den Bossche a écrit :
> A general observation: it might be useful to get back to the message
> of Julian Hyde in the previous email thread about this 2 weeks ago
> (https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev.arrow.apache.org%3E).
> Quoting part of that email:
> 
> On Wed, 2 Jun 2021 at 21:39, Julian Hyde <jh...@apache.org> wrote:
>>
>> Good time libraries support all. E.g. Jodatime [1] has
>>
>> * Instant - an instantaneous point on the time-line
>> * DateTime - full date and time with time-zone
>> * LocalDateTime - date-time without a time-zone
>>
>> ...
>>
>> I recommend that Arrow supports all three. Choose clear, distinct
>> names for all three, consistent with names used elsewhere in the
>> industry.
> 
> It seems to me that we are discussing whether our "timestamp without
> timezone" should be interpreted as a LocalDateTime or as an Instant
> (since interpreting it as UTC makes it an Instant, I think). Is that a
> correct / helpful framing?

That is correct, IMHO.


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Joris Van den Bossche <jo...@gmail.com>.
A general observation: it might be useful to get back to the message
of Julian Hyde in the previous email thread about this 2 weeks ago
(https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev.arrow.apache.org%3E).
Quoting part of that email:

On Wed, 2 Jun 2021 at 21:39, Julian Hyde <jh...@apache.org> wrote:
>
> Good time libraries support all. E.g. Jodatime [1] has
>
> * Instant - an instantaneous point on the time-line
> * DateTime - full date and time with time-zone
> * LocalDateTime - date-time without a time-zone
>
> ...
>
> I recommend that Arrow supports all three. Choose clear, distinct
> names for all three, consistent with names used elsewhere in the
> industry.

It seems to me that we are discussing whether our "timestamp without
timezone" should be interpreted as a LocalDateTime or as an Instant
(since interpreting it as UTC makes it an Instant, I think). Is that a
correct / helpful framing?

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Tue, 15 Jun 2021 at 10:11, Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 15/06/2021 à 09:31, Joris Van den Bossche a écrit :
> >
> > (but I also don't fully understand your point here, as your "they
> > would get the correct histogram" seems to imply a positive statement
> > for tz-naive timestamps, while your email starts with a +1 on
> > Antoine's proposal which, as far as I understand it, says that
> > timestamps without timezone are useless / should be interpreted as UTC
> > instead (which makes your above described scenario impossible)).
>
> My proposal is that timestamps without timezone should be interpreted as
> UTC.  I don't get how that makes them "useless".  In my view, that makes
> them far more useful than if we don't know their base of reference
> (because then most operations you can do on them will give
> uninterpretable data).
>

Note that the "useless" was your wording about my interpretation of
timestamps without timezone as "unknown local timezone" (so my above
statement should probably have been phrased as ".. are either useless
or should be interpreted as UTC").
So I didn't want to imply that interpreting timestamps without
timezone as UTC is useless. That's certainly a clear interpretation
(and a useful abstraction, given earlier references to Java's
"instant" which is kind of similar AFAIU), but it's a *different*
interpretation from how I understand the current spec, and changing our
interpretation has consequences.

First, there are systems that have the notion of tz-naive local
timestamps / TIMESTAMP WITHOUT TIMEZONE (and without interpreting it
as UTC).
Some examples I am aware of are pandas, most database systems
(although with varying names), Jodatime's LocalDateTime, etc. If we
want to support those systems, Arrow needs to have an equivalent
timezone-less type. To quote Wes from his last email about dropping
the timezone-less timestamp: "I don't think that is something we can
do at this time lest we lose the ability to have high-fidelity
interoperability with other systems."

In addition, I will continue to argue that, depending on your
application, it *can* be reasonable to work with timestamps without a
timezone. Certainly, such timestamps don't contain information about
the absolute time point, and thus are inherently ambiguous for certain
operations. But as Wes mentioned before, there are still many
analytical operations that you can do on timezone-less data without
any problem or ambiguity (such as aggregating by year or month, or
even the hour of the day).
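For example (a small stdlib sketch), field extraction on a naive
datetime is well defined without any timezone at all:

```python
from datetime import datetime

# A tz-naive wall-clock value: its fields are unambiguous on their own.
naive = datetime(2021, 6, 15, 9, 18, 48)
assert (naive.year, naive.month, naive.hour) == (2021, 6, 9)
# Only operations that need an absolute instant (comparisons across
# zones, epoch conversion) are ambiguous for such values.
```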

Joris

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Le 15/06/2021 à 09:31, Joris Van den Bossche a écrit :
> 
> (but I also don't fully understand your point here, as your "they
> would get the correct histogram" seems to imply a positive statement
> for tz-naive timestamps, while your email starts with a +1 on
> Antoine's proposal which, as far as I understand it, says that
> timestamps without timezone are useless / should be interpreted as UTC
> instead (which makes your above described scenario impossible)).

My proposal is that timestamps without timezone should be interpreted as 
UTC.  I don't get how that makes them "useless".  In my view, that makes 
them far more useful than if we don't know their base of reference 
(because then most operations you can do on them will give 
uninterpretable data).


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Joris Van den Bossche <jo...@gmail.com>.
Some inline answers to Weston's email below:

On Tue, 15 Jun 2021 at 07:34, Weston Pace <we...@gmail.com> wrote:
> ...
> Let's pretend two astronomers observe a meteoroid impact on the moon.
> We are talking about two different ways they can record the time.  The
> first method, universal time, is done by recording the seconds since
> the epoch.  The second, wall clock time, is done by writing down the
> time seen on a clock (and nearby calendar).
>
> In both cases we do not know the full picture without the time zone
> information.  If we have two universal times (but no time zones) we
> can say whether the two astronomers witnessed the same event (assuming
> the impact site is equal) but we can't say whether they saw it at the
> same time of day (e.g. whether the two astronomers had both just
> finished dinner).

That's the reason we have a TIMESTAMP WITH TIME ZONE type, with which
you can have this full picture.

> ...
> Rather than store wall clock time as a string (which is inefficient)
> Arrow stores wall clock time as the epoch timestamp at the point a
> wall clock in the UTC time zone would display the given time.  In
> other words, converting datetime.datetime.now to an Arrow timestamp
> does NOT give the current UNIX epoch.  The value that is stored is
> different for every time zone.  Or to put it yet another way.  The
> output of the following program...
>
> import pyarrow as pa
> import datetime
> pa.array([datetime.datetime.strptime('Jun 28 2018 7:40AM',
>          '%b %d %Y %I:%M%p')]).cast(pa.int64()).to_pylist()[0]
>
> ...will be identical on every machine.  But the output of...
>
> import pyarrow as pa
> import datetime
> pa.array([datetime.datetime.now()]).cast(pa.int64()).to_pylist()[0]
>
> ...will depend on the system time zone (ostensibly because the output
> of datetime.datetime.now() depends on the system time zone).

What you describe here is the behaviour of Python's datetime module,
not of Arrow. It's datetime.datetime.now() that is dependent on the
system time zone, but from Arrow's perspective, it just gets a naive
datetime in both cases, and handles those consistently.
So it's the responsibility of the user to decide whether they are OK
with the behaviour of datetime.datetime.now().
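(A sketch of that split of responsibility: datetime.now() yields a
tz-naive local wall-clock value, while passing a timezone yields an
aware one — the choice is made before Arrow ever sees the data:)

```python
from datetime import datetime, timezone

naive = datetime.now()              # system-local wall clock, no tzinfo
aware = datetime.now(timezone.utc)  # the same instant, explicitly UTC

assert naive.tzinfo is None
assert aware.tzinfo is timezone.utc
```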

> ---
>
> So given my previous concrete example I said...
>
> > For each observation they record the unix timestamp (or maybe
> > they build up an instance of datetime objects created with
> > datetime.datetime.now())
>
> These two methods would actually yield different results.  If they
> created a pa.array([ts1, ts2], type=pa.timestamp('s')) with unix
> timestamps recorded at the time of the event then they would get the
> wrong histogram.
>
> If they created a pa.array([dt1, dt2], type=pa.timestamp('s')) with
> datetime.datetime objects created with datetime.datetime.now at the
> time of the event then they would get the correct histogram.

And is it a useful application that they can get the correct histogram
by using naive timestamps? It's probably debatable whether this is
"best practice", or whether it should rather be recommended to use
timestamps with timezones to get the same effect. But IMO it is not up
to Arrow to be opinionated in this "how should I use timezones"
debate, but to enable widespread behaviour/usage patterns for
downstream libraries.

(but I also don't fully understand your point here, as your "they
would get the correct histogram" seems to imply a positive statement
for tz-naive timestamps, while your email starts with a +1 on
Antoine's proposal which, as far as I understand it, says that
timestamps without timezone are useless / should be interpreted as UTC
instead (which makes your above described scenario impossible)).

> >
> > >  TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > > field is not set. We have stated that we want systems to use
> > > system-locale-independent choices for functions that act on this data
> > > (like stringification or field extraction)
> >
> > This is indeed a rehash of an earlier discussion where I agreed with
> > you but I think I understand the subtleties a bit more and now I
> > disagree, particularly on field extraction.  Field extraction can be
> > done on a naive "datetime" without assuming UTC which I think makes it
> > safer for Python.  Field extraction cannot be done on a naive
> > "timestamp" without assuming UTC.
> >
> > # Stringification
> >
> > I think we can get away with stringification.  It seems like the
> > consensus is to always output UTC format.  I will point out that
> > pyarrow does not do that today.  Currently in pyarrow I get
> >
> > >>> pa.array([datetime.datetime.now()])
> > <pyarrow.lib.TimestampArray object at 0x7f8ae865d520>
> > [
> >   2021-06-14 17:30:52.260044  # Local time
> > ]

Pyarrow displays the tz-naive timestamp "as is" (as described in the
spec), so I think the above behaviour is correct. You created a naive
datetime representing your local time with datetime.datetime.now(),
and pyarrow preserves that information on the conversion, and displays
the data as is. It will give the same string representation as
printing the datetime.datetime object, and it will preserve the fields
of the datetime.datetime object:

>>> dt = datetime.datetime.now()
>>> dt
datetime.datetime(2021, 6, 15, 9, 18, 48, 108988)
>>> print(dt)
2021-06-15 09:18:48.108988

>>> arr = pa.array([dt])
>>> arr
<pyarrow.lib.TimestampArray object at 0x7ffa35459d60>
[
  2021-06-15 09:18:48.108988
]
>>> pc.hour(arr)
<pyarrow.lib.Int64Array object at 0x7ff9ecc7c340>
[
  9
]

> >
> > # Field extraction
> >
> > Here is a concrete example demonstrating the problems of field
> > extraction.  Consider a user that runs an experiment over several
> > weeks.  For each observation they record the unix timestamp (or maybe
> > they build up an instance of datetime objects created with
> > datetime.datetime.now()).  Then, using Arrow as a backend for
> > analysis, they create a histogram to show events by weekday.   If
> > Arrow is assuming UTC then the histogram is going to have the wrong
> > days of the week (unless the user happens to be in UTC).
> >
> > Simple queries like "Give me all events that happened on Tuesday" or
> > "Group rows by year" will not necessarily work on naive columns in the
> > way that a user expects (and yet these only require field extraction).

If Arrow doesn't assume UTC for timestamps without a timezone (as is
currently the case), then those queries will do what a user expects:
they return the field that matches the local time. See my "pc.hour(..)"
field extraction example (full example above):

>>> pc.hour(arr)
<pyarrow.lib.Int64Array object at 0x7ff9ecc7c340>
[
  9
]

This currently works on master (and `arr` is a timestamp without
timezone), and IMO gives the expected behaviour for the user.

> >
> > So, my particular resolution (what I am arguing for), is that arrow
> > libraries that perform field extraction should return an error when
> > presented with a timestamp that does not have a timezone.
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
In retrospect, I should probably go for a long run before I write an
email and not after.  I had some time to mull it over and I realize
now that I was wrong.  My vote now changes to +1 for Antoine.  Can
someone please verify my understanding of the two formats below?

---
Let's pretend two astronomers observe a meteoroid impact on the moon.
We are talking about two different ways they can record the time.  The
first method, universal time, is done by recording the seconds since
the epoch.  The second, wall clock time, is done by writing down the
time seen on a clock (and nearby calendar).

In both cases we do not know the full picture without the time zone
information.  If we have two universal times (but no time zones) we
can say whether the two astronomers witnessed the same event (assuming
the impact site is the same) but we can't say whether they saw it at the
same time of day (e.g. whether the two astronomers had both just
finished dinner).

If we have two wall clock times (but no time zones) we can say whether
the two astronomers witnessed the impact at the same time of day but
we can't say if they witnessed the same event.

Rather than store wall clock time as a string (which is inefficient),
Arrow stores wall clock time as the epoch timestamp at which a wall
clock in the UTC time zone would display the given time.  In other
words, converting datetime.datetime.now() to an Arrow timestamp does
NOT give the current UNIX epoch.  The value that is stored is
different for every time zone.  Or, to put it yet another way, the
output of the following program...

import pyarrow as pa
import datetime
pa.array([datetime.datetime.strptime('Jun 28 2018 7:40AM',
         '%b %d %Y %I:%M%p')]).cast(pa.int64()).to_pylist()[0]

...will be identical on every machine.  But the output of...

import pyarrow as pa
import datetime
pa.array([datetime.datetime.now()]).cast(pa.int64()).to_pylist()[0]

...will depend on the system time zone (ostensibly because the output
of datetime.datetime.now() depends on the system time zone).
---

So given my previous concrete example I said...

> For each observation they record the unix timestamp (or maybe
> they build up an instance of datetime objects created with
> datetime.datetime.now())

These two methods would actually yield different results.  If they
created a pa.array([ts1, ts2], type=pa.timestamp('s')) with unix
timestamps recorded at the time of the event then they would get the
wrong histogram.

If they created a pa.array([dt1, dt2], type=pa.timestamp('s')) with
datetime.datetime objects created with datetime.datetime.now at the
time of the event then they would get the correct histogram.

On Mon, Jun 14, 2021 at 5:34 PM Weston Pace <we...@gmail.com> wrote:
>
> I'm in no rush, so feel free to respond when you have time.
>
> > If the timezone field doesn't say how to display data to the user, and we
> > agree it doesn't describe how data is stored (since its very presence means
> > data is stored as UTC) ... well ... what *is* the meaning of the timezone
> > field?
>
> In retrospect, my comment (Arrow probably isn't used for the final
> formatting to the user) isn't relevant to the discussion and I agree
> with your original point.    Even if we aren't supporting "display" I
> do think these exact sorts of "formatting of time zone" tasks are
> useful for analysis.  So it does seem like something Arrow will need
> to consider.  For example, creating a histogram by day-of-week
> requires formatting a timestamp into a weekday which requires a time
> zone.   This is what the timezone field is used for.
>
> >  TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > field is not set. We have stated that we want systems to use
> > system-locale-independent choices for functions that act on this data
> > (like stringification or field extraction)
>
> This is indeed a rehash of an earlier discussion where I agreed with
> you but I think I understand the subtleties a bit more and now I
> disagree, particularly on field extraction.  Field extraction can be
> done on a naive "datetime" without assuming UTC which I think makes it
> safer for Python.  Field extraction cannot be done on a naive
> "timestamp" without assuming UTC.
>
> # Stringification
>
> I think we can get away with stringification.  It seems like the
> consensus is to always output UTC format.  I will point out that
> pyarrow does not do that today.  Currently in pyarrow I get
>
> >>> pa.array([datetime.datetime.now()])
> <pyarrow.lib.TimestampArray object at 0x7f8ae865d520>
> [
>   2021-06-14 17:30:52.260044  # Local time
> ]
>
> # Field extraction
>
> Here is a concrete example demonstrating the problems of field
> extraction.  Consider a user that runs an experiment over several
> weeks.  For each observation they record the unix timestamp (or maybe
> they build up an instance of datetime objects created with
> datetime.datetime.now()).  Then, using Arrow as a backend for
> analysis, they create a histogram to show events by weekday.   If
> Arrow is assuming UTC then the histogram is going to have the wrong
> days of the week (unless the user happens to be in UTC).
>
> Simple queries like "Give me all events that happened on Tuesday" or
> "Group rows by year" will not necessarily work on naive columns in the
> way that a user expects (and yet these only require field extraction).
>
> So, my particular resolution (what I am arguing for), is that arrow
> libraries that perform field extraction should return an error when
> presented with a timestamp that does not have a timezone.
>
> On Mon, Jun 14, 2021 at 4:45 PM Micah Kornfield <em...@gmail.com> wrote:
> >
> > >
> > > I will have to beg you all to give me some time to review all the
> > > information when I am not on vacation, but Arrow has in essence two
> > > timestamp data types
> >
> >
> > This is how I always interpreted timestamps with and without timezone (really two
> > distinct types).  I also thought we had covered this on the prior thread
> > with Julian, but I guess we never reached consensus.  I think if we had to
> > do it over again, perhaps different modelling would have made this clearer
> > (e.g. storing separate fields (year, month, day, etc.) for naive timestamps).
> >
> > If the timezone field doesn't say how to display data to the user, and we
> > > agree it doesn't describe how data is stored (since its very presence means
> > > data is stored as UTC) ... well ... what *is* the meaning of the timezone
> > > field?
> >
> >
> > I think it just so happens this is mostly a hold-over from Pandas and for
> > some reason not everyone in the community looked too closely at it.  A long
> > time ago there was a thread about possibly introducing a per-slot/cell
> > timezone-like type as well, but there hasn't seemed to be a lot of interest.
> >
> > (In my opinion, there shouldn't be a field at all.)
> >
> > This really isn't an option at this point due to compatibility guarantees;
> > at best we could discourage use.
> >
> > On Mon, Jun 14, 2021 at 1:38 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
> > > >
> > > > I think the world is headed towards using canonical UTC timestamps (some
> > > > period of time from the unix epoch at UTC) so that timestamps can be
> > > > interpreted as absolute times without requiring additional metadata. This
> > > > is not yet universal, but it seems to me to be where things are
> > > heading
> > > > (and what new systems do)
> > > >
> > > > This feels like the same basic transition the world did from "strings"
> > > > with encodings as metadata (which can be messed up) to "UTF-8"
> > > >
> > > > Thus, I prefer Antoine's interpretation that timestamp values are always
> > > > relative to UTC and the timezone metadata can be used to render them to
> > > > local times if desired
> > >
> > > I will have to beg you all to give me some time to review all the
> > > information when I am not on vacation, but Arrow has in essence two
> > > timestamp data types:
> > >
> > > TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > > field is not set. We have stated that we want systems to use
> > > system-locale-independent choices for functions that act on this data
> > > (like stringification or field extraction)
> > >
> > > TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
> > > UTC-normalized, and time zone changes are metadata only operations.
> > >
> > > Localization is the action that converts from the first type to the
> > > second type.
> > >
> > > It sounds to me like it is being proposed to eliminate the first of
> > > these two data types. I understand the principles that might motivate
> > > that, but I don't think that is something we can do at this time lest
> > > we lose the ability to have high-fidelity interoperability with other
> > > systems. A system that uses Arrow is certainly free to exclusively use
> > > TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
> > > time zone field is always set to UTC or another non-UTC time zone).
> > >
> > > > Andrew
> > > >
> > > > On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com>
> > > wrote:
> > > >
> > > > > > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> > > > > field
> > > > > > means, "how the data is *displayed*." The data isn't displayed as
> > > UTC.
> > > > >
> > > > > I don't think users will generally be using Arrow to format timestamps
> > > > > for display to the user.  However, if it is, the correct thing to do
> > > > > here would be to store and transport timestamps with timezone=UTC.
> > > > > Then, when it comes time to display, first convert to timezone=local
> > > > > and then convert to string.
> > > > >
> > > > > > If you parse a timestamp string, then you can extract all of the
> > > fields
> > > > > > (including hour and day) from the resulting int64 values and they
> > > will be
> > > > > > the same as they appeared in the strings. Many users never need to
> > > worry
> > > > > > about time zones in their analyses.
> > > > >
> > > > > If Arrow required a timezone (didn't allow naive timestamps) then
> > > > > users that truly don't care about the timezone could simply specify
> > > > > UTC or any other timezone and carry on with their analysis.
> > > > >
> > > > > Personally, I think the safest option would be to only allow
> > > > > timestamps to be stored with a timezone.   I'd agree with Antoine's
> > > > > earlier point and say that the timezone should always be supplied at
> > > > > the boundary.  However, it may be too late for that.
> > > > >
> > > > > Given that we have "timestamps without timezone" it seems to me the
> > > > > safest thing is to consider them as naive and fail any function that
> > > > > required the time zone.  So +1 for Joris' interpretation.  Yes, this
> > > > > renders them useless for any purpose other than pass-through.  If a
> > > > > user truly wants to do something with them then it seems the burden
> > > > > should be on the user to supply a timezone and not for Arrow to infer
> > > > > anything.
> > > > >
> > > > > -Weston
> > > > >
> > > > >
> > > > > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > > > > <jo...@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org>
> > > wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > Joris' interpretation is that timestamp *values* are expressed in
> > > an
> > > > > > > arbitrary "local time" that is unknown and unspecified. It is
> > > therefore
> > > > > > > difficult to exactly interpret them, since the timezone
> > > information is
> > > > > > > unavailable.
> > > > > > >
> > > > > > > (I'll let Joris express his thoughts more accurately, but the gist
> > > of
> > > > > > > his opinion is that "can be thought of as UTC" is only an
> > > indication,
> > > > > > > not a prescription)
> > > > > >
> > > > > > That's indeed correct. One clarification: you can interpret them as
> > > > > > is, and for many applications this is fine. It's only when you want
> > > to
> > > > > > interpret them as an absolute point in time that the user needs to
> > > > > > supply a timezone to interpret them.
> > > > > >
> > > > > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > > > > maintainer, I of course have a similar perspective on this looking at
> > > > > > this from the pandas implementation he wrote).
> > > > > >
> > > > > > An additional source that explains the "local semantics" of naive
> > > > > > timestamps well IMO, and especially explains the "can be thought of
> > > as
> > > > > > UTC without being UTC" aspect, is the parquet format docs:
> > > > > >
> > > > >
> > > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > > > > (it's of course about Parquet and not Arrow, but the explanation is
> > > > > > relevant for the Arrow spec as well).
> > > > > >
> > > > > > Joris
> > > > >
> > >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
I'm in no rush, so feel free to respond when you have time.

> If the timezone field doesn't say how to display data to the user, and we
> agree it doesn't describe how data is stored (since its very presence means
> data is stored as UTC) ... well ... what *is* the meaning of the timezone
> field?

In retrospect, my comment (Arrow probably isn't used for the final
formatting to the user) isn't relevant to the discussion and I agree
with your original point.    Even if we aren't supporting "display" I
do think these exact sorts of "formatting of time zone" tasks are
useful for analysis.  So it does seem like something Arrow will need
to consider.  For example, creating a histogram by day-of-week
requires formatting a timestamp into a weekday which requires a time
zone.   This is what the timezone field is used for.

>  TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> field is not set. We have stated that we want systems to use
> system-locale-independent choices for functions that act on this data
> (like stringification or field extraction)

This is indeed a rehash of an earlier discussion where I agreed with
you but I think I understand the subtleties a bit more and now I
disagree, particularly on field extraction.  Field extraction can be
done on a naive "datetime" without assuming UTC which I think makes it
safer for Python.  Field extraction cannot be done on a naive
"timestamp" without assuming UTC.

# Stringification

I think we can get away with stringification.  It seems like the
consensus is to always output UTC format.  I will point out that
pyarrow does not do that today.  Currently in pyarrow I get

>>> pa.array([datetime.datetime.now()])
<pyarrow.lib.TimestampArray object at 0x7f8ae865d520>
[
  2021-06-14 17:30:52.260044  # Local time
]

# Field extraction

Here is a concrete example demonstrating the problems of field
extraction.  Consider a user that runs an experiment over several
weeks.  For each observation they record the unix timestamp (or maybe
they build up an instance of datetime objects created with
datetime.datetime.now()).  Then, using Arrow as a backend for
analysis, they create a histogram to show events by weekday.   If
Arrow is assuming UTC then the histogram is going to have the wrong
days of the week (unless the user happens to be in UTC).

Simple queries like "Give me all events that happened on Tuesday" or
"Group rows by year" will not necessarily work on naive columns in the
way that a user expects (and yet these only require field extraction).

So, my particular resolution (what I am arguing for), is that arrow
libraries that perform field extraction should return an error when
presented with a timestamp that does not have a timezone.

On Mon, Jun 14, 2021 at 4:45 PM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > I will have to beg you all to give me some time to review all the
> > information when I am not on vacation, but Arrow has in essence two
> > timestamp data types
>
>
> This is how I always interpreted timestamps with and without timezone (really two
> distinct types).  I also thought we had covered this on the prior thread
> with Julian, but I guess we never reached consensus.  I think if we had to
> do it over again, perhaps different modelling would have made this clearer
> (e.g. storing separate fields (year, month, day, etc.) for naive timestamps).
>
> If the timezone field doesn't say how to display data to the user, and we
> > agree it doesn't describe how data is stored (since its very presence means
> > data is stored as UTC) ... well ... what *is* the meaning of the timezone
> > field?
>
>
> I think it just so happens this is mostly a hold-over from Pandas and for
> some reason not everyone in the community looked too closely at it.  A long
> time ago there was a thread about possibly introducing a per-slot/cell
> timezone-like type as well, but there hasn't seemed to be a lot of interest.
>
> (In my opinion, there shouldn't be a field at all.)
>
> This really isn't an option at this point due to compatibility guarantees;
> at best we could discourage use.
>
> On Mon, Jun 14, 2021 at 1:38 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > I think the world is headed towards using canonical UTC timestamps (some
> > > period of time from the unix epoch at UTC) so that timestamps can be
> > > interpreted as absolute times without requiring additional metadata. This
> > > is not yet universal, but it seems to me to be where things are
> > heading
> > > (and what new systems do)
> > >
> > > This feels like the same basic transition the world did from "strings"
> > > with encodings as metadata (which can be messed up) to "UTF-8"
> > >
> > > Thus, I prefer Antoine's interpretation that timestamp values are always
> > > relative to UTC and the timezone metadata can be used to render them to
> > > local times if desired
> >
> > I will have to beg you all to give me some time to review all the
> > information when I am not on vacation, but Arrow has in essence two
> > timestamp data types:
> >
> > TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> > field is not set. We have stated that we want systems to use
> > system-locale-independent choices for functions that act on this data
> > (like stringification or field extraction)
> >
> > TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
> > UTC-normalized, and time zone changes are metadata only operations.
> >
> > Localization is the action that converts from the first type to the
> > second type.
> >
> > It sounds to me like it is being proposed to eliminate the first of
> > these two data types. I understand the principles that might motivate
> > that, but I don't think that is something we can do at this time lest
> > we lose the ability to have high-fidelity interoperability with other
> > systems. A system that uses Arrow is certainly free to exclusively use
> > TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
> > time zone field is always set to UTC or another non-UTC time zone).
> >
> > > Andrew
> > >
> > > On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com>
> > wrote:
> > >
> > > > > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> > > > field
> > > > > means, "how the data is *displayed*." The data isn't displayed as
> > UTC.
> > > >
> > > > I don't think users will generally be using Arrow to format timestamps
> > > > for display to the user.  However, if it is, the correct thing to do
> > > > here would be to store and transport timestamps with timezone=UTC.
> > > > Then, when it comes time to display, first convert to timezone=local
> > > > and then convert to string.
> > > >
> > > > > If you parse a timestamp string, then you can extract all of the
> > fields
> > > > > (including hour and day) from the resulting int64 values and they
> > will be
> > > > > the same as they appeared in the strings. Many users never need to
> > worry
> > > > > about time zones in their analyses.
> > > >
> > > > If Arrow required a timezone (didn't allow naive timestamps) then
> > > > users that truly don't care about the timezone could simply specify
> > > > UTC or any other timezone and carry on with their analysis.
> > > >
> > > > Personally, I think the safest option would be to only allow
> > > > timestamps to be stored with a timezone.   I'd agree with Antoine's
> > > > earlier point and say that the timezone should always be supplied at
> > > > the boundary.  However, it may be too late for that.
> > > >
> > > > Given that we have "timestamps without timezone" it seems to me the
> > > > safest thing is to consider them as naive and fail any function that
> > > > required the time zone.  So +1 for Joris' interpretation.  Yes, this
> > > > renders them useless for any purpose other than pass-through.  If a
> > > > user truly wants to do something with them then it seems the burden
> > > > should be on the user to supply a timezone and not for Arrow to infer
> > > > anything.
> > > >
> > > > -Weston
> > > >
> > > >
> > > > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > > > <jo...@gmail.com> wrote:
> > > > >
> > > > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org>
> > wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > Joris' interpretation is that timestamp *values* are expressed in
> > an
> > > > > > arbitrary "local time" that is unknown and unspecified. It is
> > therefore
> > > > > > difficult to exactly interpret them, since the timezone
> > information is
> > > > > > unavailable.
> > > > > >
> > > > > > (I'll let Joris express his thoughts more accurately, but the gist
> > of
> > > > > > his opinion is that "can be thought of as UTC" is only an
> > indication,
> > > > > > not a prescription)
> > > > >
> > > > > That's indeed correct. One clarification: you can interpret them as
> > > > > is, and for many applications this is fine. It's only when you want
> > to
> > > > > interpret them as an absolute point in time that the user needs to
> > > > > supply a timezone to interpret them.
> > > > >
> > > > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > > > maintainer, I of course have a similar perspective on this looking at
> > > > > this from the pandas implementation he wrote).
> > > > >
> > > > > An additional source that explains the "local semantics" of naive
> > > > > timestamps well IMO, and especially explains the "can be thought of
> > as
> > > > > UTC without being UTC" aspect, is the parquet format docs:
> > > > >
> > > >
> > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > > > (it's of course about Parquet and not Arrow, but the explanation is
> > > > > relevant for the Arrow spec as well).
> > > > >
> > > > > Joris
> > > >
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Micah Kornfield <em...@gmail.com>.
>
> I will have to beg you all to give me some time to review all the
> information when I am not on vacation, but Arrow has in essence two
> timestamp data types


This is how I always interpreted timestamps with and without timezone (really two
distinct types).  I also thought we had covered this on the prior thread
with Julian, but I guess we never reached consensus.  I think if we had to
do it over again, perhaps different modelling would have made this clearer
(e.g. storing separate fields (year, month, day, etc.) for naive timestamps).

If the timezone field doesn't say how to display data to the user, and we
> agree it doesn't describe how data is stored (since its very presence means
> data is stored as UTC) ... well ... what *is* the meaning of the timezone
> field?


I think it just so happens this is mostly a hold-over from Pandas and for
some reason not everyone in the community looked too closely at it.  A long
time ago there was a thread about possibly introducing a per-slot/cell
timezone-like type as well, but there hasn't seemed to be a lot of interest.

(In my opinion, there shouldn't be a field at all.)

This really isn't an option at this point due to compatibility guarantees;
at best we could discourage use.

On Mon, Jun 14, 2021 at 1:38 PM Wes McKinney <we...@gmail.com> wrote:

> On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I think the world is headed towards using canonical UTC timestamps (some
> > period of time from the unix epoch at UTC) so that timestamps can be
> > interpreted as absolute times without requiring additional metadata. This
> > is not yet universal, but it seems to me to be where things are
> heading
> > (and what new systems do)
> >
> > This feels like the same basic transition the world did from "strings"
> > with encodings as metadata (which can be messed up) to "UTF-8"
> >
> > Thus, I prefer Antoine's interpretation that timestamp values are always
> > relative to UTC and the timezone metadata can be used to render them to
> > local times if desired
>
> I will have to beg you all to give me some time to review all the
> information when I am not on vacation, but Arrow has in essence two
> timestamp data types:
>
> TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
> field is not set. We have stated that we want systems to use
> system-locale-independent choices for functions that act on this data
> (like stringification or field extraction)
>
> TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
> UTC-normalized, and time zone changes are metadata only operations.
>
> Localization is the action that converts from the first type to the
> second type.
>
> It sounds to me like it is being proposed to eliminate the first of
> these two data types. I understand the principles that might motivate
> that, but I don't think that is something we can do at this time lest
> we lose the ability to have high-fidelity interoperability with other
> systems. A system that uses Arrow is certainly free to exclusively use
> TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
> time zone field is always set to UTC or another non-UTC time zone).
>
> > Andrew
> >
> > On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com>
> wrote:
> >
> > > > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> > > field
> > > > means, "how the data is *displayed*." The data isn't displayed as
> UTC.
> > >
> > > I don't think users will generally be using Arrow to format timestamps
> > > for display to the user.  However, if it is, the correct thing to do
> > > here would be to store and transport timestamps with timezone=UTC.
> > > Then, when it comes time to display, first convert to timezone=local
> > > and then convert to string.
> > >
> > > > If you parse a timestamp string, then you can extract all of the
> fields
> > > > (including hour and day) from the resulting int64 values and they
> will be
> > > > the same as they appeared in the strings. Many users never need to
> worry
> > > > about time zones in their analyses.
> > >
> > > If Arrow required a timezone (didn't allow naive timestamps) then
> > > users that truly don't care about the timezone could simply specify
> > > UTC or any other timezone and carry on with their analysis.
> > >
> > > Personally, I think the safest option would be to only allow
> > > timestamps to be stored with a timezone.   I'd agree with Antoine's
> > > earlier point and say that the timezone should always be supplied at
> > > the boundary.  However, it may be too late for that.
> > >
> > > Given that we have "timestamps without timezone" it seems to me the
> > > safest thing is to consider them as naive and fail any function that
> > > required the time zone.  So +1 for Joris' interpretation.  Yes, this
> > > renders them useless for any purpose other than pass-through.  If a
> > > user truly wants to do something with them then it seems the burden
> > > should be on the user to supply a timezone and not for Arrow to infer
> > > anything.
> > >
> > > -Weston
> > >
> > >
> > > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > > <jo...@gmail.com> wrote:
> > > >
> > > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org>
> wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > Joris' interpretation is that timestamp *values* are expressed in
> an
> > > > > arbitrary "local time" that is unknown and unspecified. It is
> therefore
> > > > > difficult to exactly interpret them, since the timezone
> information is
> > > > > unavailable.
> > > > >
> > > > > (I'll let Joris express his thoughts more accurately, but the gist
> of
> > > > > his opinion is that "can be thought of as UTC" is only an
> indication,
> > > > > not a prescription)
> > > >
> > > > That's indeed correct. One clarification: you can interpret them as
> > > > is, and for many applications this is fine. It's only when you want
> to
> > > > interpret them as an absolute point in time that the user needs to
> > > > supply a timezone to interpret them.
> > > >
> > > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > > maintainer, I of course have a similar perspective on this looking at
> > > > this from the pandas implementation he wrote).
> > > >
> > > > An additional source that explains the "local semantics" of naive
> > > > timestamps well IMO, and especially explains the "can be thought of
> as
> > > > UTC without being UTC" aspect, is the parquet format docs:
> > > >
> > >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > > (it's of course about Parquet and not Arrow, but the explanation is
> > > > relevant for the Arrow spec as well).
> > > >
> > > > Joris
> > >
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 14, 2021 at 2:47 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> I think the world is headed towards using canonical UTC timestamps (some
> period of time from the unix epoch at UTC) so that timestamps can be
> interpreted as absolute times without requiring additional metadata. This
> is not yet universal, but it seems to me to be where things are heading
> (and what new systems do)
>
> This feels like the same basic transition the world did from "strings" with
> encodings as metadata (which can be messed up) to "UTF-8"
>
> Thus, I prefer Antoine's interpretation that timestamp values are always
> relative to UTC and the timezone metadata can be used to render them to
> local times if desired

I will have to beg you all to give me some time to review all the
information when I am not on vacation, but Arrow has in essence two
timestamp data types:

TIMESTAMP WITHOUT TIME ZONE: this is the case where the time zone
field is not set. We have stated that we want systems to use
system-locale-independent choices for functions that act on this data
(like stringification or field extraction)

TIMESTAMP WITH TIME ZONE: the time zone field is set. The storage is
UTC-normalized, and time zone changes are metadata only operations.

Localization is the action that converts from the first type to the
second.

It sounds to me like it is being proposed to eliminate the first of
these two data types. I understand the principles that might motivate
that, but I don't think that is something we can do at this time lest
we lose the ability to have high-fidelity interoperability with other
systems. A system that uses Arrow is certainly free to exclusively use
TIMESTAMP WITH TIME ZONE in its implementation (and ensure that the
time zone field is always set to UTC or another non-UTC time zone).

> Andrew
>
> On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com> wrote:
>
> > > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> > field
> > > means, "how the data is *displayed*." The data isn't displayed as UTC.
> >
> > I don't think users will generally be using Arrow to format timestamps
> > for display to the user.  However, if it is, the correct thing to do
> > here would be to store and transport timestamps with timezone=UTC.
> > Then, when it comes time to display, first convert to timezone=local
> > and then convert to string.
> >
> > > If you parse a timestamp string, then you can extract all of the fields
> > > (including hour and day) from the resulting int64 values and they will be
> > > the same as they appeared in the strings. Many users never need to worry
> > about time zones in their analyses.
> >
> > If Arrow required a timezone (didn't allow naive timestamps) then
> > users that truly don't care about the timezone could simply specify
> > UTC or any other timezone and carry on with their analysis.
> >
> > Personally, I think the safest option would be to only allow
> > timestamps to be stored with a timezone.   I'd agree with Antoine's
> > earlier point and say that the timezone should always be supplied at
> > the boundary.  However, it may be too late for that.
> >
> > Given that we have "timestamps without timezone" it seems to me the
> > safest thing is to consider them as naive and fail any function that
> > required the time zone.  So +1 for Joris' interpretation.  Yes, this
> > renders them useless for any purpose other than pass-through.  If a
> > user truly wants to do something with them then it seems the burden
> > should be on the user to supply a timezone and not for Arrow to infer
> > anything.
> >
> > -Weston
> >
> >
> > On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> > <jo...@gmail.com> wrote:
> > >
> > > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org> wrote:
> > > >
> > > > ...
> > > >
> > > > Joris' interpretation is that timestamp *values* are expressed in an
> > > > arbitrary "local time" that is unknown and unspecified. It is therefore
> > > > difficult to exactly interpret them, since the timezone information is
> > > > unavailable.
> > > >
> > > > (I'll let Joris express his thoughts more accurately, but the gist of
> > > > his opinion is that "can be thought of as UTC" is only an indication,
> > > > not a prescription)
> > >
> > > That's indeed correct. One clarification: you can interpret them as
> > > is, and for many applications this is fine. It's only when you want to
> > > interpret them as an absolute point in time that the user needs to
> > > supply a timezone to interpret them.
> > >
> > > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > > maintainer, I of course have a similar perspective on this looking at
> > > this from the pandas implementation he wrote).
> > >
> > > An additional source that explains the "local semantics" of naive
> > > timestamps well IMO, and especially explains the "can be thought of as
> > > UTC without being UTC" aspect, is the parquet format docs:
> > >
> > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > > (it's of course about Parquet and not Arrow, but the explanation is
> > > relevant for the Arrow spec as well).
> > >
> > > Joris
> >

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Andrew Lamb <al...@influxdata.com>.
I think the world is headed towards using canonical UTC timestamps (some
period of time from the unix epoch at UTC) so that timestamps can be
interpreted as absolute times without requiring additional metadata. This
is not universal yet, but it seems to me to be where things are heading
(and what new systems do)

This feels like the same basic transition the world did from "strings" with
encodings as metadata which can be messed up to "UTF-8"

Thus, I prefer Antoine's interpretation that timestamp values are always
relative to UTC and the timezone metadata can be used to render them to
local times if desired

Andrew

On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com> wrote:

> > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> field
> > means, "how the data is *displayed*." The data isn't displayed as UTC.
>
> I don't think users will generally be using Arrow to format timestamps
> for display to the user.  However, if it is, the correct thing to do
> here would be to store and transport timestamps with timezone=UTC.
> Then, when it comes time to display, first convert to timezone=local
> and then convert to string.
>
> > If you parse a timestamp string, then you can extract all of the fields
> > (including hour and day) from the resulting int64 values and they will be
> > the same as they appeared in the strings. Many users never need to worry
> > about time zones in their analyses.
>
> If Arrow required a timezone (didn't allow naive timestamps) then
> users that truly don't care about the timezone could simply specify
> UTC or any other timezone and carry on with their analysis.
>
> Personally, I think the safest option would be to only allow
> timestamps to be stored with a timezone.   I'd agree with Antoine's
> earlier point and say that the timezone should always be supplied at
> the boundary.  However, it may be too late for that.
>
> Given that we have "timestamps without timezone" it seems to me the
> safest thing is to consider them as naive and fail any function that
> required the time zone.  So +1 for Joris' interpretation.  Yes, this
> renders them useless for any purpose other than pass-through.  If a
> user truly wants to do something with them then it seems the burden
> should be on the user to supply a timezone and not for Arrow to infer
> anything.
>
> -Weston
>
>
> On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> > On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org> wrote:
> > >
> > > ...
> > >
> > > Joris' interpretation is that timestamp *values* are expressed in an
> > > arbitrary "local time" that is unknown and unspecified. It is therefore
> > > difficult to exactly interpret them, since the timezone information is
> > > unavailable.
> > >
> > > (I'll let Joris express his thoughts more accurately, but the gist of
> > > his opinion is that "can be thought of as UTC" is only an indication,
> > > not a prescription)
> >
> > That's indeed correct. One clarification: you can interpret them as
> > is, and for many applications this is fine. It's only when you want to
> > interpret them as an absolute point in time that the user needs to
> > supply a timezone to interpret them.
> >
> > For the rest, Wes' responses already cover my viewpoint (as a pandas
> > maintainer, I of course have a similar perspective on this looking at
> > this from the pandas implementation he wrote).
> >
> > An additional source that explains the "local semantics" of naive
> > timestamps well IMO, and especially explains the "can be thought of as
> > UTC without being UTC" aspect, is the parquet format docs:
> >
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> > (it's of course about Parquet and not Arrow, but the explanation is
> > relevant for the Arrow spec as well).
> >
> > Joris
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Mon, 14 Jun 2021 at 21:57, Adam Hooper <ad...@adamhooper.com> wrote:
>
> On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com> wrote:
>
> > > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> > field
> > > means, "how the data is *displayed*." The data isn't displayed as UTC.
> >
> > I don't think users will generally be using Arrow to format timestamps
> > for display to the user.
>
>
> Well, I didn't want to open *this* can of worms, but:
>
> If the timezone field doesn't say how to display data to the user, and we
> agree it doesn't describe how data is stored (since its very presence means
> data is stored as UTC) ... well ... what *is* the meaning of the timezone
> field?
>
> (In my opinion, there shouldn't be a field at all.)
>

How data is eventually displayed to the user, is up to the
*application* I think. Yes, the spec speaks about how the data is
displayed, but for me this is only about raw Arrow data and to help
explain the interpretation. IMO it's totally fine that an application
decides to display UTC timestamps in the user's locale time zone.

So if you have an application where you want to display the timestamp
in the user's system time zone, then you can use "timestamp with time
zone" with tz=UTC to store all timestamps as UTC, and display them
using the system locale.

But this timezone field is still useful for other applications that
*don't* want to display the timestamps in the user's system time zone.
One example of such an application is the pandas library, where we
want to enable that users can store timezone-aware timestamps (so
unambiguous points on the time-line, in contrast to tz-naive
timestamps without timezone), while still being able to express the
"local" time zone of the data / study (and store this in the data), so
that queries like "Give me all events that happened on Tuesday"
(example from Weston's answer) or "all events that happened between 8
and 9am" still work (and ensuring that such queries give the same
result wherever on the world you run the script analysing the
dataset).
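A minimal pandas sketch of that kind of query (the zone and event times
are made up for illustration): storing absolute instants with the study's
time zone attached keeps wall-clock predicates portable:

```python
import pandas as pd

# Absolute instants, with the study's "local" zone kept in the data.
idx = pd.to_datetime(
    ["2021-06-14 08:30", "2021-06-14 10:15"]
).tz_localize("Europe/Brussels")
events = pd.Series(["a", "b"], index=idx)

# "All events between 8 and 9am" is answered in the stored zone, so the
# result is identical wherever in the world the script runs.
morning = events.between_time("08:00", "09:00")
print(list(morning))  # ['a']
```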

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
On Mon, Jun 14, 2021 at 3:25 PM Weston Pace <we...@gmail.com> wrote:

> > So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone"
> field
> > means, "how the data is *displayed*." The data isn't displayed as UTC.
>
> I don't think users will generally be using Arrow to format timestamps
> for display to the user.


Well, I didn't want to open *this* can of worms, but:

If the timezone field doesn't say how to display data to the user, and we
agree it doesn't describe how data is stored (since its very presence means
data is stored as UTC) ... well ... what *is* the meaning of the timezone
field?

(In my opinion, there shouldn't be a field at all.)

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Weston Pace <we...@gmail.com>.
> So it's wrong to put "timezone=UTC", because in Arrow, the 'timezone" field
> means, "how the data is *displayed*." The data isn't displayed as UTC.

I don't think users will generally be using Arrow to format timestamps
for display to the user.  However, if it is, the correct thing to do
here would be to store and transport timestamps with timezone=UTC.
Then, when it comes time to display, first convert to timezone=local
and then convert to string.
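In pandas terms, that workflow might look like the following (the zone
name and timestamp are illustrative):

```python
import pandas as pd

# Store and transport the timestamp as UTC...
stored = pd.Timestamp("2021-06-14 18:00", tz="UTC")

# ...and only at display time convert to the viewer's zone, then stringify.
local = stored.tz_convert("America/New_York")
print(local.strftime("%Y-%m-%d %H:%M"))  # 2021-06-14 14:00
```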

> If you parse a timestamp string, then you can extract all of the fields
> (including hour and day) from the resulting int64 values and they will be
> the same as they appeared in the strings. Many users never need to worry
> about time zones in their analyses.

If Arrow required a timezone (didn't allow naive timestamps) then
users that truly don't care about the timezone could simply specify
UTC or any other timezone and carry on with their analysis.

Personally, I think the safest option would be to only allow
timestamps to be stored with a timezone.   I'd agree with Antoine's
earlier point and say that the timezone should always be supplied at
the boundary.  However, it may be too late for that.

Given that we have "timestamps without timezone" it seems to me the
safest thing is to consider them as naive and fail any function that
required the time zone.  So +1 for Joris' interpretation.  Yes, this
renders them useless for any purpose other than pass-through.  If a
user truly wants to do something with them then it seems the burden
should be on the user to supply a timezone and not for Arrow to infer
anything.

-Weston


On Mon, Jun 14, 2021 at 9:12 AM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org> wrote:
> >
> > ...
> >
> > Joris' interpretation is that timestamp *values* are expressed in an
> > arbitrary "local time" that is unknown and unspecified. It is therefore
> > difficult to exactly interpret them, since the timezone information is
> > unavailable.
> >
> > (I'll let Joris express his thoughts more accurately, but the gist of
> > his opinion is that "can be thought of as UTC" is only an indication,
> > not a prescription)
>
> That's indeed correct. One clarification: you can interpret them as
> is, and for many applications this is fine. It's only when you want to
> interpret them as an absolute point in time that the user needs to
> supply a timezone to interpret them.
>
> For the rest, Wes' responses already cover my viewpoint (as a pandas
> maintainer, I of course have a similar perspective on this looking at
> this from the pandas implementation he wrote).
>
> An additional source that explains the "local semantics" of naive
> timestamps well IMO, and especially explains the "can be thought of as
> UTC without being UTC" aspect, is the parquet format docs:
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
> (it's of course about Parquet and not Arrow, but the explanation is
> relevant for the Arrow spec as well).
>
> Joris

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Mon, 14 Jun 2021 at 17:57, Antoine Pitrou <an...@python.org> wrote:
>
> ...
>
> Joris' interpretation is that timestamp *values* are expressed in an
> arbitrary "local time" that is unknown and unspecified. It is therefore
> difficult to exactly interpret them, since the timezone information is
> unavailable.
>
> (I'll let Joris express his thoughts more accurately, but the gist of
> his opinion is that "can be thought of as UTC" is only an indication,
> not a prescription)

That's indeed correct. One clarification: you can interpret them as
is, and for many applications this is fine. It's only when you want to
interpret them as an absolute point in time that the user needs to
supply a timezone to interpret them.

For the rest, Wes' responses already cover my viewpoint (as a pandas
maintainer, I of course have a similar perspective on this looking at
this from the pandas implementation he wrote).

An additional source that explains the "local semantics" of naive
timestamps well IMO, and especially explains the "can be thought of as
UTC without being UTC" aspect, is the parquet format docs:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
(it's of course about Parquet and not Arrow, but the explanation is
relevant for the Arrow spec as well).

Joris

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Adam Hooper <ad...@adamhooper.com>.
On Mon, Jun 14, 2021 at 12:29 PM Wes McKinney <we...@gmail.com> wrote:

> Hi Antoine — when there is no time zone specified, I do not think it is
> appropriate to consider the data to refer to a specific moment in time
> without applying an explicit time zone localization
>

I have a case for treating "naive" as "UTC": in my field (web development),
"naive" is the only spec-compliant way to encode timestamps.

As a canonical example, consider a web-traffic log-analysis application.
Web servers record all events with UTC timestamps. But Grafana (or
whatever) won't *display* in UTC: it'll display the timezone of the viewer
looking at the page.

So it's wrong to put "timezone=UTC", because in Arrow, the "timezone" field
means, "how the data is *displayed*." The data isn't displayed as UTC.

And if "naive = unknown moment in time", then it's also wrong to store
these log timestamps as timezone-naive: the moments in time are well-known.

Summed up: as a web developer, *I omit Arrow's "timezone" metadata for my
UTC timestamps because different users and components of my system use
different time zones*. Arrow's "timezone" metadata is not the place for me.
I need Arrow to provide a timezone-agnostic UTC timestamp -- like MySQL's
TIMESTAMP.

When localizing data (adding a time zone when there was none previously), I
> do not think we can assume that the data is already localized to UTC. I
> provided a gist showing the behavior of the pandas tz_localize function —
> the int64 values must each be shifted by the UTC offset at that moment.
> That’s what I think we have to do in this project.


The question is: what should we call an int64-encoded datetime? "timestamp"
or "int64"?

I think "int64", because *most programming languages and libraries agree
timestamps are UTC*. They use *different*, struct-based types for datetimes.

Languages Arrow cares about:

* C and C-built languages like Python, Ruby, etc. store 64-bit integer
time_t values as UTC. They have struct tm or similar tuples for date+time
(+ sometimes timezone).
* Go and Rust timestamps are UTC integers; datetimes are structs.
* Java Instant stores UTC integer. Its LocalDateTime is a struct holding
LocalDate + LocalTime fields.
* Julia I don't know
* JavaScript Date stores UTC integer; its best approximation of datetime is
ISO8601-formatted strings.
* Numpy is the black sheep: its datetime64 is *always* int64-encoded
datetime, *never* UTC timestamp. But Pandas Timestamp, built atop it, is
always a UTC timestamp.

It's all but unanimous: in every Arrow-supported language, a developer who
happens upon an int64 time-related value can assume it's UTC. Numpy stands
out as the lone exception.

I think it would be confusing for Arrow "timestamp" columns to allow
int64-encoded datetime because that pattern is a wide deviation from the
norm. I believe more Arrow users would get more done more quickly if
int64-encoded datetimes were documented as a pattern one can use with
"int64" columns, not a pattern one can use with "timestamp" columns.


> If you know that the
> data is UTC, then the correct action is to call tz_localize(‘UTC’) and then
> tz_convert(tz) where tz is the intended time zone (which is only a
> modification to the type metadata). My interpretation is certainly colored
> by the experience of designing this functionality in pandas, but after 10
> years of observing real world use this model seems to work well and not
> trip people up too much.
>

*I'll raise my hand here*: I got tripped up last week. I interpreted
tz_localize() to do literally the opposite of what it does, after reading
and re-reading the doc, and posted a misleading comment in a JIRA ticket.
I've been using Pandas and training users for five years.

That isn't the worst: the prior five years, I misinterpreted
Postgres's TIMESTAMP
WITHOUT TIME ZONE to be UTC. Again, I was the opposite of correct. (TIMESTAMP
WITHOUT TIME ZONE is an int64-encoded datetime; the way to store UTC
timestamps is TIMESTAMP WITH TIME ZONE, which doesn't store a time zone.)

I'm a smart person. I keep making these embarrassing -- and costly --
mistakes.

I've never been tripped up by java.time.Instant. It's no wonder Java
embraced it.

I hope Arrow empowers its community to make tools that make me feel
not-stupid.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Rok Mihevc <ro...@gmail.com>.
Is there a benefit to having timezone naive moments stored as timestamps
instead of int64?

On Mon, Jun 14, 2021 at 6:33 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 14/06/2021 à 18:28, Wes McKinney a écrit :
> > Hi Antoine — when there is no time zone specified, I do not think it is
> > appropriate to consider the data to refer to a specific moment in time
> > without applying an explicit time zone localization.
>
> Well, how can that be done? The timezone information is lost, how can
> the user (who possibly got the data from another source) recover it?
>
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 14, 2021 at 11:50 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 14/06/2021 à 18:47, Wes McKinney a écrit :
> > On Mon, Jun 14, 2021 at 11:33 AM Antoine Pitrou <an...@python.org>
> wrote:
> >
> >>
> >> Le 14/06/2021 à 18:28, Wes McKinney a écrit :
> >>> Hi Antoine — when there is no time zone specified, I do not think it is
> >>> appropriate to consider the data to refer to a specific moment in time
> >>> without applying an explicit time zone localization.
> >>
> >> Well, how can that be done? The timezone information is lost, how can
> >> the user (who possibly got the data from another source) recover it?
> >
> >
> >> This is usually something that people take care of in their application
> > code. For example, when you parse a CSV and obtain “raw” timestamps, you
> have to call “tz_localize” to apply a time zone to the data and normalize the
> > internal representation to UTC.
>
> Right, this is why I advocate for this to be done at the boundary layer.
>   I.e, the CSV, Parquet... readers would expose an option to set the
> timezone of timestamp columns to a well-defined value.
>

In practice I think this would be impractical. This is something that users
expect to be able to address in their data preparation as they currently do
with other tools and systems. To force this issue (versus having
auto-localization as an optional feature that you opt in to) at data ingest
time would be a nuisance and harm many kinds of users.


> > If you don’t know what the time zone is supposed to be then you can’t get
> > it back, but you can still do many analytical operations on the data
> > (aggregating by year or month, for example) just fine. For many users the
> > absence of time zones is a non-issue in their work.
>
> So, basically, a timestamp without a timezone is still useful as a date
> (mostly, because the day number may be off)?


If you parse a timestamp string, then you can extract all of the fields
(including hour and day) from the resulting int64 values and they will be
the same as they appeared in the strings. Many users never need to worry
about time zones in their analyses.
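A small pandas illustration of that round-trip (the timestamp string is
made up):

```python
import pandas as pd

# Parsing without a zone keeps the wall-clock fields exactly as written
# in the string.
t = pd.to_datetime("2021-06-14 09:30:00")
print(t.hour, t.day)  # 9 14
```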

I’ve exhausted my ability to discuss this topic on mobile internet so I
will pick up the discussion later in the week when I can provide
supplementary code examples.


>
> But then, why don't we tell users to simply use a date type for such data?
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Le 14/06/2021 à 18:47, Wes McKinney a écrit :
> On Mon, Jun 14, 2021 at 11:33 AM Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> Le 14/06/2021 à 18:28, Wes McKinney a écrit :
>>> Hi Antoine — when there is no time zone specified, I do not think it is
>>> appropriate to consider the data to refer to a specific moment in time
>>> without applying an explicit time zone localization.
>>
>> Well, how can that be done? The timezone information is lost, how can
>> the user (who possibly got the data from another source) recover it?
> 
> 
>> This is usually something that people take care of in their application
> code. For example, when you parse a CSV and obtain “raw” timestamps, you
> have to call “tz_localize” to apply a time zone to the data and normalize the
> internal representation to UTC.

Right, this is why I advocate for this to be done at the boundary layer. 
  I.e, the CSV, Parquet... readers would expose an option to set the 
timezone of timestamp columns to a well-defined value.

> If you don’t know what the time zone is supposed to be then you can’t get
> it back, but you can still do many analytical operations on the data
> (aggregating by year or month, for example) just fine. For many users the
> absence of time zones is a non-issue in their work.

So, basically, a timestamp without a timezone is still useful as a date 
(mostly, because the day number may be off)?

But then, why don't we tell users to simply use a date type for such data?

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 14, 2021 at 11:33 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 14/06/2021 à 18:28, Wes McKinney a écrit :
> > Hi Antoine — when there is no time zone specified, I do not think it is
> > appropriate to consider the data to refer to a specific moment in time
> > without applying an explicit time zone localization.
>
> Well, how can that be done? The timezone information is lost, how can
> the user (who possibly got the data from another source) recover it?


> This is usually something that people take care of in their application
code. For example, when you parse a CSV and obtain “raw” timestamps, you
have to call “tz_localize” to apply a time zone to the data and normalize the
internal representation to UTC.

If you don’t know what the time zone is supposed to be then you can’t get
it back, but you can still do many analytical operations on the data
(aggregating by year or month, for example) just fine. For many users the
absence of time zones is a non-issue in their work.

I’m on mobile internet the next couple days but I can send more code
examples later in the week.

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Le 14/06/2021 à 18:28, Wes McKinney a écrit :
> Hi Antoine — when there is no time zone specified, I do not think it is
> appropriate to consider the data to refer to a specific moment in time
> without applying an explicit time zone localization.

Well, how can that be done? The timezone information is lost, how can 
the user (who possibly got the data from another source) recover it?


Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Wes McKinney <we...@gmail.com>.
Hi Antoine — when there is no time zone specified, I do not think it is
appropriate to consider the data to refer to a specific moment in time
without applying an explicit time zone localization. So absent an explicit
UTC time zone, we can’t say that the data refers to instants in time from
the UTC perspective.

That said, absent a set time zone, there are two possible behaviors of
timestamp functions (like “extract hour”): localtime using the system
locale or UTC. I think we have already decided years ago that we would use
the latter interpretation when it comes to stringification or extracting
fields.

When localizing data (adding a time zone when there was none previously), I
do not think we can assume that the data is already localized to UTC. I
provided a gist showing the behavior of the pandas tz_localize function —
the int64 values must each be shifted by the UTC offset at that moment.
That’s what I think we have to do in this project. If you know that the
data is UTC, then the correct action is to call tz_localize(‘UTC’) and then
tz_convert(tz) where tz is the intended time zone (which is only a
modification to the type metadata). My interpretation is certainly colored
by the experience of designing this functionality in pandas, but after 10
years of observing real world use this model seems to work well and not
trip people up too much.
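The pandas behavior described here can be sketched as follows (times and
zones are illustrative):

```python
import pandas as pd

t = pd.Timestamp("2021-06-14 09:30")  # naive wall-clock value

# tz_localize interprets the wall clock in the given zone: the underlying
# instant shifts by that zone's UTC offset at that moment.
nyc = t.tz_localize("America/New_York")
print(nyc.tz_convert("UTC"))  # 2021-06-14 13:30:00+00:00

# If the data was UTC all along, localize to UTC first; a subsequent
# tz_convert only changes the display zone (metadata), not the instant.
utc = t.tz_localize("UTC")
print(utc.tz_convert("America/New_York"))  # 2021-06-14 05:30:00-04:00
```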

Wes

On Mon, Jun 14, 2021 at 11:01 AM Antoine Pitrou <an...@python.org> wrote:

>
> Also, as a secondary (but IMHO important) concern, if we choose the
> "always UTC" interpretation, we should stop using the "time zone naive"
> wording in the spec, because there is a high risk of confusion with
> Python's different "naive timestamp" concept:
>
> https://docs.python.org/3/library/datetime.html
>
> """A naive object does not contain enough information to unambiguously
> locate itself relative to other date/time objects. Whether a naive
> object represents Coordinated Universal Time (UTC), local time, or time
> in some other timezone is purely up to the program, just like it is up
> to the program whether a particular number represents metres, miles, or
> mass. Naive objects are easy to understand and to work with, at the cost
> of ignoring some aspects of reality."""
>
>
> Le 14/06/2021 à 17:57, Antoine Pitrou a écrit :
> >
> > Hello,
> >
> > In ARROW-13033, there was a disagreement as to how the specification
> > about timezone-less timestamps should be interpreted.
> >
> > Here is the wording in the Schema specification:
> >
> >>    /// * If the time zone is null or equal to an empty string, the data
> is "time
> >>    ///   zone naive" and shall be displayed *as is* to the user, not
> localized
> >>    ///   to the locale of the user. This data can be though of as UTC
> but
> >>    ///   without having "UTC" as the time zone, it is not considered to
> be
> >>    ///   localized to any time zone
> >
> > My interpretation is that timestamp *values* are always expressed in
> > UTC.  The timezone is an optional piece of metadata that describes the
> > context in which they were obtained, but do not impact how the *values*
> > should be interpreted.
> >
> > Joris' interpretation is that timestamp *values* are expressed in an
> > arbitrary "local time" that is unknown and unspecified. It is therefore
> > difficult to exactly interpret them, since the timezone information is
> > unavailable.
> >
> > (I'll let Joris express his thoughts more accurately, but the gist of
> > his opinion is that "can be thought of as UTC" is only an indication,
> > not a prescription)
> >
> >
> > To me, the problem with the "unknown local timezone" interpretation is
> > that it renders the data essentially ambiguous and useless.  The problem
> > is very similar to the problem of having string data without a
> > well-known encoding. This is well-known to Python users as the Python 2
> > encoding hell (to the point that it motivated the heavy and disruptive
> > Python 3 transition).
> >
> > (note the problem is even worse for timestamps. At least, you can with a
> > high degree of probability detect that an arbitrary binary string is
> > *not* UTF8-encoded. You cannot do so with timestamp values: any 64-bit
> > timestamp may or may not be a UTC timestamp. Once you have lost that
> > information, you cannot regain it anymore.)
> >
> > In any case, I think this must be clarified, first on this mailing-list,
> > then by making the spec wording stronger and more prescriptive.
> >
> > Regards
> >
> > Antoine.
> >
>

Re: [Format][Important] Needed clarification of timezone-less timestamps

Posted by Antoine Pitrou <an...@python.org>.
Also, as a secondary (but IMHO important) concern, if we choose the 
"always UTC" interpretation, we should stop using the "time zone naive" 
wording in the spec, because there is a high risk of confusion with 
Python's different "naive timestamp" concept:

https://docs.python.org/3/library/datetime.html

"""A naive object does not contain enough information to unambiguously 
locate itself relative to other date/time objects. Whether a naive 
object represents Coordinated Universal Time (UTC), local time, or time 
in some other timezone is purely up to the program, just like it is up 
to the program whether a particular number represents metres, miles, or 
mass. Naive objects are easy to understand and to work with, at the cost 
of ignoring some aspects of reality."""
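The distinction the Python docs draw can be shown in a few lines. (This
illustration is mine, not part of the spec or the Python docs; it uses
only the standard library.)

```python
from datetime import datetime, timezone

# A naive datetime has tzinfo == None: Python makes no claim about
# which zone (if any) its wall-clock fields refer to.
naive = datetime(2021, 6, 14, 15, 57)

# An aware datetime is pinned to a concrete offset.
aware = datetime(2021, 6, 14, 15, 57, tzinfo=timezone.utc)

print(naive.tzinfo)   # None
print(aware.tzinfo)   # UTC

# Mixing the two is an error, because Python cannot know whether the
# naive value denotes the same instant:
try:
    naive < aware
except TypeError as exc:
    print(exc)  # can't compare offset-naive and offset-aware datetimes
```

Note that this is a different notion from the "always UTC, timezone
metadata absent" reading: a Python-naive value is deliberately
uninterpretable as an instant.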


On 14/06/2021 at 17:57, Antoine Pitrou wrote:
> 
> Hello,
> 
> In ARROW-13033, there was a disagreement as to how the specification
> about timezone-less timestamps should be interpreted.
> 
> Here is the wording in the Schema specification:
> 
>>    /// * If the time zone is null or equal to an empty string, the data is "time
>>    ///   zone naive" and shall be displayed *as is* to the user, not localized
>>    ///   to the locale of the user. This data can be thought of as UTC but
>>    ///   without having "UTC" as the time zone, it is not considered to be
>>    ///   localized to any time zone
> 
> My interpretation is that timestamp *values* are always expressed in
> UTC.  The timezone is an optional piece of metadata that describes the
> context in which they were obtained, but does not impact how the *values*
> should be interpreted.
> 
> Joris' interpretation is that timestamp *values* are expressed in an
> arbitrary "local time" that is unknown and unspecified. It is therefore
> difficult to exactly interpret them, since the timezone information is
> unavailable.
> 
> (I'll let Joris express his thoughts more accurately, but the gist of
> his opinion is that "can be thought of as UTC" is only an indication,
> not a prescription)
> 
> 
> To me, the problem with the "unknown local timezone" interpretation is
> that it renders the data essentially ambiguous and useless.  The problem
> is very similar to the problem of having string data without a
> well-known encoding. This is well-known to Python users as the Python 2
> encoding hell (to the point that it motivated the heavy and disruptive
> Python 3 transition).
> 
> (note the problem is even worse for timestamps. At least, you can with a
> high degree of probability detect that an arbitrary binary string is
> *not* UTF8-encoded. You cannot do so with timestamp values: any 64-bit
> timestamp may or may not be a UTC timestamp. Once you have lost that
> information, you cannot regain it anymore.)
> 
> In any case, I think this must be clarified, first on this mailing-list,
> then by making the spec wording stronger and more prescriptive.
> 
> Regards
> 
> Antoine.
>
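For what it's worth, the ambiguity described in the last quoted paragraph
can be made concrete with a small standard-library sketch (mine, not from
the thread; the offsets are chosen arbitrarily for illustration):

```python
from datetime import datetime, timedelta, timezone

# A bare integer, as stored in an Arrow timestamp column (seconds here).
value = 1623686220

# "Always UTC" reading: the value denotes exactly one instant.
utc_reading = datetime.fromtimestamp(value, tz=timezone.utc)
print(utc_reading)  # 2021-06-14 15:57:00+00:00

# "Unknown local time" reading: the same digits could be a wall-clock
# time in any zone, so the instant is unrecoverable. Two of the many
# candidate interpretations:
candidate_a = utc_reading.replace(tzinfo=timezone(timedelta(hours=2)))
candidate_b = utc_reading.replace(tzinfo=timezone(timedelta(hours=9)))

# As instants, the candidates disagree with each other and with UTC:
print(candidate_a.astimezone(timezone.utc))  # 2021-06-14 13:57:00+00:00
print(candidate_b.astimezone(timezone.utc))  # 2021-06-14 06:57:00+00:00
```

Once the stored integer is all that remains, nothing in the data itself
distinguishes these readings, which is the irreversibility point made
above.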