Posted to dev@arrow.apache.org by Evan Chan <ev...@urbanlogiq.com> on 2021/07/07 16:37:33 UTC

[Rust] Eliminate Timezone field from Timestamp types?

Hi folks,

Some of us are having a discussion about a direction change for the Rust Arrow timestamp types, which currently support both a resolution field (Ns, Micros, Ms, Seconds), similar to the other language implementations, and an optional timezone string field.  I believe the timezone field is unique to the Rust implementation, as I can't find it in the C/C++ and Python docs.  At the same time, a non-null timezone field is not well supported in the current code: functions returning timestamps pretty much all return a null timezone, for example, and don't allow the timezone to be specified.

The proposal would be to eliminate the timezone field and bring the Rust Arrow timestamp type in line with that of the other language implementations, which would also simplify the implementation.  This seems to be in line with the direction of other projects (Parquet, Spark, and most DBs have timestamp types which do not have explicit timezones or are implicitly UTC).

Please feel free to see https://github.com/apache/arrow-datafusion/issues/686
(Or would it be better to discuss here in mailing list?)

Cheers!
Evan

Re: [Rust] Eliminate Timezone field from Timestamp types?

Posted by Weston Pace <we...@gmail.com>.
Good question.  I'll take a stab at answering some of it.

C++ has the same passthrough / interoperability concerns.  Python is
significant as its built-in datetime module distinguishes between
"local" and "instant" datetimes (which it calls naive and aware).
In addition, pandas has a very similar representation (e.g. a
timestamp column with a single time zone string).  Pyarrow currently
supports interoperability with both.  So if you get a timestamp array
from pandas with a time zone string, pyarrow will convert it to a
timestamp column with a time zone string, and vice versa.  Wes & Joris
could probably give you a better answer on how pandas actually uses the
time zone string.  There is also interoperability with Parquet.
Parquet does not support an arbitrary time zone string (my guess is
Arrow is using metadata for that piece) but it does support a
distinction between local and instant semantics, and Arrow uses (timezone
string == null or empty) to populate that field.
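
To make the local/instant distinction concrete, here is a minimal sketch using only Python's standard library. The helper `is_adjusted_to_utc` is a hypothetical name (not a pyarrow or Parquet API); it only mirrors the rule described above, where a null or empty time zone string maps to local semantics.

```python
from datetime import datetime, timezone
from typing import Optional

# A "local" (naive) datetime carries no time zone; an "instant" (aware) one does.
local_dt = datetime(2021, 7, 7, 16, 37, 33)
instant_dt = datetime(2021, 7, 7, 16, 37, 33, tzinfo=timezone.utc)


def is_adjusted_to_utc(tz: Optional[str]) -> bool:
    """Hypothetical helper: instant semantics iff a non-empty time zone string is set."""
    return bool(tz)


print(local_dt.tzinfo is None)        # True for the naive value
print(instant_dt.tzinfo is not None)  # True for the aware value
print(is_adjusted_to_utc(None), is_adjusted_to_utc(""), is_adjusted_to_utc("UTC"))
```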

Second, some of the compute kernels are having time zone aware logic
added in ARROW-12980 so, for example, if you read in a column of unix
epochs (as int64) from a parquet file and you wanted to display them
as strings in your local time zone without leaving Arrow C++ you could
do something (roughly) like... Parquet -> INT64 -> Cast(Timestamp,
"Insert Local Timezone") -> strftime.  Although in that particular
case you could argue (and I think I might personally prefer) that
"Insert Local Timezone" could instead be an argument passed into
strftime.  Perhaps Joris & Rok could comment more as I think they've
been working in this area.
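
As a rough standard-library analogue of that pipeline (not the Arrow C++ kernels themselves), the sketch below takes an int64 Unix epoch, treats it as an instant, and formats it in a chosen zone. The function name `format_epoch` and the fixed UTC-7 offset standing in for "Insert Local Timezone" are assumptions for illustration; here the zone is passed to the formatting step, as in the alternative mentioned above.

```python
from datetime import datetime, timedelta, timezone


def format_epoch(epoch_seconds: int, tz: timezone,
                 fmt: str = "%Y-%m-%d %H:%M:%S %z") -> str:
    """Interpret an int64 Unix epoch as an instant, then render it in `tz`."""
    instant = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return instant.astimezone(tz).strftime(fmt)


# Assumed local zone for the example: a fixed UTC-7 offset (no DST handling).
local_tz = timezone(timedelta(hours=-7))

epoch = 1625675853  # 2021-07-07 16:37:33 UTC
print(format_epoch(epoch, timezone.utc))
print(format_epoch(epoch, local_tz))
```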

I believe there are plans for the compute kernel to also forbid
certain operations based on local vs instant semantics.  For example,
Cast(Timestamp-UTC -> Timestamp-MST) is ok and Cast(Timestamp-MST ->
Timestamp-EST) is ok but Cast(Timestamp-None -> Timestamp-MST) is NOT
ok (although there is a localize kernel if you know that is what you
want to do and you're agreeing to the risks).  Similarly "instants"
can be compared amongst themselves and "local times" can be compared
amongst themselves but "instants" cannot be compared with "local
times".
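
The cast rules above can be sketched as a small predicate. `can_cast` is a hypothetical name, not an Arrow API; it only encodes the examples given (zone-to-zone is allowed, none-to-zone is not). How zone-to-none should behave is not stated above, so rejecting it here is an assumption. The last lines show that Python's own datetime enforces the same comparison rule for ordering.

```python
from datetime import datetime, timezone
from typing import Optional


def can_cast(src_tz: Optional[str], dst_tz: Optional[str]) -> bool:
    """Hypothetical rule check: casts must not cross the local/instant boundary."""
    src_is_instant = bool(src_tz)
    dst_is_instant = bool(dst_tz)
    return src_is_instant == dst_is_instant


print(can_cast("UTC", "MST"))  # instant -> instant
print(can_cast("MST", "EST"))  # instant -> instant
print(can_cast(None, "MST"))   # local -> instant: needs an explicit localize step

naive = datetime(2021, 7, 7, 12, 0)
aware = datetime(2021, 7, 7, 12, 0, tzinfo=timezone.utc)
try:
    naive < aware
    mixed_comparison_allowed = True
except TypeError:
    # Python refuses to order naive and aware datetimes.
    mixed_comparison_allowed = False
print(mixed_comparison_allowed)
```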

-Weston

On Wed, Jul 7, 2021 at 3:04 PM Evan Chan <ev...@urbanlogiq.com> wrote:
>
> Thanks everyone for their input;
>
> Interoperability would be the biggest issue; how much does C++ do with the timezone string?
>
> -Evan
>

Re: [Rust] Eliminate Timezone field from Timestamp types?

Posted by Evan Chan <ev...@urbanlogiq.com>.
Thanks everyone for their input;

Interoperability would be the biggest issue; how much does C++ do with the timezone string?

-Evan

> On Jul 7, 2021, at 1:33 PM, Weston Pace <we...@gmail.com> wrote:
> 
> I don't know about removal but you could probably ignore the timezone
> string and it's not clear the issues would be that significant.
> 
> If Rust never produces a non-null non-UTC timestamp then I don't see
> that as an issue.
> 
> If you are consuming data with a time zone string other than UTC it
> isn't really clear what information that time zone string is supposed
> to convey anyway.  Are you supposed to extract fields as if you were
> in that time zone?  Or does this indicate the time zone the data was
> captured in?  PostgreSQL, etc. do not support this concept.  Probably
> the safest thing to do would be to reject the data.
> 
> There still remains the question of whether or not you need to
> distinguish between local times and instant times.  Or, in Python
> terms, naive vs. aware.  Or, in Parquet terms, whether you need to
> worry about the isAdjustedToUTC flag.  Or, in Postgres terms, whether
> you need to distinguish between "timestamp with time zone" and
> "timestamp without time zone".
> 
> This boils down to whether you want to support the constraints offered
> by these semantic hints from the user or not.  For example, forbidding
> comparison between the two types of timestamps or altering how you
> display them.  If those features are not important, then Rust could
> ignore the time zone field completely.  That could cause an
> interoperability issue though (e.g. data going into rust with timezone
> UTC comes back out with no timezone even though nothing changed).
> Ideally rust could ignore the time zone string but leave it unchanged.
> 


Re: [Rust] Eliminate Timezone field from Timestamp types?

Posted by Weston Pace <we...@gmail.com>.
I don't know about removal but you could probably ignore the timezone
string and it's not clear the issues would be that significant.

If Rust never produces a non-null non-UTC timestamp then I don't see
that as an issue.

If you are consuming data with a time zone string other than UTC it
isn't really clear what information that time zone string is supposed
to convey anyway.  Are you supposed to extract fields as if you were
in that time zone?  Or does this indicate the time zone the data was
captured in?  PostgreSQL, etc. do not support this concept.  Probably
the safest thing to do would be to reject the data.

There still remains the question of whether or not you need to
distinguish between local times and instant times.  Or, in Python
terms, naive vs. aware.  Or, in Parquet terms, whether you need to
worry about the isAdjustedToUTC flag.  Or, in Postgres terms, whether
you need to distinguish between "timestamp with time zone" and
"timestamp without time zone".

This boils down to whether you want to support the constraints offered
by these semantic hints from the user or not.  For example, forbidding
comparison between the two types of timestamps or altering how you
display them.  If those features are not important, then Rust could
ignore the time zone field completely.  That could cause an
interoperability issue though (e.g. data going into rust with timezone
UTC comes back out with no timezone even though nothing changed).
Ideally rust could ignore the time zone string but leave it unchanged.
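
A minimal sketch of the "ignore but leave unchanged" idea, written in Python for consistency with the pyarrow snippet elsewhere in this thread. `TimestampType`, `TimestampArray`, and `add_millis` are hypothetical stand-ins, not arrow-rs or pyarrow APIs; the point is only that a kernel can operate on the raw values while passing the time zone string through untouched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TimestampType:
    unit: str                 # e.g. "ms"
    tz: Optional[str] = None  # None means no time zone (local semantics)


@dataclass(frozen=True)
class TimestampArray:
    dtype: TimestampType
    values: List[int]  # raw epoch offsets expressed in `dtype.unit`


def add_millis(arr: TimestampArray, delta: int) -> TimestampArray:
    """Hypothetical kernel: works on raw values and never inspects or rewrites tz."""
    return TimestampArray(arr.dtype, [v + delta for v in arr.values])


utc_in = TimestampArray(TimestampType("ms", "UTC"), [0, 1_000])
utc_out = add_millis(utc_in, 500)
print(utc_out.dtype.tz)  # the time zone string survives the round trip
print(utc_out.values)
```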

On Wed, Jul 7, 2021 at 6:58 AM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> On Wed, 7 Jul 2021 at 18:46, Jorge Cardoso Leitão <jo...@gmail.com>
> wrote:
>
> > Hi,
> >
> > AFAIK timezone is part of the spec.
>
>
> And for reference, the current spec (Schema flatbuffer file) for timestamp
> is at
> https://github.com/apache/arrow/blob/6c8d30ea82222fd2750b999840872d3f6cbdc8f8/format/Schema.fbs#L217-L247.
>
>
>

Re: [Rust] Eliminate Timezone field from Timestamp types?

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Wed, 7 Jul 2021 at 18:46, Jorge Cardoso Leitão <jo...@gmail.com>
wrote:

> Hi,
>
> AFAIK timezone is part of the spec.


And for reference, the current spec (Schema flatbuffer file) for timestamp
is at
https://github.com/apache/arrow/blob/6c8d30ea82222fd2750b999840872d3f6cbdc8f8/format/Schema.fbs#L217-L247.



> In Python, that would be [1]
>
> import pyarrow as pa
> dt1 = pa.timestamp("ms", "+00:10")
> dt2 = pa.timestamp("ms")
>
> arrow-rs is not very consistent in how it handles it. IMO that is an
> artifact of it currently being difficult (API-wise) to create an array with a
> timezone, which has caused people to not use it much (and thus not
> implement kernels with it / test it properly).
>
> I do not see how removing it would be compatible with the Arrow spec,
> though.
>
> Best,
> Jorge
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html

Re: [Rust] Eliminate Timezone field from Timestamp types?

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Hi,

AFAIK timezone is part of the spec. In Python, that would be [1]

import pyarrow as pa
dt1 = pa.timestamp("ms", "+00:10")  # millisecond timestamp with a time zone (fixed offset)
dt2 = pa.timestamp("ms")            # millisecond timestamp without a time zone

arrow-rs is not very consistent in how it handles it. IMO that is an
artifact of it currently being difficult (API-wise) to create an array with a
timezone, which has caused people to not use it much (and thus not
implement kernels with it / test it properly).

I do not see how removing it would be compatible with the Arrow spec,
though.

Best,
Jorge

[1] https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html



On Wed, Jul 7, 2021 at 6:37 PM Evan Chan <ev...@urbanlogiq.com> wrote:

> Hi folks,
>
> Some of us are having a discussion about a direction change for the Rust
> Arrow timestamp types, which currently support both a resolution field
> (Ns, Micros, Ms, Seconds), similar to the other language implementations,
> and an optional timezone string field.  I believe the timezone field is
> unique to the Rust implementation, as I can't find it in the C/C++ and
> Python docs.  At the same time, a non-null timezone field is not well
> supported in the current code: functions returning timestamps pretty much
> all return a null timezone, for example, and don't allow the timezone to
> be specified.
>
> The proposal would be to eliminate the timezone field and bring the Rust
> Arrow timestamp type in line with that of the other language
> implementations, which would also simplify the implementation.  This seems
> to be in line with the direction of other projects (Parquet, Spark, and
> most DBs have timestamp types which do not have explicit timezones or are
> implicitly UTC).
>
> Please feel free to see
> https://github.com/apache/arrow-datafusion/issues/686
> (Or would it be better to discuss here in mailing list?)
>
> Cheers!
> Evan