Posted to dev@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2021/06/10 15:33:23 UTC

[Discuss] Handling timezones in (C++) compute kernels for timestamp data

Hi all,

There was recently a discussion on the interpretation of the spec about the
"timezone" field of the timestamp type (and about the different
timestamp-related types that Arrow should have). See
https://lists.apache.org/thread.html/r017084eed74edbc95810fc049056570f45b0bb034d6eeadd647e8621%40%3Cdev.arrow.apache.org%3E

Somewhat related, I want to start a discussion about the extent to which we
want to implement functionality (compute kernels) in Arrow C++ to deal with
timezones.

We just merged a PR that adds kernels to extract fields from timestamps
(year, month, day, hour, etc.; see ARROW-11759
<https://github.com/apache/arrow/pull/10176>). But once you start with
kernels for timestamp data, you quickly run into the question: what to do
with tz-aware timestamps (timestamps with a timezone)?

For example, we have:
- ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
making the kernels that extract timestamp fields timezone-aware. For
example, if you have a tz-aware timestamp with time "09:30:00+02:00", this
is stored internally as "07:30:00 UTC" (plus the actual timezone as metadata
of the type). A kernel that extracts the "hour" field should return 9 and
not 7 (which is what you get if you use the internal UTC value and ignore
the timezone information).
- ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
opened today) about adding functionality to convert a tz-naive "local time"
(local "clock" time in a not-yet-specified time zone) to a proper
timezone-aware timestamp with the user-specified time zone attached. This
is useful for data that does not have sufficient timezone information
attached to the data/type itself, but for which you know what the timezone
should be. For example, you have a timestamp with time "09:30:00" (no
explicit timezone, implicitly UTC), but you know this is actually
"09:30:00 CEST", so you want to convert it to the UTC time ("07:30:00Z")
that is equivalent to "09:30:00 CEST".

Both kernels require a conversion between "UTC time" and tz-naive
"local time" (C++ local_t <https://en.cppreference.com/w/cpp/chrono/local_t>),
which requires looking up the offset of the given timezone at that point
in time (the first example converts from UTC to local time, the second
from local time to UTC).
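
As a rough illustration of the two conversions, here is a minimal sketch in
Python (not the C++ date.h code the kernels would use). It assumes a fixed
+02:00 offset as a stand-in for the tz database lookup of "CEST", picks an
arbitrary date, and all variable names are made up for this example:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +02:00 offset stands in for looking up "CEST"
# in a timezone database.
CEST = timezone(timedelta(hours=2), "CEST")

# Case 1 (UTC -> local): a tz-aware value is stored as UTC, and the
# "hour" field should be extracted in the timestamp's own timezone.
stored_utc = datetime(2021, 6, 10, 7, 30, tzinfo=timezone.utc)
hour_ignoring_tz = stored_utc.hour
hour_in_zone = stored_utc.astimezone(CEST).hour

# Case 2 (local -> UTC): a tz-naive wall-clock value that the user knows
# to be CEST is converted to the equivalent UTC instant.
naive_local = datetime(2021, 6, 10, 9, 30)
as_utc = naive_local.replace(tzinfo=CEST).astimezone(timezone.utc)

print(hour_ignoring_tz, hour_in_zone, as_utc.isoformat())
```

With a real timezone the offset depends on the date (DST), which is why both
directions need the database lookup rather than a constant.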

Personally, I think kernels that can handle timezones are important
(if we want users to store tz-aware data in Arrow), but I want to ensure
we are generally OK with expanding the scope of Arrow to actually start
doing something with the tz information of the timestamp type (up to now we
just store that value in the type but never interpret it). That
means dealing with timezone offsets, timezone databases, etc. Luckily,
the date.h library (https://github.com/HowardHinnant/date) that we vendor
already includes all the required functionality.

Best,
Joris

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

Posted by Rok Mihevc <ro...@gmail.com>.
As there was a lot of discussion around timestamp localization, I'd
like to point out that there is now an open PR for it [1].

[1] https://github.com/apache/arrow/pull/10610

Rok

On Thu, Jun 10, 2021 at 11:11 PM Wes McKinney <we...@gmail.com> wrote:
>
> I agree that we need to implement the equivalent of pandas's
> "tz_localize" method which performs UTC normalization on tz-naive data
> and sets the timezone field. Here's a demo of this functionality (I
> originally implemented this years ago by porting pytz's logic to run
> against NumPy arrays in Cython):
>
> https://gist.github.com/wesm/0e02567c0c4bab768bc0ecabc2fcb6a8
>
> On Thu, Jun 10, 2021 at 3:04 PM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> > On Thu, 10 Jun 2021 at 18:06, Antoine Pitrou <an...@python.org> wrote:
> > >
> > > On Thu, 10 Jun 2021 17:33:23 +0200
> > > Joris Van den Bossche <jo...@gmail.com> wrote:
> > > >
> > > > We just merged a PR to add some kernels to extract fields from timestamps
> > > > (year, month, day, hour, etc -> ARROW-11759
> > > > <https://github.com/apache/arrow/pull/10176>). But once you start with
> > > > kernels for timestamp data, you quickly run into the question: what to do
> > > > with tz-aware timestamps with a timezone?
> > > >
> > > > For example, we have:
> > > > - ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
> > > > making those kernels to extract timestamp fields timezone aware. For
> > > > example, if you have tz-aware timestamp with hour "09:30:00+02:00", this is
> > > > stored internally as "07:30:00 UTC" (+ the actual timezone as metadata of
> > > > the type). And for a kernel to extract the "hour" field, you want that to
> > > > return 9 and not 7 (which would happen if we use the internal UTC value
> > > > ignoring the timezone information).
> > > > - ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
> > > > opened today) about adding functionality to convert a tz-naive "local time"
> > > > (local "clock" time in a not-yet-specified time zone) to a properly
> > > > timezone-aware timestamp with the user-specified time zone attached. This
> > > > can be useful to handle data that does not have sufficient timezone
> > > > information attached to the data/type itself, but for which you know what
> > > > the timezone should be. For example, having a timestamp with hour
> > > > "09:30:00" (no explicit timezone, implicitly UTC), but the user knows this
> > > > is actually "09:30:00 CEST", so then you want to convert this to the UTC
> > > > time ("07:30:00Z") that is equivalent to "09:30:00 CEST".
> > >
> > > I don't think it's helpful to discuss those two use cases together.
> > > The first case is talking about the semantics of a kernel on valid
> > > timestamp data.
> > > The second case is talking about invalid timestamp data (with values
> > > expressed in a non-UTC timezone).
> > >
> >
> > What both cases have in common is that they need to look up timezone
> > offsets to do a conversion and thus require access to a timezone
> > database (and requiring us to deal with things like Windows not having
> > a system tz database available). That was the main aspect I wanted to
> > ensure we are OK with in general ("dealing with timezones"), and less
> > the specifics of the two examples I gave.
> >
> > If that general issue doesn't turn out to be such a discussion point,
> > I think that would be a good start. And then indeed each case where we
> > might want to add timezone handling can be discussed separately (since
> > adding it to a second or third etc kernel is much less of an issue
> > than *starting* to do timezone handling).
> >
> > Joris
> >
> > > Regards
> > >
> > > Antoine.
> > >
> > >

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

Posted by Wes McKinney <we...@gmail.com>.
I agree that we need to implement the equivalent of pandas's
"tz_localize" method, which performs UTC normalization on tz-naive data
and sets the timezone field. Here's a demo of this functionality (I
originally implemented this years ago by porting pytz's logic to run
against NumPy arrays in Cython):

https://gist.github.com/wesm/0e02567c0c4bab768bc0ecabc2fcb6a8
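
For readers without pandas at hand, the "UTC normalization" step can be
sketched in plain Python roughly as follows. This is only an illustration,
not the pandas or Arrow implementation: the function name localize_to_utc is
hypothetical, values are assumed to be whole seconds since 1970-01-01 on the
local wall clock, and a fixed +02:00 offset is used in place of a tz database
lookup. Unlike pandas's tz_localize, it does not handle ambiguous or
nonexistent local times.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import List

NAIVE_EPOCH = datetime(1970, 1, 1)
UTC_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_SECOND = timedelta(seconds=1)


def localize_to_utc(wall_seconds: List[int], tz: tzinfo) -> List[int]:
    """Hypothetical helper: reinterpret tz-naive wall-clock seconds as
    local time in `tz` and return the equivalent UTC seconds."""
    utc_seconds = []
    for value in wall_seconds:
        wall_clock = NAIVE_EPOCH + timedelta(seconds=value)  # tz-naive
        aware = wall_clock.replace(tzinfo=tz)  # attach the zone, no shift
        # Subtracting two aware datetimes compares them as UTC instants.
        utc_seconds.append((aware - UTC_EPOCH) // ONE_SECOND)
    return utc_seconds


# Assumption: fixed offset standing in for "CEST".
cest = timezone(timedelta(hours=2), "CEST")
wall = [(datetime(2021, 6, 10, 9, 30) - NAIVE_EPOCH) // ONE_SECOND]
utc = localize_to_utc(wall, cest)
print(wall[0] - utc[0])  # difference in seconds between wall clock and UTC
```

The returned integers are what would be stored, with the timezone name set
on the type separately.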

On Thu, Jun 10, 2021 at 3:04 PM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> On Thu, 10 Jun 2021 at 18:06, Antoine Pitrou <an...@python.org> wrote:
> >
> > On Thu, 10 Jun 2021 17:33:23 +0200
> > Joris Van den Bossche <jo...@gmail.com> wrote:
> > >
> > > We just merged a PR to add some kernels to extract fields from timestamps
> > > (year, month, day, hour, etc -> ARROW-11759
> > > <https://github.com/apache/arrow/pull/10176>). But once you start with
> > > kernels for timestamp data, you quickly run into the question: what to do
> > > with tz-aware timestamps with a timezone?
> > >
> > > For example, we have:
> > > - ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
> > > making those kernels to extract timestamp fields timezone aware. For
> > > example, if you have tz-aware timestamp with hour "09:30:00+02:00", this is
> > > stored internally as "07:30:00 UTC" (+ the actual timezone as metadata of
> > > the type). And for a kernel to extract the "hour" field, you want that to
> > > return 9 and not 7 (which would happen if we use the internal UTC value
> > > ignoring the timezone information).
> > > - ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
> > > opened today) about adding functionality to convert a tz-naive "local time"
> > > (local "clock" time in a not-yet-specified time zone) to a properly
> > > timezone-aware timestamp with the user-specified time zone attached. This
> > > can be useful to handle data that does not have sufficient timezone
> > > information attached to the data/type itself, but for which you know what
> > > the timezone should be. For example, having a timestamp with hour
> > > "09:30:00" (no explicit timezone, implicitly UTC), but the user knows this
> > > is actually "09:30:00 CEST", so then you want to convert this to the UTC
> > > time ("07:30:00Z") that is equivalent to "09:30:00 CEST".
> >
> > I don't think it's helpful to discuss those two use cases together.
> > The first case is talking about the semantics of a kernel on valid
> > timestamp data.
> > The second case is talking about invalid timestamp data (with values
> > expressed in a non-UTC timezone).
> >
>
> What both cases have in common is that they need to look up timezone
> offsets to do a conversion and thus require access to a timezone
> database (and requiring us to deal with things like Windows not having
> a system tz database available). That was the main aspect I wanted to
> ensure we are OK with in general ("dealing with timezones"), and less
> the specifics of the two examples I gave.
>
> If that general issue doesn't turn out to be such a discussion point,
> I think that would be a good start. And then indeed each case where we
> might want to add timezone handling can be discussed separately (since
> adding it to a second or third etc kernel is much less of an issue
> than *starting* to do timezone handling).
>
> Joris
>
> > Regards
> >
> > Antoine.
> >
> >

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Thu, 10 Jun 2021 at 18:06, Antoine Pitrou <an...@python.org> wrote:
>
> On Thu, 10 Jun 2021 17:33:23 +0200
> Joris Van den Bossche <jo...@gmail.com> wrote:
> >
> > We just merged a PR to add some kernels to extract fields from timestamps
> > (year, month, day, hour, etc -> ARROW-11759
> > <https://github.com/apache/arrow/pull/10176>). But once you start with
> > kernels for timestamp data, you quickly run into the question: what to do
> > with tz-aware timestamps with a timezone?
> >
> > For example, we have:
> > - ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
> > making those kernels to extract timestamp fields timezone aware. For
> > example, if you have tz-aware timestamp with hour "09:30:00+02:00", this is
> > stored internally as "07:30:00 UTC" (+ the actual timezone as metadata of
> > the type). And for a kernel to extract the "hour" field, you want that to
> > return 9 and not 7 (which would happen if we use the internal UTC value
> > ignoring the timezone information).
> > - ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
> > opened today) about adding functionality to convert a tz-naive "local time"
> > (local "clock" time in a not-yet-specified time zone) to a properly
> > timezone-aware timestamp with the user-specified time zone attached. This
> > can be useful to handle data that does not have sufficient timezone
> > information attached to the data/type itself, but for which you know what
> > the timezone should be. For example, having a timestamp with hour
> > "09:30:00" (no explicit timezone, implicitly UTC), but the user knows this
> > is actually "09:30:00 CEST", so then you want to convert this to the UTC
> > time ("07:30:00Z") that is equivalent to "09:30:00 CEST".
>
> I don't think it's helpful to discuss those two use cases together.
> The first case is talking about the semantics of a kernel on valid
> timestamp data.
> The second case is talking about invalid timestamp data (with values
> expressed in a non-UTC timezone).
>

What both cases have in common is that they need to look up timezone
offsets to do a conversion and thus require access to a timezone
database (which in turn requires us to deal with things like Windows not
having a system tz database available). That was the main aspect I wanted
to ensure we are OK with in general ("dealing with timezones"), more than
the specifics of the two examples I gave.

If that general issue doesn't turn out to be such a discussion point,
I think that would be a good start. And then indeed each case where we
might want to add timezone handling can be discussed separately (since
adding it to a second or third etc kernel is much less of an issue
than *starting* to do timezone handling).

Joris

> Regards
>
> Antoine.
>
>

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

Posted by Antoine Pitrou <an...@python.org>.
On Thu, 10 Jun 2021 17:33:23 +0200
Joris Van den Bossche <jo...@gmail.com> wrote:
> 
> We just merged a PR to add some kernels to extract fields from timestamps
> (year, month, day, hour, etc -> ARROW-11759
> <https://github.com/apache/arrow/pull/10176>). But once you start with
> kernels for timestamp data, you quickly run into the question: what to do
> with tz-aware timestamps with a timezone?
> 
> For example, we have:
> - ARROW-12980 <https://issues.apache.org/jira/browse/ARROW-12980> about
> making those kernels to extract timestamp fields timezone aware. For
> example, if you have tz-aware timestamp with hour "09:30:00+02:00", this is
> stored internally as "07:30:00 UTC" (+ the actual timezone as metadata of
> the type). And for a kernel to extract the "hour" field, you want that to
> return 9 and not 7 (which would happen if we use the internal UTC value
> ignoring the timezone information).
> - ARROW-13033 <https://issues.apache.org/jira/browse/ARROW-13033> (which I
> opened today) about adding functionality to convert a tz-naive "local time"
> (local "clock" time in a not-yet-specified time zone) to a properly
> timezone-aware timestamp with the user-specified time zone attached. This
> can be useful to handle data that does not have sufficient timezone
> information attached to the data/type itself, but for which you know what
> the timezone should be. For example, having a timestamp with hour
> "09:30:00" (no explicit timezone, implicitly UTC), but the user knows this
> is actually "09:30:00 CEST", so then you want to convert this to the UTC
> time ("07:30:00Z") that is equivalent to "09:30:00 CEST".

I don't think it's helpful to discuss those two use cases together.
The first case is talking about the semantics of a kernel on valid
timestamp data.
The second case is talking about invalid timestamp data (with values
expressed in a non-UTC timezone).

Regards

Antoine.