Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/10/22 18:02:00 UTC

[jira] [Updated] (ARROW-14442) [R] Should we warn when converting timestamps with "" as tzone?

     [ https://issues.apache.org/jira/browse/ARROW-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Keane updated ARROW-14442:
-----------------------------------
    Description: 
{{POSIXct}}s in R can have a timezone specified as {{""}}, which is typically interpreted as the session's local timezone.

This can lead to surprising results like:

{code:r}
> Sys.timezone()
[1] "America/Chicago"
> as.integer(as.POSIXct("1970-01-01"))
[1] 21600
> Sys.setenv(TZ = "UTC")
> as.integer(as.POSIXct("1970-01-01"))
[1] 0
> Sys.setenv(TZ = "Australia/Brisbane")
> as.integer(as.POSIXct("1970-01-01"))
[1] -36000
{code}

See also: https://stackoverflow.com/questions/69670142/how-can-i-store-timezone-agnostic-dates-for-sharing-between-r-and-python-using-p/69678923#69678923 

This runs counter to how Arrow interprets timestamps without timezones: https://github.com/apache/arrow/blob/03669438bbce53078616c7f943a63fb0c11db196/format/Schema.fbs#L333-L336

> However, it may also be encoded into a Timestamp column with an empty timezone. The timestamp values should be computed "as if" the timezone of the date-time values was UTC; for example, the naive date-time "January 1st 1970, 00h00" would be encoded as timestamp value 0.

Critically, in R, when {{as.POSIXct("1970-01-01 00:00:00")}} is run, the timestamp value is computed "as if" the timezone of the date-time values was the local timezone (and *not* UTC, as the Arrow spec says).
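
A minimal sketch of where the difference comes from, using only base R. Setting {{TZ}} with {{Sys.setenv()}} is for illustration only, so that the result does not depend on the machine running it; the variable names {{x}} and {{y}} are arbitrary.

{code:r}
# Illustration only: pin the session timezone so the result is reproducible.
Sys.setenv(TZ = "America/Chicago")

x <- as.POSIXct("1970-01-01 00:00:00")
attr(x, "tzone")   # "": no timezone is recorded on the object
as.integer(x)      # 21600: midnight was read as US Central time

# Passing tz explicitly yields the value the Arrow spec describes
# for the naive date-time "January 1st 1970, 00h00".
y <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
attr(y, "tzone")   # "UTC"
as.integer(y)      # 0
{code}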

This can lead to some surprising results when converting these timezoneless timestamps from R to Arrow. Take {{as.POSIXct("1970-01-01 00:00:00")}} as an example and assume US Central time. We have a few options:

* Warn when the timezone is "" or not set that the behavior might be surprising.
  We store whatever integer R passes to us (21600), with no timezone set. When someone sees this formatted, the times/dates will be what the time was at UTC ("1970-01-01 06:00:00").
* Set the timezone to UTC without changing the integer value of the timestamp.
  We store whatever integer R passes to us (21600), with UTC set as the timezone. When someone sees this formatted, the times/dates will be in UTC ("1970-01-01 06:00:00 UTC"). This might be surprising or counterintuitive because the displayed times will differ from what was entered: they will be based in UTC and not in local time as people expect.
* Set the timezone to local time without changing the integer value of the timestamp.
  We store whatever integer R passes to us (21600), with CST set as the timezone. The display is then "1970-01-01 00:00:00 CST". This is surprising because we are asserting the local timezone when it is not specified in R.
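
The displays quoted in the options above can be reproduced in base R as a rough sketch. This only formats the integer 21600 under each interpretation; it does not call the arrow package, and the names {{secs}}, {{utc_view}}, and {{local_view}} are made up for the example.

{code:r}
# The value R hands over for as.POSIXct("1970-01-01 00:00:00")
# when the session timezone is US Central.
secs <- 21600

# Options 1 and 2: the integer is read against UTC.
utc_view <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
format(utc_view, "%Y-%m-%d %H:%M:%S")        # "1970-01-01 06:00:00"
format(utc_view, "%Y-%m-%d %H:%M:%S %Z")     # "1970-01-01 06:00:00 UTC"

# Option 3: the same integer with the local timezone attached.
local_view <- as.POSIXct(secs, origin = "1970-01-01", tz = "America/Chicago")
format(local_view, "%Y-%m-%d %H:%M:%S %Z")   # "1970-01-01 00:00:00 CST"
{code}

In all three cases the stored integer is identical; only the timezone metadata, and therefore the display, changes.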

If someone is using a timestamp without a tzone in R to represent a timezoneless timestamp, options 2 and 3 above violate that intent when the value is put into Arrow. If instead someone is using a timestamp that just happens to lack a tzone but assumes it is in local time, option 1 leads to (very) surprising results.
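
As a sketch of what a warning could look like, and of how a user might state either intent explicitly before converting, something along these lines is possible in base R. {{warn_if_no_tzone()}} is a hypothetical helper written for this example and is not a function in the arrow package; the {{Sys.setenv()}} call is again for illustration only.

{code:r}
# Hypothetical helper for illustration; not part of the arrow package.
warn_if_no_tzone <- function(x) {
  stopifnot(inherits(x, "POSIXct"))
  tz <- attr(x, "tzone")
  if (is.null(tz) || !nzchar(tz[1])) {
    warning(
      "POSIXct has no tzone set; its values were computed in the session timezone, not UTC",
      call. = FALSE
    )
  }
  invisible(x)
}

Sys.setenv(TZ = "America/Chicago")  # illustration only
x <- as.POSIXct("1970-01-01 00:00:00")
warn_if_no_tzone(x)                 # warns: tzone is ""

# Intent A: a timezoneless timestamp. Re-read the wall-clock text as UTC.
as_naive <- as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"), tz = "UTC")
as.integer(as_naive)                # 0

# Intent B: a local-time instant. Keep the value and record the timezone.
as_local <- x
attr(as_local, "tzone") <- "America/Chicago"
as.integer(as_local)                # 21600
{code}

Either explicit form removes the ambiguity, because the object no longer carries {{""}} as its tzone.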



> [R] Should we warn when converting timestamps with "" as tzone?
> ---------------------------------------------------------------
>
>                 Key: ARROW-14442
>                 URL: https://issues.apache.org/jira/browse/ARROW-14442
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Jonathan Keane
>            Priority: Major


