Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/10 13:50:00 UTC

[jira] [Commented] (ARROW-5912) [Python] conversion from datetime objects with mixed timezones should normalize to UTC

    [ https://issues.apache.org/jira/browse/ARROW-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193630#comment-17193630 ] 

Joris Van den Bossche commented on ARROW-5912:
----------------------------------------------

In the meantime, this conversion now produces a tz-aware pyarrow array, but it takes only the first timezone encountered. Ideally, when multiple timezones are present, it would use UTC instead. Still, the result is already more correct than before: the actual values stored under the hood are correctly normalized to UTC.
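
To illustrate the difference between the original behaviour and normalizing to UTC, here is a minimal stdlib-only sketch. It is not pyarrow's implementation: the fixed offsets stand in for Europe/Paris and Europe/Helsinki on 1970-01-01 (the offsets shown in the issue description), and {{normalize_to_utc}} is a hypothetical helper name used only for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets stand in for Europe/Paris (+01:00) and
# Europe/Helsinki (+02:00) as they applied on 1970-01-01.
paris = timezone(timedelta(hours=1))
helsinki = timezone(timedelta(hours=2))

ts_paris = datetime(1970, 1, 1, 1, 0, tzinfo=paris)
ts_helsinki = datetime(1970, 1, 1, 2, 0, tzinfo=helsinki)
values = [ts_paris, ts_helsinki]

# Dropping the timezone keeps each local wall-clock time, so two values
# that denote the same instant end up looking like different times.
naive_wall_clock = [v.replace(tzinfo=None) for v in values]


def normalize_to_utc(datetimes):
    """Hypothetical helper: express every tz-aware datetime in UTC."""
    return [v.astimezone(timezone.utc) for v in datetimes]


# After normalization both entries are the same tz-aware UTC datetime.
normalized = normalize_to_utc(values)
```

With mixed-timezone input, normalizing each value before building the array keeps the stored instants and the reported timezone consistent with each other, whereas reporting the first timezone encountered does not reflect the other inputs.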

> [Python] conversion from datetime objects with mixed timezones should normalize to UTC
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-5912
>                 URL: https://issues.apache.org/jira/browse/ARROW-5912
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: beginner
>
> Currently, when a sequence contains objects with mixed timezones, each object is interpreted separately as its local time:
> {code:python}
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris")
> >>> ts_pd_paris
> Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris')
> >>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki")
> >>> ts_pd_helsinki
> Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki')
> >>> a = pa.array([ts_pd_paris, ts_pd_helsinki])
> >>> a
> <pyarrow.lib.TimestampArray object at 0x7f7856c4a360>
> [
>   1970-01-01 01:00:00.000000,
>   1970-01-01 02:00:00.000000
> ]
> >>> a.type
> TimestampType(timestamp[us])
> {code}
> So both timestamps refer to the same moment in time (the same value in UTC; in pandas their stored {{value}} is also the same), but once converted to pyarrow they are both tz-naive and no longer represent the same time. That seems rather unexpected and a source of bugs.
> I think a better option would be to normalize to UTC and return a tz-aware TimestampArray with UTC as the timezone.
> That is also the behaviour of pandas if you force the conversion to produce datetimes (by default, pandas keeps them as an object array, preserving the different timezones).
