Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/09/23 14:43:00 UTC

[jira] [Resolved] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

     [ https://issues.apache.org/jira/browse/ARROW-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs resolved ARROW-4965.
------------------------------------
    Resolution: Fixed

> [Python] Timestamp array type detection should use tzname of datetime.datetime objects
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-4965
>                 URL: https://issues.apache.org/jira/browse/ARROW-4965
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>         Environment: $ python --version
> Python 3.7.2
> $ pip freeze
> numpy==1.16.2
> pyarrow==0.12.1
> pytz==2018.9
> six==1.12.0
> $ sw_vers
> ProductName:    Mac OS X
> ProductVersion: 10.14.3
> BuildVersion:   18D109
>            Reporter: Tim Swast
>            Assignee: Krisztian Szucs
>            Priority: Major
>             Fix For: 2.0.0
>
>
> When {{pa.array}} infers a type from {{datetime.datetime}} objects, it appears to ignore any {{tzinfo}} set on them and stores the values as naive timestamp columns.
> Python code:
> {code:python}
> import datetime
> import pytz
> import pyarrow as pa
> naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
> utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
> tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))
> def inspect(varname):
>     print(varname)
>     arr = globals()[varname]
>     print(arr.type)
>     print(arr)
>     print()
> auto_naive_arr = pa.array([naive_datetime])
> inspect("auto_naive_arr")
> auto_utc_arr = pa.array([utc_datetime])
> inspect("auto_utc_arr")
> auto_tzaware_arr = pa.array([tzaware_datetime])
> inspect("auto_tzaware_arr")
> auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
> inspect("auto_mixed_arr")
> naive_type = pa.timestamp("us", naive_datetime.tzname())
> utc_type = pa.timestamp("us", utc_datetime.tzname())
> tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())
> naive_arr = pa.array([naive_datetime], type=naive_type)
> inspect("naive_arr")
> utc_arr = pa.array([utc_datetime], type=utc_type)
> inspect("utc_arr")
> tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
> inspect("tzaware_arr")
> mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
> inspect("mixed_arr")
> {code}
> This prints:
> {noformat}
> $ python detect_timezone.py
> auto_naive_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> auto_utc_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> auto_tzaware_arr
> timestamp[us]
> [
>   1547352670000000
> ]
> auto_mixed_arr
> timestamp[us]
> [
>   1547381470000000,
>   1547352670000000
> ]
> naive_arr
> timestamp[us]
> [
>   1547381470000000
> ]
> utc_arr
> timestamp[us, tz=UTC]
> [
>   1547381470000000
> ]
> tzaware_arr
> timestamp[us, tz=PST]
> [
>   1547352670000000
> ]
> mixed_arr
> timestamp[us, tz=UTC]
> [
>   1547381470000000,
>   1547352670000000
> ]
> {noformat}
> But I would expect the following types instead:
> * {{auto_naive_arr}}: {{timestamp[us]}}
> * {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
> * {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as the {{tzname}})
> * {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}
> Also, in the "mixed" case, I'd expect the actual stored microseconds to be the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both refer to the same point in time. It seems reasonable for any naive datetime objects mixed in with tz-aware datetimes to be interpreted as UTC.
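
The expectations above can be sketched with the standard library alone. This is a minimal illustration, not pyarrow's implementation: the helper names {{infer_tz}}, {{to_epoch_us}} and {{wall_clock_us}} are hypothetical, a fixed-offset {{datetime.timezone}} stands in for {{pytz.timezone('America/Los_Angeles')}}, and the "first tz-aware value decides the type" rule is an assumption that happens to match the expected {{tz=UTC}} for the mixed case. Naive values are treated as UTC, as the description suggests.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
ONE_US = datetime.timedelta(microseconds=1)

# Assumption: a fixed -08:00 offset stands in for America/Los_Angeles in January.
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")


def infer_tz(values):
    """Return the tzname of the first tz-aware value, or None if all are naive."""
    for value in values:
        if value.tzinfo is not None:
            return value.tzname()
    return None


def to_epoch_us(value):
    """Microseconds since the Unix epoch; naive values are assumed to be UTC."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=datetime.timezone.utc)
    return (value - EPOCH) // ONE_US


def wall_clock_us(value):
    """Drop tzinfo and read the local wall-clock fields as if they were UTC."""
    return to_epoch_us(value.replace(tzinfo=None))


naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = naive_datetime.replace(tzinfo=datetime.timezone.utc)
tzaware_datetime = utc_datetime.astimezone(PST)

print(infer_tz([naive_datetime]))                  # None
print(infer_tz([utc_datetime, tzaware_datetime]))  # UTC
# Both values denote the same instant, so the stored integers agree.
print(to_epoch_us(utc_datetime), to_epoch_us(tzaware_datetime))
# Discarding tzinfo reproduces the second value printed for mixed_arr.
print(wall_clock_us(tzaware_datetime))             # 1547352670000000
```

The two stored values in the reported output differ by 28,800,000,000 microseconds, which is exactly the 8-hour PST offset. That is consistent with the local wall-clock fields being stored without conversion to UTC.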



--
This message was sent by Atlassian Jira
(v8.3.4#803005)