Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/04/16 14:40:00 UTC
[jira] [Commented] (ARROW-8482) [Python][R][Parquet] Possible time zone handling inconsistencies
[ https://issues.apache.org/jira/browse/ARROW-8482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084924#comment-17084924 ]
Wes McKinney commented on ARROW-8482:
-------------------------------------
All of the data is normalized to a UTC basis, so I don't believe the timestamp values themselves are being altered.
In Python I have:
{code}
In [6]: t
Out[6]:
pyarrow.Table
string_time_utc: timestamp[us]
timestamp_est: timestamp[us]
In [7]: t[0]
Out[7]:
<pyarrow.lib.ChunkedArray object at 0x7f9cefbe3590>
[
[
2018-02-01 14:00:00.531000,
2018-02-01 14:01:00.456000,
2018-03-05 14:01:02.200000
]
]
In [8]: t[1]
Out[8]:
<pyarrow.lib.ChunkedArray object at 0x7f9cef8e80b0>
[
[
2018-02-01 09:00:00.531000,
2018-02-01 09:01:00.456000,
2018-03-05 09:01:02.200000
]
]
{code}
In R now I have:
{code}
> t <- arrow::read_parquet('test.parquet', as_data_frame=FALSE)
> t
Table
3 rows x 2 columns
$string_time_utc <timestamp[us]>
$timestamp_est <timestamp[us]>
See $metadata for additional Schema metadata
> t$
t$string_time_utc t$timestamp_est
> t$column(0)
ChunkedArray
<timestamp[us]>
[
2018-02-01 14:00:00.531000,
2018-02-01 14:01:00.456000,
2018-03-05 14:01:02.200000
]
> t$column(1)
ChunkedArray
<timestamp[us]>
[
2018-02-01 09:00:00.531000,
2018-02-01 09:01:00.456000,
2018-03-05 09:01:02.200000
]
> t$column(0)$as_vector()
[1] "2018-02-01 08:00:00 CST" "2018-02-01 08:01:00 CST"
[3] "2018-03-05 08:01:02 CST"
> t$column(1)$as_vector()
[1] "2018-02-01 03:00:00 CST" "2018-02-01 03:01:00 CST"
[3] "2018-03-05 03:01:02 CST"
{code}
This is a locale issue. R apparently treats naive timestamps as local time. If you want a UTC interpretation in R, you need to store tz-aware UTC timestamps:
{code}
> t$column(1)$cast(arrow::timestamp("ms", "UTC"))$as_vector()
[1] "2018-02-01 09:00:00 UTC" "2018-02-01 09:01:00 UTC"
[3] "2018-03-05 09:01:02 UTC"
> t$column(0)$cast(arrow::timestamp("ms", "UTC"))$as_vector()
[1] "2018-02-01 14:00:00 UTC" "2018-02-01 14:01:00 UTC"
[3] "2018-03-05 14:01:02 UTC"
{code}
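To illustrate why the stored values need not have changed, here is a minimal sketch in plain Python (standard library only). It assumes the column holds microseconds since the Unix epoch on a UTC basis; the names stored_us, render and CST are hypothetical, and a fixed UTC-6 offset stands in for the R session's local time zone. The same integer renders as 14:00 UTC or 08:00 CST depending only on the zone used for display.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical stand-in for the first stored value: microseconds since the
# Unix epoch for 2018-02-01 14:00:00.531, assuming a UTC basis.
stored_us = (
    datetime(2018, 2, 1, 14, 0, 0, 531000, tzinfo=timezone.utc) - EPOCH
) // timedelta(microseconds=1)

# Assumption: a fixed UTC-6 offset approximates the "CST" local zone of the
# R session above (the sample dates fall outside daylight saving time).
CST = timezone(timedelta(hours=-6), "CST")


def render(us, tz):
    """Format an epoch-microsecond value as wall-clock time in the zone tz."""
    moment = EPOCH + timedelta(microseconds=us)
    return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z")


as_utc = render(stored_us, timezone.utc)  # UTC interpretation
as_cst = render(stored_us, CST)           # local-time interpretation

print(as_utc)
print(as_cst)
```

The underlying integer is identical in both calls; only the zone applied at display time differs, which is consistent with the as_vector() output shown above.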
[~npr] is my interpretation correct?
> [Python][R][Parquet] Possible time zone handling inconsistencies
> -----------------------------------------------------------------
>
> Key: ARROW-8482
> URL: https://issues.apache.org/jira/browse/ARROW-8482
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, R
> Reporter: Olaf
> Assignee: Wes McKinney
> Priority: Critical
>
> Hello there!
>
> First of all, thanks for making parquet files a reality in *R* and *Python*. This is really great.
> I found a very nasty bug when exchanging parquet files between the two platforms. Consider this.
>
>
> {code:java}
> import pandas as pd
> import pyarrow.parquet as pq
> import numpy as np
> df = pd.DataFrame({'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'),
> pd.to_datetime('2018-02-01 14:01:00.456'),
> pd.to_datetime('2018-03-05 14:01:02.200')]})
> df['timestamp_est'] = pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
> Out[5]:
> string_time_utc timestamp_est
> 0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
> 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
> 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
>
> Now I simply write to disk
>
> {code:java}
> df.to_parquet('myparquet.pq')
> {code}
>
> And then use *R* to load it.
>
> {code:java}
> test <- read_parquet('myparquet.pq')
> > test
> # A tibble: 3 x 2
> string_time_utc timestamp_est
> <dttm> <dttm>
> 1 2018-02-01 09:00:00.530999 2018-02-01 04:00:00.530999
> 2 2018-02-01 09:01:00.456000 2018-02-01 04:01:00.456000
> 3 2018-03-05 09:01:02.200000 2018-03-05 04:01:02.200000
> {code}
>
>
> As you can see, the timestamps have been converted in the process. I first referenced this bug in feather, but it is still there. This is a very dangerous, silent bug.
>
> What do you think?
> Thanks