Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/07/08 16:27:00 UTC
[jira] [Commented] (ARROW-5878) [Python][C++] Parquet reader not forward compatible for timestamps without timezone

    [ https://issues.apache.org/jira/browse/ARROW-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880520#comment-16880520 ] 

Wes McKinney commented on ARROW-5878:
-------------------------------------

This is something of a grey area because of the comments in parquet.thrift:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L334

I am OK with always setting TIMESTAMP_MICROS/TIMESTAMP_MILLIS ConvertedType for data that originates from Arrow.

Do you want to submit a PR? We are probably doing a 0.14.1 release, so this can get fixed fairly soon.
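
To make the proposal above concrete, here is a minimal sketch of the mapping it implies. This is not the Arrow or parquet-cpp writer code: the helper name `legacy_converted_type` and the unit strings "ms"/"us"/"ns" are assumptions for illustration. The only facts relied on are that the legacy ConvertedType enum has TIMESTAMP_MILLIS and TIMESTAMP_MICROS and no nanosecond counterpart.

```python
from typing import Optional

# Hypothetical lookup table (illustration only, not pyarrow API):
# time unit -> legacy Parquet ConvertedType name.
_LEGACY_CONVERTED_TYPE = {
    "ms": "TIMESTAMP_MILLIS",
    "us": "TIMESTAMP_MICROS",
}


def legacy_converted_type(unit: str, is_adjusted_to_utc: bool) -> Optional[str]:
    """Return the legacy ConvertedType to write next to the new
    Timestamp logical type, or None if no legacy equivalent exists."""
    # is_adjusted_to_utc is deliberately ignored: the proposal is to
    # always set the legacy annotation for data originating from Arrow,
    # so that older readers still see a timestamp column.
    return _LEGACY_CONVERTED_TYPE.get(unit)


print(legacy_converted_type("us", is_adjusted_to_utc=False))
print(legacy_converted_type("ns", is_adjusted_to_utc=False))
```

Under this sketch, the timezone-naive `datetime64` column from the report would carry TIMESTAMP_MICROS in addition to the new logical type, which is what an older reader needs in order to recognize it as a timestamp.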

> [Python][C++] Parquet reader not forward compatible for timestamps without timezone
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-5878
>                 URL: https://issues.apache.org/jira/browse/ARROW-5878
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Florian Jetter
>            Priority: Major
>             Fix For: 1.0.0
>
>         Attachments: timezones_pyarrow_14.paquet
>
>
> Timestamps without timezone that are written by pyarrow 0.14.0 can no longer be read as timestamps by earlier versions. With pyarrow 0.13.0, such a column is read as an integer instead.
> Looking at the Parquet schemas, it seems that the new logical type is not understood by the older versions; see below.
> h4. File generation with pyarrow 0.14.0
> {code:java}
> import datetime
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> df = pd.DataFrame(
>     {
>         "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
>         "datetime64_ts": pd.Series(
>             [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
>             dtype="datetime64[ns]",
>         ),
>     }
> )
> pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
> {code}
> h4. Reading with pyarrow 0.13.0
> {code:java}
> In [1]: import pyarrow.parquet as pq
> In [2]: import pyarrow as pa
> In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
>    ...:     table = pq.read_pandas(fd)
>    ...:
> In [4]: table.to_pandas()
> Out[4]:
>          datetime64             datetime64_ts
> 0  1514764800000000 2018-01-01 00:00:00+01:00
> In [5]: table.to_pandas().dtypes
> Out[5]:
> datetime64                               int64
> datetime64_ts    datetime64[ns, Europe/Berlin]
> dtype: object
> {code}
> h3. Parquet schema as seen by pyarrow versions:
> pyarrow 0.13.0 parquet schema
> {code:java}
> datetime64: INT64
> datetime64_ts: INT64 TIMESTAMP_MICROS
> {code}
> pyarrow 0.14.0 parquet schema
> {code:java}
> datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
> datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
> {code}


