Posted to dev@arrow.apache.org by "Florian Jetter (JIRA)" <ji...@apache.org> on 2019/07/08 15:18:00 UTC

[jira] [Created] (ARROW-5878) [Python][C++] Parquet reader not forward compatible for timestamps without timezone

Florian Jetter created ARROW-5878:
-------------------------------------

             Summary: [Python][C++] Parquet reader not forward compatible for timestamps without timezone
                 Key: ARROW-5878
                 URL: https://issues.apache.org/jira/browse/ARROW-5878
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.14.0
            Reporter: Florian Jetter
         Attachments: timezones_pyarrow_14.paquet

Timestamps without a timezone that are written by pyarrow 0.14.0 can no longer be read as timestamps by earlier versions. When such a file is read with pyarrow 0.13.0, the column comes back as an integer (int64) instead of a timestamp.

Looking at the Parquet schemas, it seems that older versions do not understand the new logical type annotation; see below.
h4. File generation with pyarrow 0.14.0
{code:java}
import datetime

import pandas as pd
import pyarrow as pa  # needed for pa.Table.from_pandas below
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
        "datetime64_ts": pd.Series(
            [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
            dtype="datetime64[ns, Europe/Berlin]",
        ),
    }
)
pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
{code}
h4. Reading with pyarrow 0.13.0
{code:java}
In [1]: import pyarrow.parquet as pq

In [2]: import pyarrow as pa

In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
   ...:     table = pq.read_pandas(fd)
   ...:

In [4]: table.to_pandas()
Out[4]:
         datetime64             datetime64_ts
0  1514764800000000 2018-01-01 00:00:00+01:00

In [5]: table.to_pandas().dtypes
Out[5]:
datetime64                               int64
datetime64_ts    datetime64[ns, Europe/Berlin]
dtype: object
{code}
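
The integer shown for the {{datetime64}} column looks like microseconds since the Unix epoch, which matches the microsecond time unit in the schemas. As a minimal sketch of how an affected reader could recover the value by hand, the snippet below uses only the standard library; {{micros_to_datetime}} is a hypothetical helper name, and the assumption that the raw value is timezone-naive epoch microseconds comes from the output above, not from pyarrow documentation. With pandas, {{pd.to_datetime(series, unit="us")}} should perform the same conversion on a whole column.

```python
import datetime


def micros_to_datetime(value):
    """Interpret an integer as microseconds since the Unix epoch.

    Hypothetical helper: assumes the value is a timezone-naive timestamp
    stored with microsecond resolution, as the schema dumps suggest.
    """
    epoch = datetime.datetime(1970, 1, 1)
    return epoch + datetime.timedelta(microseconds=value)


raw = 1514764800000000  # value pyarrow 0.13.0 returned for the datetime64 column
restored = micros_to_datetime(raw)
print(restored)  # 2018-01-01 00:00:00
```

This only works around the symptom on the reading side; the file itself still carries an annotation the older reader ignores.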
h4. Parquet schema as seen by each pyarrow version

pyarrow 0.13.0 parquet schema
{code:java}
datetime64: INT64
datetime64_ts: INT64 TIMESTAMP_MICROS
{code}
pyarrow 0.14.0 parquet schema
{code:java}
datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
{code}
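
The difference between the two dumps can be summarized as a small model: an older reader only sees the legacy annotation, and in this file that annotation is present only for the UTC-adjusted column. The sketch below is purely illustrative and is derived from the two schema dumps above; {{legacy_annotation}} and {{schema_line_for_old_reader}} are hypothetical names, not pyarrow or Parquet APIs, and only the microsecond unit seen in this report is modeled.

```python
from typing import Optional


def legacy_annotation(is_adjusted_to_utc: bool, time_unit: str) -> Optional[str]:
    """Return the legacy annotation an old reader would see, if any.

    Assumption taken from the schema dumps in this report: only the
    UTC-adjusted microsecond timestamp shows up as TIMESTAMP_MICROS.
    """
    if is_adjusted_to_utc and time_unit == "microseconds":
        return "TIMESTAMP_MICROS"
    return None


def schema_line_for_old_reader(name: str, is_adjusted_to_utc: bool, time_unit: str) -> str:
    """Format a column the way the pyarrow 0.13.0 dump above prints it."""
    annotation = legacy_annotation(is_adjusted_to_utc, time_unit)
    if annotation is None:
        # No annotation the old reader understands: it falls back to plain INT64.
        return f"{name}: INT64"
    return f"{name}: INT64 {annotation}"


columns = [
    ("datetime64", False, "microseconds"),
    ("datetime64_ts", True, "microseconds"),
]
lines = [schema_line_for_old_reader(*column) for column in columns]
print("\n".join(lines))
```

Under this model the timezone-naive column is indistinguishable from a plain integer column for the older reader, which is consistent with the {{int64}} dtype reported above.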


