Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/27 08:16:00 UTC
[jira] [Commented] (ARROW-11388) Dataset Timezone Handling

    [ https://issues.apache.org/jira/browse/ARROW-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272656#comment-17272656 ] 

Joris Van den Bossche commented on ARROW-11388:
-----------------------------------------------

[~andydoug] thanks for the report. You are bumping into a few different issues here:

1. {{Dataset.to_table()}} raising an error when you specify the {{schema}} manually and it doesn't exactly match the file's schema is a known limitation (_"fields had matching names but differing types. From: timestamp: timestamp[us, tz=UTC] To: timestamp: timestamp[ns, tz=US/Eastern]"_). Right now the types need to match exactly, and we need to relax this constraint. We hope to fix this for the next version; it is generally covered by ARROW-11003.

2. Normally, when writing a pyarrow Table with a timezone to parquet and reading it back in, we should be able to preserve the timezone. Parquet itself doesn't support storing a timezone: it can only record that a timestamp is "timezone-aware" (stored in UTC), which is why the column still comes back as UTC. We do store the timezone in additional metadata in the parquet file, though. For non-nanosecond resolutions this actually works, but not if the data is originally in nanosecond resolution. This is covered by ARROW-9634 (and the reason you have nanosecond data in the first place is that your data comes from pandas).

Note that if you don't specify the schema, the timezone will still be restored after conversion to pandas (because we also store the timezone in the pandas metadata):

{code}
In [61]: dataset = ds.dataset(test_dir, format="parquet")

In [62]: dataset.to_table().to_pandas().index.dtype
Out[62]: datetime64[ns, US/Eastern]
{code}

> Dataset Timezone Handling
> -------------------------
>
>                 Key: ARROW-11388
>                 URL: https://issues.apache.org/jira/browse/ARROW-11388
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0, 3.0.0
>            Reporter: Andy Douglas
>            Priority: Minor
>
> I'm trying to write a pandas dataframe with a datetimeindex with timezone information to a pyarrow dataset but the timezone information doesn't seem to be written (apart from in the pandas metadata)
>  
> For example
>  
> {code:python}
> import os
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> from pathlib import Path
> # I've tried with both v2.0 and v3.0 today
> print(pa.__version__)
> # create dummy dataframe with datetime index containing tz info
> df = pd.DataFrame(
>     dict(
>         timestamp=pd.date_range("2021-01-01", freq="1T", periods=100, tz="US/Eastern"),
>         x=np.arange(100),
>     )
> ).set_index("timestamp")
> test_dir = Path("test_dir")
> table = pa.Table.from_pandas(df)
> schema = table.schema
> print(schema)
> print(schema.pandas_metadata)
> # warning - creates dir in cwd
> pq.write_to_dataset(table, test_dir)
> # timestamp column is us and UTC
> print(pq.ParquetFile(test_dir / os.listdir(test_dir)[0]).read())
> # create dataset using schema from earlier
> dataset = ds.dataset(test_dir, format="parquet", schema=schema)
> # doesn't work
> dataset.to_table()
> {code}
>  
>  
> Is this a bug or am I missing something?
> Thanks
> Andy
>  


