Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/01/07 17:29:03 UTC
[jira] [Commented] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index

    [ https://issues.apache.org/jira/browse/ARROW-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315396#comment-16315396 ] 

ASF GitHub Bot commented on ARROW-1958:
---------------------------------------

wesm closed pull request #1454: ARROW-1958: [Python] Error in pandas conversion for datetimetz row index
URL: https://github.com/apache/arrow/pull/1454
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index f08049f6a..f3089d2a0 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -459,6 +459,8 @@ def _make_datetimetz(tz):
 
 def table_to_blockmanager(options, table, memory_pool, nthreads=1,
                           categoricals=None):
+    from pyarrow.compat import DatetimeTZDtype
+
     index_columns = []
     columns = []
     column_indexes = []
@@ -517,7 +519,12 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1,
                 # non-writeable arrays when calling MultiIndex.from_arrays
                 values = values.copy()
 
-            index_arrays.append(pd.Series(values, dtype=col_pandas.dtype))
+            if isinstance(col_pandas.dtype, DatetimeTZDtype):
+                index_array = (pd.Series(values).dt.tz_localize('utc')
+                               .dt.tz_convert(col_pandas.dtype.tz))
+            else:
+                index_array = pd.Series(values, dtype=col_pandas.dtype)
+            index_arrays.append(index_array)
             index_names.append(
                 _backwards_compatible_index_name(raw_name, logical_name)
             )
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index 7609d3488..ef2909d61 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -254,6 +254,16 @@ def test_datetimetz_column_index(self):
         md = column_indexes['metadata']
         assert md['timezone'] == 'America/New_York'
 
+    def test_datetimetz_row_index(self):
+        df = pd.DataFrame({
+            'a': pd.date_range(
+                start='2017-01-01', periods=3, tz='America/New_York'
+            )
+        })
+        df = df.set_index('a')
+
+        _check_pandas_roundtrip(df, preserve_index=True)
+
     def test_categorical_row_index(self):
         df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
         df['a'] = df.a.astype('category')
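The branch added to table_to_blockmanager above can be sketched in isolation, using only pandas and NumPy. The helper name restore_index_values is hypothetical and is not part of pyarrow. The sketch also assumes that pd.DatetimeTZDtype can stand in for the DatetimeTZDtype that the patch imports from pyarrow.compat, and that the incoming values are tz-naive datetime64[ns] holding UTC instants, as the issue describes.

```python
import numpy as np
import pandas as pd


def restore_index_values(values, dtype):
    """Hypothetical helper mirroring the branch added in the patch.

    Assumes `values` are tz-naive datetime64[ns] holding UTC instants
    whenever `dtype` is a DatetimeTZDtype.
    """
    if isinstance(dtype, pd.DatetimeTZDtype):
        # Label the naive values as UTC first, then convert to the target zone.
        return (pd.Series(values).dt.tz_localize("utc")
                .dt.tz_convert(dtype.tz))
    # Every other dtype keeps the original construction path.
    return pd.Series(values, dtype=dtype)


# Illustrative inputs: midnight in New York on two January days, stored as UTC.
utc_values = np.array(
    ["2017-01-01T05:00", "2017-01-02T05:00"], dtype="datetime64[ns]"
)
tz_dtype = pd.DatetimeTZDtype(tz="America/New_York")

aware = restore_index_values(utc_values, tz_dtype)
plain = restore_index_values(np.array([1, 2, 3]), np.dtype("int64"))

print(aware)
print(plain)
```

Localizing to UTC before converting keeps the underlying instants unchanged, so the result does not depend on how the Series constructor interprets naive values for a timezone-aware dtype.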


 



> [Python] Error in pandas conversion for datetimetz row index
> ------------------------------------------------------------
>
>                 Key: ARROW-1958
>                 URL: https://issues.apache.org/jira/browse/ARROW-1958
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Ubuntu 16.04
>            Reporter: Albert Shieh
>            Assignee: Albert Shieh
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> The pandas conversion of a datetimetz row index in a Table fails for non-UTC time zones: the values are stored as datetime64[ns] and interpreted directly as datetime64[ns, tz], when they should be interpreted as datetime64[ns, UTC] and then converted to datetime64[ns, tz]. Time zones are handled correctly for columns in Column.to_pandas, but not for the row index in table_to_blockmanager.
> This is a minimal example demonstrating the failure of a roundtrip between a DataFrame and a Table:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({
>     'a': pd.date_range(
>         start='2017-01-01', periods=3, tz='America/New_York'
>     )
> })
> df = df.set_index('a')
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {noformat}
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 00:00:00-05:00]
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 05:00:00-05:00]
> {noformat}
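The five-hour shift in the output above can be reproduced with pandas alone, without pyarrow. This is a minimal sketch that assumes the storage model described in the issue (tz-naive datetime64[ns] values holding UTC instants). The variable names are illustrative, and the "misread" step only imitates the faulty interpretation; it does not call the actual pyarrow code path.

```python
import pandas as pd

tz = "America/New_York"
original = pd.date_range(start="2017-01-01", periods=3, tz=tz)

# Assumed storage model: tz-naive values that hold the UTC instants.
stored = original.tz_convert("UTC").tz_localize(None)

# Faulty reading: treat the stored UTC values as wall-clock times in the
# target zone, which shifts every timestamp by the UTC offset.
misread = stored.tz_localize(tz)

# Intended reading: label the values as UTC, then convert to the target zone.
restored = stored.tz_localize("UTC").tz_convert(tz)

print(misread[0])   # wall clock reads 05:00 in New York, as in the report
print(restored[0])  # wall clock reads 00:00 in New York, as in the input
```

In January New York is five hours behind UTC, so the misread index lands five hours after the original, which matches the 05:00:00-05:00 values shown in the second DataFrame.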


