Posted to issues@arrow.apache.org by "anmyachev (via GitHub)" <gi...@apache.org> on 2023/05/31 19:37:03 UTC

[GitHub] [arrow] anmyachev opened a new issue, #35855: `pyarrow.Table.to_pandas` creates Index instead of PeriodIndex

anmyachev opened a new issue, #35855:
URL: https://github.com/apache/arrow/issues/35855

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   `pyarrow.Table.to_pandas` returns a plain `Index` instead of a `PeriodIndex` for the attached file. Interestingly, once the same file has been read through pandas, reading it with pyarrow gives the correct result.
   
   ```python
   import pandas
   import pyarrow.parquet
   
   path = "test1111.parquet"
   print(pyarrow.parquet.read_table(path, use_pandas_metadata=True)[:2].to_pandas().index)  # <- Index
   print("##############################")
   print(pandas.read_parquet(path, engine="pyarrow")[:2].index)
   print("##############################")
   print(pyarrow.parquet.read_table(path, use_pandas_metadata=True)[:2].to_pandas().index)  # <- PeriodIndex
   ```
   
   Deps:
   ```bash
   pyarrow==12.0.0
   pandas==2.0.2
   python=3.9
   ```
   
   Output:
   ```bash
   ----For pyarrow==12.0.0:
   Index([17167, 17168], dtype='int64', name='idx_periodrange')
   ##############################
   PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]', name='idx_periodrange')
   ##############################
   PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]', name='idx_periodrange')
   ```
   
   For a reproducer, this file must first be unzipped.
   [test1111.zip](https://github.com/apache/arrow/files/11618246/test1111.zip)
   
   
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35855: `pyarrow.Table.to_pandas` creates Index instead of PeriodIndex

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35855:
URL: https://github.com/apache/arrow/issues/35855#issuecomment-1584544321

   Perhaps `pyarrow` could ensure that `pandas.core.arrays.arrow.extension_types` is imported in the `to_pandas` method?



[GitHub] [arrow] westonpace commented on issue #35855: `pyarrow.Table.to_pandas` creates Index instead of PeriodIndex

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35855:
URL: https://github.com/apache/arrow/issues/35855#issuecomment-1584542880

   I think this is expected behavior from pyarrow's perspective.  The extension type is registered here: https://github.com/pandas-dev/pandas/blob/1a254df2e7e5a100cad1af4a97eded5177ae7d3e/pandas/core/arrays/arrow/extension_types.py#LL57C33-L57C45
   
   So until that file is imported, Arrow doesn't recognize the extension type.  In other words, everything works if you change the start of the file:
   
   ```python
   import pandas
   import pandas.core.arrays.arrow.extension_types
   import pyarrow.parquet
   ```



[GitHub] [arrow] anmyachev commented on issue #35855: `pyarrow.Table.to_pandas` creates Index instead of PeriodIndex

Posted by "anmyachev (via GitHub)" <gi...@apache.org>.

anmyachev commented on issue #35855:
URL: https://github.com/apache/arrow/issues/35855#issuecomment-1587594544

   > I think this is expected behavior from pyarrow's perspective. The extension type is registered here: https://github.com/pandas-dev/pandas/blob/1a254df2e7e5a100cad1af4a97eded5177ae7d3e/pandas/core/arrays/arrow/extension_types.py#LL57C33-L57C45
   > 
   > So until that file is imported, Arrow doesn't recognize the extension type. In other words, everything works if you change the start of the file:
   > 
   > ```
   > import pandas
   > import pandas.core.arrays.arrow.extension_types
   > import pyarrow.parquet
   > ```
   
   Thanks @westonpace for the answer! I'll try to do a workaround.
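
One possible shape for such a workaround is sketched below. It is only an illustration: the helper name `ensure_pandas_arrow_extension_types` is hypothetical, and the module path is the one quoted in this thread, which is a private pandas module and may move between pandas versions. The helper tries to import the module so that its extension types get registered, and reports whether the import succeeded.

```python
import importlib

# Module path taken from the thread; it is private to pandas and may change.
PANDAS_EXTENSION_TYPES_MODULE = "pandas.core.arrays.arrow.extension_types"


def ensure_pandas_arrow_extension_types():
    """Import the pandas module that registers its Arrow extension types.

    Returns True if the import succeeded, False if pandas (or that
    module) is not available in the current environment.
    """
    try:
        importlib.import_module(PANDAS_EXTENSION_TYPES_MODULE)
    except ImportError:
        return False
    return True


# Call once before reading with pyarrow and converting with to_pandas().
registered = ensure_pandas_arrow_extension_types()
print(registered)
```

Calling the helper before `pyarrow.parquet.read_table(...).to_pandas()` has the same effect as the explicit import suggested above, while degrading gracefully if the module is missing.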

