Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/01/05 09:52:00 UTC
[jira] [Commented] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe

    [ https://issues.apache.org/jira/browse/ARROW-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469158#comment-17469158 ] 

Alenka Frim commented on ARROW-10643:
-------------------------------------

While trying other index types in the empty pandas DataFrame roundtrip, I ran into a different error.

In `_table_to_blocks` (pandas_compat.py), the `extension_columns` argument should be {} for an empty table, but it is {None: interval[int64, right]} for `pd.interval_range`, so an error is raised because None cannot be encoded. The same happens for `pd.PeriodIndex`.

Example:
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(index=pd.interval_range(start=0, end=5))
table = pa.table(df)
table.to_pandas().shape
{code}
Error (the traceback below was captured for the `pd.PeriodIndex` case):
{code:python}
TypeError                                 Traceback (most recent call last)
/var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1m0000gn/T/ipykernel_13963/1439451337.py in <module>
      1 df5 = pd.DataFrame(index=pd.PeriodIndex(year=[2000, 2002], quarter=[1, 3]))
      2 table5 = pa.table(df5)
----> 3 table5.to_pandas().shape

~/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
    764             self_destruct=self_destruct
    765         )
--> 766         return self._to_pandas(options, categories=categories,
    767                                ignore_metadata=ignore_metadata,
    768                                types_mapper=types_mapper)

~/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
   1819                    types_mapper=None):
   1820         from pyarrow.pandas_compat import table_to_blockmanager
-> 1821         mgr = table_to_blockmanager(
   1822             options, self, categories,
   1823             ignore_metadata=ignore_metadata,

~/repos/arrow/python/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    787     _check_data_column_metadata_consistency(all_columns)
    788     columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    790 
    791     axes = [columns, index]

~/repos/arrow/python/pyarrow/pandas_compat.py in _table_to_blocks(options, block_table, categories, extension_columns)
   1133     # Convert an arrow table to Block from the internal pandas API
   1134     columns = block_table.column_names
-> 1135     result = pa.lib.table_to_blocks(options, block_table, categories,
   1136                                     list(extension_columns.keys()))
   1137     return [_reconstruct_block(item, columns, extension_columns)

~/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.table_to_blocks()
   1215         c_options.categorical_columns = {tobytes(cat) for cat in categories}
   1216     if extension_columns is not None:
-> 1217         c_options.extension_columns = {tobytes(col)
   1218                                        for col in extension_columns}
   1219 

~/repos/arrow/python/pyarrow/lib.cpython-39-darwin.so in set.from_py.__pyx_convert_unordered_set_from_py_std_3a__3a_string()

~/repos/arrow/python/pyarrow/lib.cpython-39-darwin.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, NoneType found
{code}
I will create a separate issue for this.
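
A minimal sketch of the failure mode, written in plain Python so it runs without pyarrow. `encode_column_names` and `drop_unnamed` are hypothetical helpers, not pyarrow functions: the first only imitates the name-encoding step seen in the traceback, and the second is one possible guard, not the fix that was adopted. The dtype is represented as a string for illustration.

```python
def encode_column_names(extension_columns):
    """Hypothetical stand-in for the step that encodes column names to bytes."""
    encoded = set()
    for name in extension_columns:
        if not isinstance(name, (str, bytes)):
            # Imitates the reported failure: an unnamed index appears as None.
            raise TypeError(f"expected bytes, {type(name).__name__} found")
        encoded.add(name.encode("utf-8") if isinstance(name, str) else name)
    return encoded


def drop_unnamed(extension_columns):
    """Hypothetical guard: keep only entries keyed by a real column name."""
    return {k: v for k, v in extension_columns.items() if k is not None}


# The mapping reported for an empty frame with an IntervalIndex
# (assumption: dtype shown as a plain string).
reported = {None: "interval[int64, right]"}

try:
    encode_column_names(reported)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# With the None key removed, the mapping is {} and encoding succeeds.
filtered = drop_unnamed(reported)
encoded_after_guard = encode_column_names(filtered)
```

Filtering out the None key only shows that an empty mapping avoids the error; whether the index dtype should instead be handled elsewhere is left open here.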

> [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-10643
>                 URL: https://issues.apache.org/jira/browse/ARROW-10643
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Alenka Frim
>            Priority: Major
>              Labels: conversion, pandas, pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> From https://github.com/pandas-dev/pandas/issues/37897
> The roundtrip of an empty pandas.DataFrame _with_ an index (so no columns, but a non-zero shape for the rows) isn't faithful:
> {code}
> In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))
> In [34]: df
> Out[34]: 
> Empty DataFrame
> Columns: []
> Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> In [35]: df.shape
> Out[35]: (10, 0)
> In [36]: table = pa.table(df)
> In [37]: table.to_pandas()
> Out[37]: 
> Empty DataFrame
> Columns: []
> Index: []
> In [38]: table.to_pandas().shape
> Out[38]: (0, 0)
> {code}
> Since the pandas metadata in the Table actually has this RangeIndex information:
> {code}
> In [39]: table.schema.pandas_metadata
> Out[39]: 
> {'index_columns': [{'kind': 'range',
>    'name': None,
>    'start': 0,
>    'stop': 10,
>    'step': 1}],
>  'column_indexes': [{'name': None,
>    'field_name': None,
>    'pandas_type': 'empty',
>    'numpy_type': 'object',
>    'metadata': None}],
>  'columns': [],
>  'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'},
>  'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'}
> {code}
> we should in principle be able to correctly roundtrip this case.
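
A rough sketch of how the range information in that metadata could be used to recover the row count, assuming the metadata layout shown in the issue description. `range_index_from_metadata` is a hypothetical helper, not part of pyarrow, and a built-in `range` stands in for `pd.RangeIndex` so the example runs without pandas.

```python
def range_index_from_metadata(pandas_metadata):
    """Return a range built from the first 'range' index descriptor, or None."""
    for descr in pandas_metadata.get("index_columns", []):
        # Range indexes are stored as dict descriptors; other entries are skipped.
        if isinstance(descr, dict) and descr.get("kind") == "range":
            return range(descr["start"], descr["stop"], descr["step"])
    return None


# Trimmed copy of the metadata from the issue description.
metadata = {
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}
    ],
    "columns": [],
}

index = range_index_from_metadata(metadata)
# Rows come from the index, columns from the (empty) column list.
expected_shape = (len(index), len(metadata["columns"]))
```

Here `expected_shape` matches the `(10, 0)` shape of the original DataFrame rather than the `(0, 0)` produced by the current roundtrip.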



--
This message was sent by Atlassian Jira
(v8.20.1#820001)