Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/01/19 17:21:00 UTC

[jira] [Reopened] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe

     [ https://issues.apache.org/jira/browse/ARROW-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reopened ARROW-10643:
-------------------------------------------
      Assignee:     (was: Alenka Frim)

> [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-10643
>                 URL: https://issues.apache.org/jira/browse/ARROW-10643
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: conversion, pandas, pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> From https://github.com/pandas-dev/pandas/issues/37897
> The roundtrip of an empty pandas.DataFrame _with_ an index (so no columns, but a non-zero shape for the rows) isn't faithful:
> {code}
> In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1))
> In [34]: df
> Out[34]: 
> Empty DataFrame
> Columns: []
> Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> In [35]: df.shape
> Out[35]: (10, 0)
> In [36]: table = pa.table(df)
> In [37]: table.to_pandas()
> Out[37]: 
> Empty DataFrame
> Columns: []
> Index: []
> In [38]: table.to_pandas().shape
> Out[38]: (0, 0)
> {code}
> Since the pandas metadata in the Table actually has this RangeIndex information:
> {code}
> In [39]: table.schema.pandas_metadata
> Out[39]: 
> {'index_columns': [{'kind': 'range',
>    'name': None,
>    'start': 0,
>    'stop': 10,
>    'step': 1}],
>  'column_indexes': [{'name': None,
>    'field_name': None,
>    'pandas_type': 'empty',
>    'numpy_type': 'object',
>    'metadata': None}],
>  'columns': [],
>  'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'},
>  'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'}
> {code}
> we should in principle be able to correctly roundtrip this case.
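
To illustrate why the metadata shown above is sufficient, here is a minimal sketch of recovering the row range from the `index_columns` entry. The helper name `range_from_pandas_metadata` is hypothetical and is not part of pyarrow or pandas; this only shows that the information is recoverable, not how pyarrow itself handles the conversion.

```python
def range_from_pandas_metadata(metadata):
    """Return a Python range for a single 'range' index descriptor, else None.

    Hypothetical helper for illustration only; not a pyarrow or pandas API.
    """
    index_columns = metadata.get("index_columns", [])
    if len(index_columns) != 1:
        return None
    descr = index_columns[0]
    # Only a dict descriptor with kind == "range" carries start/stop/step;
    # any other entry (for example a stored column name) is ignored here.
    if not isinstance(descr, dict) or descr.get("kind") != "range":
        return None
    return range(descr["start"], descr["stop"], descr["step"])


# Metadata copied from the report above, trimmed to the relevant keys.
metadata = {
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}
    ],
    "columns": [],
}

index_range = range_from_pandas_metadata(metadata)
print(len(index_range))  # number of rows the original DataFrame had
```

Assuming pandas is available, `pd.DataFrame(index=pd.RangeIndex(index_range))` would then rebuild an empty frame with the original ten rows, since `pd.RangeIndex` accepts a `range` object.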


