
Posted to issues@arrow.apache.org by "Gianluca Ficarelli (Jira)" <ji...@apache.org> on 2022/09/21 15:30:00 UTC

[jira] [Created] (ARROW-17806) pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0

Gianluca Ficarelli created ARROW-17806:
------------------------------------------

             Summary: pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0
                 Key: ARROW-17806
                 URL: https://issues.apache.org/jira/browse/ARROW-17806
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 9.0.0
            Reporter: Gianluca Ficarelli


A dataframe with a MultiIndex built in this way:
{code:python}
import pandas as pd
df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = df1.set_index("b", append=True)
print(df1)
print(df1.index.get_level_values("idx0")) {code}
gives with Pandas 1.5.0:
{code:java}
          a
idx0 b     
0    20  10
1    21  11
2    22  12

RangeIndex(start=0, stop=3, step=1, name='idx0'){code}
while with Pandas 1.4.4:
{code:java}
          a
idx0 b     
0    20  10
1    21  11
2    22  12

Int64Index([0, 1, 2], dtype='int64', name='idx0'){code}
i.e. with pandas 1.5.0 the "idx0" level is returned as a RangeIndex, whereas with pandas 1.4.4 it is an Int64Index.

With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:
{code:python}
df1.to_parquet(path, engine="pyarrow", index=None) {code}
then reading the same file with:
{code:python}
pd.read_parquet(path, engine="pyarrow") {code}
raises an exception:
{code:java}
 File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
    995 def _extract_index_level(table, result_table, field_name,
    996                          field_name_to_metadata):
--> 997     logical_name = field_name_to_metadata[field_name]['name']
    998     index_name = _backwards_compatible_index_name(field_name, logical_name)
    999     i = table.schema.get_field_index(field_name)

KeyError: 'b'
{code}
while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly. 

Note that the problem disappears if the parquet file is written with index=True (which is not the default value), probably because the RangeIndex is converted to Int64Index:
{code:python}
df1.to_parquet(path, engine="pyarrow", index=True)  {code}
I suspect that the issue is caused by the change from Int64Index to RangeIndex in pandas 1.5.0, and that it may be related to [https://github.com/pandas-dev/pandas/issues/46675].

Should pyarrow be able to handle this case? Or is it an issue with Pandas?


