Posted to jira@arrow.apache.org by "Gianluca Ficarelli (Jira)" <ji...@apache.org> on 2022/09/21 16:20:00 UTC
[jira] [Comment Edited] (ARROW-17806) pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0
[ https://issues.apache.org/jira/browse/ARROW-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607849#comment-17607849 ]
Gianluca Ficarelli edited comment on ARROW-17806 at 9/21/22 4:19 PM:
---------------------------------------------------------------------
Here is another example that may be related: no exception is raised, but the resulting dataframe is missing a MultiIndex level with Pandas 1.5.0 (for the commands to work, /tmp/folder00 must already exist):
{code:java}
import pandas as pd
from pathlib import Path

# write the first file
path = "/tmp/folder00/simple_00.parquet"
df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])
print(df1)
print(df1.index.get_level_values("idx0"))
df1.to_parquet(path, engine="pyarrow", index=None)

# write the second file
path = "/tmp/folder00/simple_01.parquet"
df1 = pd.DataFrame({"a": [30, 31, 32], "b": [40, 41, 42]}, index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(1000, 2000)], names=["idx1", "idx2"])
df1.to_parquet(path, engine="pyarrow", index=None)
print(df1)
print(df1.index.get_level_values("idx0"))
{code}
Printed result with Pandas 1.5.0:
{code:java}
                 a   b
idx1 idx2 idx0
100  200  0     10  20
          1     11  21
          2     12  22
RangeIndex(start=0, stop=3, step=1, name='idx0')
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
RangeIndex(start=0, stop=3, step=1, name='idx0')
{code}
Then:
{code:java}
# pass the base directory to read and concatenate both files
df2 = pd.read_parquet(Path(path).parent, engine="pyarrow")
print(df2)
{code}
Result with pandas 1.5.0 (pyarrow 9.0.0): the resulting dataframe is missing the {{idx0}} level:
{code:java}
            a   b
idx1 idx2
100  200   10  20
     200   11  21
     200   12  22
1000 2000  30  40
     2000  31  41
     2000  32  42
{code}
Result with pandas 1.4.4 (pyarrow 9.0.0): the resulting dataframe is complete:
{code:java}
                 a   b
idx1 idx2 idx0
100  200  0     10  20
          1     11  21
          2     12  22
1000 2000 0     30  40
          1     31  41
          2     32  42
{code}
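To see which index levels each file actually recorded, one can inspect the pandas metadata embedded in the parquet schema. This is a diagnostic sketch (not part of the original report); it assumes pyarrow is installed and writes to a temporary directory instead of /tmp/folder00:
{code:java}
import json
import tempfile
from pathlib import Path

import pandas as pd
import pyarrow.parquet as pq

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "simple_00.parquet"
    df = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]},
                      index=pd.RangeIndex(3, name="idx0"))
    df = pd.concat([df], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])
    df.to_parquet(path, engine="pyarrow", index=None)

    # "index_columns" in the embedded pandas metadata lists what
    # to_parquet serialized for the index; with pandas 1.5.0 a RangeIndex
    # level may show up as a {"kind": "range", ...} descriptor instead of
    # the name of a real column.
    meta = json.loads(pq.read_schema(str(path)).metadata[b"pandas"])
    print(meta["index_columns"])
{code}
Comparing this output between pandas 1.4.4 and 1.5.0 should show whether {{idx0}} is stored as a real column or only as a range descriptor, which the dataset reader then appears to drop.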
By contrast, reading only a single file:
{code:java}
df2 = pd.read_parquet(path, engine="pyarrow")
print(df2)
df2.index.get_level_values("idx0")
{code}
works with both pandas 1.4.4:
{code:java}
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
Int64Index([0, 1, 2], dtype='int64', name='idx0')
{code}
and pandas 1.5.0:
{code:java}
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
RangeIndex(start=0, stop=3, step=1, name='idx0')
{code}
The only difference is the type of the index at level {{idx0}} ({{Int64Index}} vs {{RangeIndex}}).
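If the RangeIndex level inside the MultiIndex is indeed what triggers the problem, a possible workaround is to rebuild the MultiIndex from plain arrays before writing, so that no range descriptor ends up in the file's metadata. This is only a sketch under that assumption, not a confirmed fix:
{code:java}
import pandas as pd

df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]},
                   index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])

# Rebuild the MultiIndex from plain arrays: get_level_values("idx0")
# returns a RangeIndex here under pandas 1.5.0, and converting it to a
# numpy array forces an ordinary integer-backed level.
df1.index = pd.MultiIndex.from_arrays(
    [df1.index.get_level_values(name).to_numpy() for name in df1.index.names],
    names=df1.index.names,
)
print(type(df1.index.get_level_values("idx0")))
{code}
Writing the rebuilt dataframe with {{to_parquet(..., index=None)}} and then reading the directory back should keep all three levels, if the range descriptor is the trigger.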
> pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0
> -----------------------------------------------------------------------------------------------------
>
> Key: ARROW-17806
> URL: https://issues.apache.org/jira/browse/ARROW-17806
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Reporter: Gianluca Ficarelli
> Priority: Major
>
> A dataframe with a MultiIndex built in this way:
> {code:java}
> import pandas as pd
> df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
> df1 = df1.set_index("b", append=True)
> print(df1)
> print(df1.index.get_level_values("idx0")) {code}
> gives with Pandas 1.5.0:
> {code:java}
>          a
> idx0 b
> 0    20  10
> 1    21  11
> 2    22  12
> RangeIndex(start=0, stop=3, step=1, name='idx0')
> {code}
> while with Pandas 1.4.4:
> {code:java}
>          a
> idx0 b
> 0    20  10
> 1    21  11
> 2    22  12
> Int64Index([0, 1, 2], dtype='int64', name='idx0')
> {code}
> i.e. with Pandas 1.5.0 the level is a RangeIndex instead of an Int64Index.
> With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:
> {code:java}
> df1.to_parquet(path, engine="pyarrow", index=None) {code}
> then reading the same file with:
> {code:java}
> pd.read_parquet(path, engine="pyarrow") {code}
> raises an exception:
> {code:java}
> File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
> 995 def _extract_index_level(table, result_table, field_name,
> 996 field_name_to_metadata):
> --> 997 logical_name = field_name_to_metadata[field_name]['name']
> 998 index_name = _backwards_compatible_index_name(field_name, logical_name)
> 999 i = table.schema.get_field_index(field_name)
> KeyError: 'b'
> {code}
> while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly.
> Note that the problem disappears if the parquet file is written with index=True (which is not the default), probably because the RangeIndex is then converted to an Int64Index:
> {code:java}
> df1.to_parquet(path, engine="pyarrow", index=True) {code}
> I suspect that the issue is caused by the change from Int64Index to RangeIndex in pandas 1.5.0, and it may be related to [https://github.com/pandas-dev/pandas/issues/46675].
> Should pyarrow be able to handle this case, or is it an issue with Pandas?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)