Posted to jira@arrow.apache.org by "Gianluca Ficarelli (Jira)" <ji...@apache.org> on 2022/09/21 16:20:00 UTC
[jira] [Comment Edited] (ARROW-17806) pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0
[ https://issues.apache.org/jira/browse/ARROW-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607849#comment-17607849 ]
Gianluca Ficarelli edited comment on ARROW-17806 at 9/21/22 4:19 PM:
---------------------------------------------------------------------
Here is another example that may be related: no exception is raised, but the resulting dataframe is missing a MultiIndex level with Pandas 1.5.0 (for the commands to work, /tmp/folder00 must already exist):
{code:java}
import pandas as pd
from pathlib import Path

# write the first file
path = "/tmp/folder00/simple_00.parquet"
df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])
print(df1)
print(df1.index.get_level_values("idx0"))
df1.to_parquet(path, engine="pyarrow", index=None)

# write the second file
path = "/tmp/folder00/simple_01.parquet"
df1 = pd.DataFrame({"a": [30, 31, 32], "b": [40, 41, 42]}, index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(1000, 2000)], names=["idx1", "idx2"])
df1.to_parquet(path, engine="pyarrow", index=None)
print(df1)
print(df1.index.get_level_values("idx0"))
{code}
Printed result with Pandas 1.5.0:
{code:java}
                 a   b
idx1 idx2 idx0
100  200  0     10  20
          1     11  21
          2     12  22
RangeIndex(start=0, stop=3, step=1, name='idx0')
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
RangeIndex(start=0, stop=3, step=1, name='idx0')
{code}
Then:
{code:java}
# pass the base directory to read and concatenate both files
df2 = pd.read_parquet(Path(path).parent, engine="pyarrow")
print(df2)
{code}
Result with pandas 1.5.0 (pyarrow 9.0.0): the resulting dataframe is missing the {{idx0}} level:
{code:java}
            a   b
idx1 idx2
100  200   10  20
     200   11  21
     200   12  22
1000 2000  30  40
     2000  31  41
     2000  32  42
{code}
Result with pandas 1.4.4 (pyarrow 9.0.0): the resulting dataframe is complete:
{code:java}
                 a   b
idx1 idx2 idx0
100  200  0     10  20
          1     11  21
          2     12  22
1000 2000 0     30  40
          1     31  41
          2     32  42
{code}
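To see which index levels each file actually recorded, one can inspect the pandas metadata embedded in the parquet schema. This is a diagnostic sketch (not part of the original report); it assumes pyarrow is installed and writes to a temporary directory instead of /tmp/folder00:
{code:java}
import json
import tempfile
from pathlib import Path

import pandas as pd
import pyarrow.parquet as pq

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "simple_00.parquet"
    df = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]},
                      index=pd.RangeIndex(3, name="idx0"))
    df = pd.concat([df], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])
    df.to_parquet(path, engine="pyarrow", index=None)

    # "index_columns" in the embedded pandas metadata lists what
    # to_parquet serialized for the index; with pandas 1.5.0 a RangeIndex
    # level may show up as a {"kind": "range", ...} descriptor instead of
    # the name of a real column.
    meta = json.loads(pq.read_schema(str(path)).metadata[b"pandas"])
    print(meta["index_columns"])
{code}
Comparing this output between pandas 1.4.4 and 1.5.0 should show whether {{idx0}} is stored as a real column or only as a range descriptor, which the dataset reader then appears to drop.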
By contrast, reading only a single file:
{code:java}
df2 = pd.read_parquet(path, engine="pyarrow")
print(df2)
df2.index.get_level_values("idx0")
{code}
works with both pandas 1.4.4:
{code:java}
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
Int64Index([0, 1, 2], dtype='int64', name='idx0')
{code}
and pandas 1.5.0:
{code:java}
                 a   b
idx1 idx2 idx0
1000 2000 0     30  40
          1     31  41
          2     32  42
RangeIndex(start=0, stop=3, step=1, name='idx0')
{code}
The only difference is the type of the index at level {{idx0}} ({{Int64Index}} vs {{RangeIndex}}).
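If the RangeIndex level inside the MultiIndex is indeed what triggers the problem, a possible workaround is to rebuild the MultiIndex from plain arrays before writing, so that no range descriptor ends up in the file's metadata. This is only a sketch under that assumption, not a confirmed fix:
{code:java}
import pandas as pd

df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]},
                   index=pd.RangeIndex(3, name="idx0"))
df1 = pd.concat([df1], axis="index", keys=[(100, 200)], names=["idx1", "idx2"])

# Rebuild the MultiIndex from plain arrays: get_level_values("idx0")
# returns a RangeIndex here under pandas 1.5.0, and converting it to a
# numpy array forces an ordinary integer-backed level.
df1.index = pd.MultiIndex.from_arrays(
    [df1.index.get_level_values(name).to_numpy() for name in df1.index.names],
    names=df1.index.names,
)
print(type(df1.index.get_level_values("idx0")))
{code}
Writing the rebuilt dataframe with {{to_parquet(..., index=None)}} and then reading the directory back should keep all three levels, if the range descriptor is the trigger.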
> pyarrow fails to write and read a dataframe with MultiIndex containing a RangeIndex with Pandas 1.5.0
> -----------------------------------------------------------------------------------------------------
>
> Key: ARROW-17806
> URL: https://issues.apache.org/jira/browse/ARROW-17806
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Reporter: Gianluca Ficarelli
> Priority: Major
>
> A dataframe with a MultiIndex built in this way:
> {code:java}
> import pandas as pd
> df1 = pd.DataFrame({"a": [10, 11, 12], "b": [20, 21, 22]}, index=pd.RangeIndex(3, name="idx0"))
> df1 = df1.set_index("b", append=True)
> print(df1)
> print(df1.index.get_level_values("idx0")) {code}
> gives with Pandas 1.5.0:
> {code:java}
>          a
> idx0 b
> 0    20  10
> 1    21  11
> 2    22  12
> RangeIndex(start=0, stop=3, step=1, name='idx0')
> {code}
> while with Pandas 1.4.4:
> {code:java}
>          a
> idx0 b
> 0    20  10
> 1    21  11
> 2    22  12
> Int64Index([0, 1, 2], dtype='int64', name='idx0')
> {code}
> i.e. with Pandas 1.5.0 the level is a RangeIndex instead of an Int64Index.
> With pandas 1.5.0 and pyarrow 9.0.0, writing this DataFrame with index=None (i.e. the default value) as in:
> {code:java}
> df1.to_parquet(path, engine="pyarrow", index=None) {code}
> then reading the same file with:
> {code:java}
> pd.read_parquet(path, engine="pyarrow") {code}
> raises an exception:
> {code:java}
> File /<venv>/lib/python3.9/site-packages/pyarrow/pandas_compat.py:997, in _extract_index_level(table, result_table, field_name, field_name_to_metadata)
> 995 def _extract_index_level(table, result_table, field_name,
> 996 field_name_to_metadata):
> --> 997 logical_name = field_name_to_metadata[field_name]['name']
> 998 index_name = _backwards_compatible_index_name(field_name, logical_name)
> 999 i = table.schema.get_field_index(field_name)
> KeyError: 'b'
> {code}
> while with pandas 1.4.4 and pyarrow 9.0.0 it works correctly.
> Note that the problem disappears if the parquet file is written with index=True (which is not the default), probably because the RangeIndex is then converted to an Int64Index:
> {code:java}
> df1.to_parquet(path, engine="pyarrow", index=True) {code}
> I suspect that the issue is caused by the change from Int64Index to RangeIndex in pandas 1.5.0, and it may be related to [https://github.com/pandas-dev/pandas/issues/46675].
> Should pyarrow be able to handle this case, or is it an issue with Pandas?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)