Posted to issues@arrow.apache.org by "Troy Zimmerman (Jira)" <ji...@apache.org> on 2020/09/28 18:34:00 UTC
[jira] [Created] (ARROW-10122) [Python] Selecting one column of
multi-index results in a duplicated value column.
Troy Zimmerman created ARROW-10122:
--------------------------------------
Summary: [Python] Selecting one column of multi-index results in a duplicated value column.
Key: ARROW-10122
URL: https://issues.apache.org/jira/browse/ARROW-10122
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.1
Environment: arrow 1.0.1
parquet 1.5.1
pandas 1.1.0
pyarrow 1.0.1
Reporter: Troy Zimmerman
When I read only one column of a multi-index from a Parquet dataset, that column is duplicated as a value column in the resulting pandas DataFrame.
{code:python}
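>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
>>> df = table.to_pandas().set_index(["first", "second"])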
>>> print(df)
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
>>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
>>> data = ds.dataset("/tmp/test.parquet")
{code}
This works as expected, as does selecting all or no columns.
{code:python}
>>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
{code}
Selecting only one of the index columns does not work as expected: the {{first}} column appears both as the index and as a value column.
{code:python}
>>> print(data.to_table(columns=["first", "value"]).to_pandas())
       first  value
first
0          0      0
1          1      1
2          2      2
3          3      3
4          4      4
{code}
This is easy to work around by including every column of the multi-index in the {{columns}} argument of {{to_table}}, but does this behavior make sense?
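For reference, the workaround can be sketched roughly as follows. This is only an illustration, not part of pyarrow: {{columns_with_full_index}} and {{read_with_index}} are hypothetical helper names, and the caller is assumed to already know the index column names ({{first}} and {{second}} here). The idea is to always request every index column, then drop the index levels that were not wanted.

```python
def columns_with_full_index(wanted, index_columns):
    """Hypothetical helper: extend `wanted` with any index columns it lacks.

    Returns (columns to request, index levels that were added and can be
    dropped again after conversion to pandas).
    """
    extra = [name for name in index_columns if name not in wanted]
    return list(wanted) + extra, extra


def read_with_index(dataset, wanted, index_columns):
    """Hypothetical wrapper around the calls shown in this report.

    `dataset` is assumed to be a pyarrow.dataset.Dataset such as `data` above.
    """
    request, extra = columns_with_full_index(wanted, index_columns)
    df = dataset.to_table(columns=request).to_pandas()
    if extra:
        # Remove the index levels that were requested only to avoid the
        # duplicated column; drop=True discards them instead of adding
        # them back as value columns.
        df = df.reset_index(level=extra, drop=True)
    return df
```

With the dataset from this report, {{read_with_index(data, ["first", "value"], ["first", "second"])}} would request all three columns and then drop the {{second}} level, which should leave {{first}} as the only index and {{value}} as the only value column.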