Posted to jira@arrow.apache.org by "Troy Zimmerman (Jira)" <ji...@apache.org> on 2020/09/28 18:51:00 UTC
[jira] [Commented] (ARROW-10122) [Python] Selecting one column of
multi-index results in a duplicated value column.
[ https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203460#comment-17203460 ]
Troy Zimmerman commented on ARROW-10122:
----------------------------------------
This seems like it could be related to ARROW-9302?
> [Python] Selecting one column of multi-index results in a duplicated value column.
> ----------------------------------------------------------------------------------
>
> Key: ARROW-10122
> URL: https://issues.apache.org/jira/browse/ARROW-10122
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
> Reporter: Troy Zimmerman
> Priority: Minor
>
> When I read one column of a multi-index, that column is duplicated as a value column in the resulting Pandas data frame.
> {code:python}
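> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> import pyarrow.parquet as pq
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
> >>> df = table.to_pandas().set_index(["first", "second"])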
> >>> print(df)
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> {code}
> This does not work as expected, as the {{first}} column is both an index and a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>        first  value
> first
> 0          0      0
> 1          1      1
> 2          2      2
> 3          3      3
> 4          4      4
> {code}
> This is easy to work around by specifying the full multi-index in {{to_table}}, but does this behavior make sense?
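For anyone hitting this, here is a minimal sketch of the workaround described above: make sure every index level is part of the column selection before calling {{to_table}}. The helper name `with_full_index` is hypothetical (it is not part of pyarrow), and the index column names are passed in by hand here. They could probably also be read from the schema's pandas metadata, but that is an assumption I have not checked against this dataset.

```python
def with_full_index(columns, index_columns):
    """Return a column selection that contains every index column.

    Hypothetical helper, not a pyarrow API. Index columns come first,
    followed by the requested non-index columns in their original order.
    """
    index_columns = list(index_columns)
    # Keep only the requested columns that are not already index levels.
    value_columns = [c for c in columns if c not in index_columns]
    return index_columns + value_columns


# Selecting only "first" and "value" is expanded to the full multi-index.
selection = with_full_index(["first", "value"], ["first", "second"])
print(selection)

# Intended use with the dataset from the report (not executed here):
# data.to_table(columns=selection).to_pandas()
```

With the inputs above, the selection is the same column list as the call in the report that works as expected. The extra index level can then be dropped on the pandas side if it is not needed.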