Posted to jira@arrow.apache.org by "Troy Zimmerman (Jira)" <ji...@apache.org> on 2020/09/28 18:51:00 UTC

[jira] [Commented] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.

    [ https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203460#comment-17203460 ] 

Troy Zimmerman commented on ARROW-10122:
----------------------------------------

This seems like it could be related to ARROW-9302?

> [Python] Selecting one column of multi-index results in a duplicated value column.
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-10122
>                 URL: https://issues.apache.org/jira/browse/ARROW-10122
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
>            Reporter: Troy Zimmerman
>            Priority: Minor
>
> When I read one column of a multi-index, that column is duplicated as a value column in the resulting Pandas data frame.
> {code:python}
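> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> import pyarrow.parquet as pq
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})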
> >>> df = table.to_pandas().set_index(["first", "second"])
> >>> print(df)
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>               value
> first second
> 0     0           0
> 1     1           1
> 2     2           2
> 3     3           3
> 4     4           4
> {code}
> Selecting only one of the two index columns does not work as expected: the {{first}} column shows up both as the index and as a value column.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>        first  value
> first
> 0          0      0
> 1          1      1
> 2          2      2
> 3          3      3
> 4          4      4
> {code}
> This is easy to work around by specifying the full multi-index in {{to_table}}, but is this behavior intended?
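
The workaround described in the report (always requesting every index level in to_table) could be wrapped in a small helper. This is only a minimal sketch: the name with_full_index is hypothetical, not part of pyarrow, and it assumes the caller already knows the names of the index levels.

```python
def with_full_index(columns, index_levels):
    """Return `columns` plus any index levels missing from it.

    Hypothetical helper (not a pyarrow API): the original column order
    is kept and missing index levels are appended at the end.
    """
    columns = list(columns)
    missing = [level for level in index_levels if level not in columns]
    return columns + missing


# The selection from the report that triggered the duplicated column,
# extended so that both index levels are requested.
selection = with_full_index(["first", "value"], ["first", "second"])
print(selection)
```

The resulting list could then be passed as data.to_table(columns=selection), which corresponds to the full multi-index selection that the report says behaves as expected.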



--
This message was sent by Atlassian Jira
(v8.3.4#803005)