Posted to issues@arrow.apache.org by "Troy Zimmerman (Jira)" <ji...@apache.org> on 2020/09/28 18:34:00 UTC

[jira] [Created] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.

Troy Zimmerman created ARROW-10122:
--------------------------------------

             Summary: [Python] Selecting one column of multi-index results in a duplicated value column.
                 Key: ARROW-10122
                 URL: https://issues.apache.org/jira/browse/ARROW-10122
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
         Environment: arrow 1.0.1
parquet 1.5.1
pandas 1.1.0
pyarrow 1.0.1
            Reporter: Troy Zimmerman


When I read one column of a multi-index, that column is duplicated as a value column in the resulting Pandas data frame.
{code:python}
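>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})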
>>> df = table.to_pandas().set_index(["first", "second"])
>>> print(df)
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
>>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
>>> data = ds.dataset("/tmp/test.parquet")
{code}
This works as expected, as does selecting all or no columns.
{code:python}
>>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
              value
first second
0     0           0
1     1           1
2     2           2
3     3           3
4     4           4
{code}
This does not work as expected, as the {{first}} column is both an index and a value.
{code:python}
>>> print(data.to_table(columns=["first", "value"]).to_pandas())
       first  value
first
0          0      0
1          1      1
2          2      2
3          3      3
4          4      4
{code}
This is easy to work around by passing every level of the multi-index in the {{columns}} argument of {{to_table}}, but does this behavior make sense?
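
For reference, the workaround can be sketched as a small helper that adds any missing index levels to a column selection before it is passed to {{to_table}}. The helper name {{with_full_index}} is hypothetical. It assumes the Parquet file carries the pandas metadata that pyarrow writes under the {{b"pandas"}} schema key, and that the named index levels are listed as strings under {{index_columns}}.

```python
import json


def with_full_index(columns, pandas_metadata_json):
    """Return `columns` plus any pandas index levels missing from it.

    Hypothetical helper: `pandas_metadata_json` is assumed to be the JSON
    document stored under the b"pandas" key of the schema metadata.
    """
    meta = json.loads(pandas_metadata_json)
    # Serialized index levels are listed by column name (str); a RangeIndex
    # is described by a dict instead and has no column to select.
    index_names = [c for c in meta.get("index_columns", []) if isinstance(c, str)]
    missing = [name for name in index_names if name not in columns]
    return list(columns) + missing


# Intended use with the dataset from the report (assumes the dataset schema
# exposes the pandas metadata):
#   cols = with_full_index(["first", "value"], data.schema.metadata[b"pandas"])
#   df = data.to_table(columns=cols).to_pandas()
```

If the extra level is not wanted in the result, it could then be dropped on the pandas side, for example with {{df.droplevel("second")}}.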



--
This message was sent by Atlassian Jira
(v8.3.4#803005)