Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/10/20 11:22:00 UTC
[jira] [Comment Edited] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns

    [ https://issues.apache.org/jira/browse/ARROW-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621017#comment-17621017 ] 

Alenka Frim edited comment on ARROW-17360 at 10/20/22 11:21 AM:
----------------------------------------------------------------

Thank you for reporting!

I would say this is not the expected behaviour. For the {{parquet}} and {{feather}} formats, the {{read}} methods preserve the order of the selected columns:
{code:python}
import pyarrow as pa
table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
pq.read_table('example.parquet', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]

import pyarrow.feather as feather
feather.write_feather(table, 'example_feather')
feather.read_table('example_feather', columns=['b', 'a'])
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}
FWIU, looking at the code in [pyarrow/_orc.pyx|https://github.com/apache/arrow/blob/962121062e4b13c148f24a6d4fa4b1a2f1be0d88/python/pyarrow/_orc.pyx#L379-L382] and [arrow/adapters/orc/adapter.cc|https://github.com/apache/arrow/blob/183517c8baad039c0100687c8a405bd4d8b404a7/cpp/src/arrow/adapters/orc/adapter.cc#L336-L341], I think the behaviour comes from [Apache ORC|https://github.com/apache/orc/blob/7f7362bdcecfd48e5ff9f4a3255100e3ea724f6f/c%2B%2B/include/orc/Reader.hh#L158-L165] and could therefore be opened as an issue there (about it following the column order of the original schema).
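
To make the suspected cause concrete, here is a toy model in plain Python. It is not the actual ORC or Arrow code: it only assumes that the reader treats the requested names as an inclusion set and walks the file schema in its stored order. The function names {{select_in_schema_order}} and {{select_in_requested_order}} are hypothetical and exist only for this illustration.

```python
def select_in_schema_order(schema_names, requested):
    """Treat `requested` as a set and keep the file's schema order."""
    wanted = set(requested)
    return [name for name in schema_names if name in wanted]


def select_in_requested_order(schema_names, requested):
    """Follow the caller's order, as the parquet and feather readers do."""
    available = set(schema_names)
    missing = [name for name in requested if name not in available]
    if missing:
        raise KeyError(f"columns not in schema: {missing}")
    return list(requested)


schema = ["a", "b"]
print(select_in_schema_order(schema, ["b", "a"]))     # ['a', 'b']
print(select_in_requested_order(schema, ["b", "a"]))  # ['b', 'a']
```

The first function reproduces the ordering seen in the report, the second the ordering a user would expect.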

Nevertheless, we have two options to make this work correctly:
 * Add a re-ordering step in {{pyarrow}}, as is done in the [feather implementation|https://github.com/apache/arrow/blob/0f91e684ddda3dfd11d376c2755bbc3071c3099d/python/pyarrow/feather.py#L280-L281].
 * Even better, {{pandas}} could use the new {{dataset}} API to read {{orc}} files, like so:
{code:python}
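import pyarrow.orc as orc
import pyarrow.dataset as ds

# Write the example table from above so there is an ORC file to read
orc.write_table(table, "example.orc")

dataset = ds.dataset("example.orc", format="orc")
dataset.to_table(columns=['b', 'a'])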
# pyarrow.Table
# b: string
# a: int64
# ----
# b: [["a","b","c"]]
# a: [[1,2,3]]
{code}


> [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
> -----------------------------------------------------------------------
>
>                 Key: ARROW-17360
>                 URL: https://issues.apache.org/jira/browse/ARROW-17360
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 8.0.1
>            Reporter: Matthew Roeschke
>            Priority: Major
>              Labels: orc
>
> xref [https://github.com/pandas-dev/pandas/issues/47944]
>  
> {code:python}
> In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
> # pandas main branch / 1.5
> In [2]: df.to_orc("abc")
> In [3]: pd.read_orc("abc", columns=['b', 'a'])
> Out[3]:
>    a  b
> 0  1  a
> 1  2  b
> 2  3  c
> In [4]: import pyarrow.orc as orc
> In [5]: orc_file = orc.ORCFile("abc")
> # reordered to a, b
> In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
> Out[6]:
>    a  b
> 0  1  a
> 1  2  b
> 2  3  c
> # reordered to a, b
> In [7]: orc_file.read(columns=['b', 'a'])
> Out[7]:
> pyarrow.Table
> a: int64
> b: string
> ----
> a: [[1,2,3]]
> b: [["a","b","c"]] {code}


