Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/01 14:21:00 UTC

[jira] [Commented] (ARROW-2366) [Python][C++][Parquet] Support reading Parquet files having a permutation of column order

    [ https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072810#comment-17072810 ] 

Joris Van den Bossche commented on ARROW-2366:
----------------------------------------------

This is now implemented in the C++ Datasets project:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# create dummy dataset with column order permutation
import pathlib
basedir = pathlib.Path(".")
case = basedir / "dataset_column_order_permutation"
case.mkdir(exist_ok=True)

table1 = pa.table([[1, 2, 3], [.1, .2, .3]], names=['a', 'b'])
pq.write_table(table1, case / "data1.parquet")

table2 = pa.table([[.4, .5, .6], [4, 5, 6]], names=['b', 'a'])
pq.write_table(table2, case / "data2.parquet")

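# reading with the old Python implementation raises on the schema mismatch;
# catch the error so the rest of the script still runs
try:
    pq.read_table(str(case))
except Exception as exc:
    print(f"pq.read_table failed: {exc}")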

# this reads fine
ds.dataset(str(case)).to_table().to_pandas()
{code}

So once pyarrow.parquet uses the datasets API under the hood (ARROW-8039), this issue should be solved. We can still add a test for it before closing this issue.

> [Python][C++][Parquet] Support reading Parquet files having a permutation of column order
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-2366
>                 URL: https://issues.apache.org/jira/browse/ARROW-2366
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>
> See discussion in https://github.com/dask/fastparquet/issues/320


