
Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/09/04 15:46:00 UTC

[jira] [Commented] (ARROW-1459) [Python] PyArrow fails to load partitioned parquet files with non-primitive types

    [ https://issues.apache.org/jira/browse/ARROW-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152738#comment-16152738 ] 

Wes McKinney commented on ARROW-1459:
-------------------------------------

I think this is ARROW-1357, which has already been fixed in trunk. If you are on Linux, can you try out a nightly build and confirm?

{{conda install pyarrow -c twosigma}}
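
A quick way to check the result, offered only as a sketch: in the reproduction quoted below, every row is built so that {{_2 == [2 * _1, 2 * _1 + 1]}}, so any row that breaks this relation got its list from the wrong place. The helper name {{find_mismatched_rows}} is hypothetical and is not part of pyarrow.

```python
def find_mismatched_rows(rows):
    """Return the `_1` keys whose `_2` list is not [2 * _1, 2 * _1 + 1].

    `rows` is any iterable of (_1, _2) pairs. The expected relation is an
    assumption that only holds for the data built in the quoted reproduction.
    """
    bad = []
    for key, values in rows:
        expected = [2 * key, 2 * key + 1]
        if list(values) != expected:
            bad.append(key)
    return bad


# Rows copied from the faulty pq.read_table output in the report.
reported = [
    (0, [0, 1]), (2, [4, 5]), (4, [8, 9]), (6, [12, 13]), (8, [16, 17]),
    (1, [0, 1]), (3, [4, 5]), (5, [8, 9]), (7, [12, 13]), (9, [16, 17]),
]
print(find_mismatched_rows(reported))  # [1, 3, 5, 7, 9]
```

With a real table, the rows could be passed as {{pq.read_table('df_parts.parquet').to_pandas().itertuples(index=False)}}; an empty result would suggest the fix is in place.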

> [Python] PyArrow fails to load partitioned parquet files with non-primitive types
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-1459
>                 URL: https://issues.apache.org/jira/browse/ARROW-1459
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.6.0
>            Reporter: Jonas Amrich
>
> When reading partitioned parquet files that contain lists (tested with files produced by Spark), the resulting table seems to contain data loaded from only one partition. Primitive types seem to be loaded correctly.
> It can be reproduced using the following code (arrow 0.6.0, spark 2.1.1):
> {noformat}
> >>> import numpy as np
> >>> import pyarrow.parquet as pq
> >>> df = spark.createDataFrame(list(zip(np.arange(10).tolist(), np.arange(20).reshape((10,2)).tolist())))
> >>> df.toPandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> >>> df.repartition(2).write.parquet('df_parts.parquet')
> >>> pq.read_table('df_parts.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   2    [4, 5]
> 2   4    [8, 9]
> 3   6  [12, 13]
> 4   8  [16, 17]
> 5   1    [0, 1]
> 6   3    [4, 5]
> 7   5    [8, 9]
> 8   7  [12, 13]
> 9   9  [16, 17]
> {noformat}
> When the data is loaded using Spark or coalesced into one partition, everything works as expected:
> {noformat}
> >>> spark.read.parquet('df_parts.parquet').toPandas()
>    _1        _2
> 0   1    [2, 3]
> 1   3    [6, 7]
> 2   5  [10, 11]
> 3   7  [14, 15]
> 4   9  [18, 19]
> 5   0    [0, 1]
> 6   2    [4, 5]
> 7   4    [8, 9]
> 8   6  [12, 13]
> 9   8  [16, 17]
> >>> df.coalesce(1).write.parquet('df_single.parquet')
> >>> pq.read_table('df_single.parquet').to_pandas()
>    _1        _2
> 0   0    [0, 1]
> 1   1    [2, 3]
> 2   2    [4, 5]
> 3   3    [6, 7]
> 4   4    [8, 9]
> 5   5  [10, 11]
> 6   6  [12, 13]
> 7   7  [14, 15]
> 8   8  [16, 17]
> 9   9  [18, 19]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)