Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/01/24 04:03:00 UTC

[jira] [Updated] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

     [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1956:
--------------------------------
    Summary: [Python] Support reading specific partitions from a partitioned parquet dataset  (was: Support reading specific partitions from a partitioned parquet dataset)

> [Python] Support reading specific partitions from a partitioned parquet dataset
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-1956
>                 URL: https://issues.apache.org/jira/browse/ARROW-1956
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>    Affects Versions: 0.8.0
>         Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>            Reporter: Suvayu Ali
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.9.0
>
>         Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset. This is particularly useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (the salient points are quoted below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve this by providing a list of files/directories to ParquetDataset, but it didn't work (see the sketch after this list).
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
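> To make the pyarrow attempt concrete, this is roughly the kind of call I mean (a sketch only; the directory names are illustrative and assume a {{key=value}} partition layout under {{datadir}}):
> {code:none}
> import pyarrow.parquet as pq
>
> # 'datadir' is laid out as datadir/year=2016/..., datadir/year=2017/..., etc.
> # Pass only the partition directories of interest to ParquetDataset.
> dataset = pq.ParquetDataset(['datadir/year=2016', 'datadir/year=2017'])
> table = dataset.read()
> {code}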
> I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
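> On the writing side, a helper along the lines of {{pyarrow.parquet.write_to_dataset}} would cover this; here is a minimal sketch of the usage I have in mind (column names and the output path are illustrative, and I am not sure which pyarrow version first ships this helper):
> {code:none}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Column names are illustrative; 'year' becomes the partition key.
> df = pd.DataFrame({'year': [2016, 2016, 2017], 'value': [1.0, 2.0, 3.0]})
> table = pa.Table.from_pandas(df)
>
> # Expected to produce datadir/year=2016/... and datadir/year=2017/...
> pq.write_to_dataset(table, root_path='datadir', partition_cols=['year'])
> {code}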


