Posted to issues@arrow.apache.org by "Ying Wang (JIRA)" <ji...@apache.org> on 2018/09/07 19:17:00 UTC

[jira] [Commented] (ARROW-1956) [Python] Support reading specific partitions from a partitioned parquet dataset

    [ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607546#comment-16607546 ] 

Ying Wang commented on ARROW-1956:
----------------------------------

I don't know if this is helpful to others, but I found myself needing to ingest an entire Parquet dataset at once (I work at a database company), and I came up with this:

```python
import pyarrow.parquet as pq

dataset = pq.ParquetDataset('/path/to/dataset')

# A ParquetDataset is composed of a list of ParquetDatasetPieces
for dataset_piece in dataset.pieces:
    # dataset.partitions is a ParquetPartitions object; passing it makes
    # the partition keys part of each piece's schema
    df = dataset_piece.read(partitions=dataset.partitions).to_pandas()
    # do whatever with the dataframe
```

It'll be slow, but you can parallelize it however you want, and each dataframe will contain the full dataset schema (as opposed to reading an individual ParquetFile directly, which will not include the partition keys as part of the schema).
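
For example, here is a minimal sketch of one way to parallelize the loop with the standard library. The path and worker count are placeholders, and whether threads or processes pay off will depend on how much time is spent in the pandas conversion:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('/path/to/dataset')

def read_piece(piece):
    # Reading with the dataset's partitions keeps the partition keys
    # as columns in each resulting dataframe.
    return piece.read(partitions=dataset.partitions).to_pandas()

with ThreadPoolExecutor(max_workers=4) as pool:
    for df in pool.map(read_piece, dataset.pieces):
        # do whatever with each dataframe
        pass
```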

> [Python] Support reading specific partitions from a partitioned parquet dataset
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-1956
>                 URL: https://issues.apache.org/jira/browse/ARROW-1956
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>    Affects Versions: 0.8.0
>         Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>            Reporter: Suvayu Ali
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.11.0
>
>         Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This is very useful in case of large datasets.  I have attached a small script that creates a dataset and shows what is expected when reading (quoting salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of files/directories to ParquetDataset, but it didn't work.
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the end I did it by hand, creating the directory hierarchy and writing the individual files myself (similar to the implementation in the attached script).  Again, in PySpark I can do
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
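
For anyone landing on this thread later: the Fix Version above is 0.11.0, and pyarrow now covers both sides of this request directly, via a filters keyword on ParquetDataset and pq.write_to_dataset. A minimal sketch, assuming a dataset partitioned by a year key (in older releases partition values are compared as strings, so the filter value type may need adjusting for your release):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read only the pieces whose partition key matches the filter;
# the other partitions are skipped entirely.
dataset = pq.ParquetDataset('/path/to/dataset',
                            filters=[('year', '=', '2017')])
df = dataset.read().to_pandas()

# Write a partitioned dataset, analogous to
# df.write.partitionBy(...) in PySpark.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='/path/to/output',
                    partition_cols=['year'])
```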


