Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/11/12 17:15:00 UTC
[jira] [Commented] (ARROW-1956) [Python] Support reading specific
partitions from a partitioned parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230803#comment-17230803 ]
Ben Kietzman commented on ARROW-1956:
-------------------------------------
This can be accomplished with {{read_table}}, which infers the partitioning from the directory structure and reads only the partitions that match a given filter. For example, to read only partitions 3 and 7:
{code}
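import pyarrow.parquet as pq

# Each inner list is an AND-group of predicates; the outer list ORs the groups.
pq.read_table('base_dir', filters=[
    [('partition_id', '==', 3)],
    [('partition_id', '==', 7)],
])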
{code}
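As a rough illustration of how the nested filter list above is interpreted, here is a minimal pure-Python sketch. It is not pyarrow's implementation. It assumes hive-style {{key=value}} directory names, and the helper names ({{parse_partition_keys}}, {{matches}}), the file listing, and the integer parsing of values are all hypothetical choices made for this example. The filter is in disjunctive normal form: a file is selected when every predicate in at least one inner list holds.

```python
import operator

# Comparison operators supported by this sketch (a subset, for illustration).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def parse_partition_keys(path):
    """Collect key=value directory segments from a path into a dict."""
    keys = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Assumption for this sketch: integer-looking values become ints.
            keys[key] = int(value) if value.lstrip("-").isdigit() else value
    return keys


def matches(keys, filters):
    """True if any inner list (OR) has all of its predicates (AND) satisfied."""
    return any(
        all(OPS[op](keys[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


# Hypothetical file listing for a dataset partitioned on partition_id.
paths = [f"base_dir/partition_id={i}/part-0.parquet" for i in range(10)]
filters = [
    [("partition_id", "==", 3)],
    [("partition_id", "==", 7)],
]

# Keep only the files whose partition keys satisfy the filter.
selected = [p for p in paths if matches(parse_partition_keys(p), filters)]
print(selected)
```

Putting both predicates in a single inner list would instead require {{partition_id}} to equal 3 and 7 at the same time, which matches nothing. That is why each value gets its own inner list.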
> [Python] Support reading specific partitions from a partitioned parquet dataset
> -------------------------------------------------------------------------------
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
> Reporter: Suvayu Ali
> Priority: Minor
> Labels: dataset, dataset-parquet-read, parquet
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset. This is very useful for large datasets. I have attached a small script that creates a dataset and shows what is expected when reading (salient points quoted below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of files/directories to ParquetDataset, but it didn't work:
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files. In the end I did it by hand by creating the directory hierarchies, and writing the individual files myself (similar to the implementation in the attached script). Again, in PySpark I can do
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.