Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2020/09/16 13:21:00 UTC

[jira] [Commented] (ARROW-9682) [Python] Unable to specify the partition style with pq.write_to_dataset

    [ https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196967#comment-17196967 ] 

Lance Dacey commented on ARROW-9682:
------------------------------------

Excellent. [~jorisvandenbossche], will there be a way to repartition datasets? My use case is this:

1) Every 30 minutes I download data from a source into parquet files with UUID filenames (each file contains only the records created or updated since the last download, so I could not think of a meaningful name to produce from a filename callback). That is 48 parquet files per day.
2) The data is then partitioned on created_date, which creates even more files (some of them quite small).
3) When I query the dataset, I therefore have to read in a lot of very small files.

I would then want to read the data back and repartition it using a callback function, so that the dozens of files in the partition ("date", "==", "2020-09-15") become a single consolidated 2020-09-15.parquet to keep things tidy. I know I can do this with Spark, but it would be nice to have a native pyarrow method.

> [Python] Unable to specify the partition style with pq.write_to_dataset
> -----------------------------------------------------------------------
>
>                 Key: ARROW-9682
>                 URL: https://issues.apache.org/jira/browse/ARROW-9682
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: Ubuntu 18.04
> Python 3.7
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset-parquet-write, parquet, parquetWriter
>
> I am able to import and test DirectoryPartitioning, but I am not able to figure out a way to write a dataset using this feature. It seems like write_to_dataset defaults to the "hive" style. Is there a way to test this?
> {code:python}
> import pyarrow as pa
> from pyarrow.dataset import DirectoryPartitioning
> partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
> print(partitioning.parse("/2009/11/3"))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)