Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2021/04/29 09:31:00 UTC

[jira] [Commented] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

    [ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335292#comment-17335292 ] 

Lance Dacey commented on ARROW-12365:
-------------------------------------

@jorisvandenbossche I will close this issue in favor of an overwrite option for partitions, since that is the only reason I use partition_filename_cb:

https://issues.apache.org/jira/browse/ARROW-12358
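
For reference, here is a rough sketch of what I mean (the small table, the paths, and the schema below are just placeholders): the legacy writer lets me pin the file name per partition, so re-running a job replaces that partition's single file in place, while ds.write_dataset() only exposes basename_template, so files left over from earlier runs are not guaranteed to be replaced.

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder table standing in for the real data.
table = pa.table({
    "id": [1, 2],
    "date_id": [20210413, 20210413],
    "updated_at": ["2021-04-13T01:00", "2021-04-13T02:00"],
})

# Legacy writer: one predictable file per partition, so re-running the job
# replaces /date_id=20210413/20210413.parquet in place.
pq.write_to_dataset(
    table,
    root_path="final/",
    partition_cols=["date_id"],
    use_legacy_dataset=True,
    partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
)

# New writer: file names come from basename_template only; there is no
# per-partition callback, so files from earlier runs are not guaranteed
# to be replaced.
ds.write_dataset(
    table,
    base_dir="final_v2/",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date_id", pa.int64())]), flavor="hive"),
    basename_template="part-{i}.parquet",
)
{code}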

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> ------------------------------------------------------------------
>
>                 Key: ARROW-12365
>                 URL: https://issues.apache.org/jira/browse/ARROW-12365
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset, parquet, python
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a file within a partition will have a specific name. 
> My use case is that I need to report on the final version of the data, and our visualization tool (Power BI) connects directly to our parquet files on Azure Blob.
> 1) Download data every hour based on an updated_at timestamp (this data is partitioned by date)
> 2) Transform the data that was just downloaded and save it into a "staging" dataset. This dataset holds every version of the data, so there are many files within each partition (up to 24 files per date partition, since we download hourly)
> 3) Filter the transformed data to read a subset of columns, sort it by the updated_at timestamp, drop duplicates on the unique constraint, and then partition and save it with partition_filename_cb (a rough sketch of this step follows the snippet below). In the example below, if I partition by the "date_id" column, my dataset structure will be "/date_id=20210413/20210413.parquet"
> {code:python}
>     use_legacy_dataset=True,
>     partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
> {code}
> Ultimately, I can be sure that this final dataset has exactly one file per partition and that I only have the latest version of each row, based on the maximum updated_at timestamp. My visualization tool can safely connect to and incrementally refresh from this dataset.
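> Roughly, step 3 looks like this (a sketch only; the column names, the unique key "id", and the paths are placeholders):
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> # Read only the reporting columns from the staging dataset, filtered to the
> # date partition that was just refreshed.
> staging = ds.dataset("staging/", format="parquet", partitioning="hive")
> table = staging.to_table(
>     columns=["id", "date_id", "updated_at", "value"],
>     filter=ds.field("date_id") == 20210413,
> )
>
> # Keep only the latest version of each row based on updated_at.
> df = table.to_pandas().sort_values("updated_at").drop_duplicates(subset=["id"], keep="last")
>
> # One file per partition: re-running this replaces /date_id=20210413/20210413.parquet.
> pq.write_to_dataset(
>     pa.Table.from_pandas(df, preserve_index=False),
>     root_path="final/",
>     partition_cols=["date_id"],
>     use_legacy_dataset=True,
>     partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
> )
> {code}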
>  
>  
> An alternative solution would be to allow us to overwrite anything in an existing partition. I do not care about the file names so much as I want to ensure that I am fully replacing any data that might already exist in a partition, and I want to limit the number of physical files. A rough sketch of that behavior follows.
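> To make that concrete (a sketch only; "latest" stands for the deduplicated table from the sketch above, and a local path stands in for Azure Blob), the overwrite behavior I am after is roughly what I have to do by hand today:
> {code:python}
> import shutil
> from pathlib import Path
>
> import pyarrow.parquet as pq
>
> # Manual "overwrite partition" workaround: clear whatever already exists for
> # the partition so no stale files survive, then write the fresh data back as
> # a single file with a known name.
> partition_dir = Path("final/date_id=20210413")
> if partition_dir.exists():
>     shutil.rmtree(partition_dir)
> partition_dir.mkdir(parents=True)
>
> # "latest" is a placeholder for the deduplicated table for this partition;
> # the partition key lives in the directory name, so the column itself is
> # dropped before writing.
> pq.write_table(latest.drop(["date_id"]), partition_dir / "20210413.parquet")
> {code}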
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)