Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2022/03/04 13:18:00 UTC

[jira] [Closed] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

     [ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Dacey closed ARROW-12365.
-------------------------------
    Fix Version/s: 6.0.0
       Resolution: Resolved

The delete_matching option for ds.write_dataset() solves this issue.
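
For reference, a minimal sketch of the non-legacy approach, assuming pyarrow >= 6.0.0 (the base_dir, column names, and basename template below are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Illustrative table; in practice this is the filtered, deduplicated data.
table = pa.table({"date_id": [20210413, 20210413, 20210414],
                  "value": [1, 2, 3]})

# existing_data_behavior="delete_matching" first deletes any files already
# present in each partition directory being written to, so every date_id
# partition ends up containing only the newly written file(s).
ds.write_dataset(
    table,
    base_dir="final/dataset",
    format="parquet",
    partitioning=["date_id"],
    partitioning_flavor="hive",
    basename_template="part-{i}.parquet",  # deterministic file names
    existing_data_behavior="delete_matching",
)
{code}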

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> ------------------------------------------------------------------
>
>                 Key: ARROW-12365
>                 URL: https://issues.apache.org/jira/browse/ARROW-12365
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset, parquet, python
>             Fix For: 6.0.0
>
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a file within a partition will have a specific name. 
> My use case: I need to report on the final version of the data, and our visualization tool (Power BI) connects directly to our parquet files on Azure Blob.
> 1) Download data every hour based on an updated_at timestamp (this data is partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging" dataset (this holds every version of the data, so there can be many files within each partition; up to 24 files per date partition, since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the updated_at timestamp and drop duplicates on the unique constraint, then partition and save it with partition_filename_cb. In the example below, if I partition by the "date_id" column, then my dataset structure will be "/date_id=20210413/20210413.parquet"
> {code:python}
>         use_legacy_dataset=True,
>         partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
> {code}
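> For context, a fuller sketch of the legacy pq.write_to_dataset() call behind the snippet above (root_path and the sample table are illustrative):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Illustrative table; in practice this is the filtered, sorted, deduplicated data.
> table = pa.table({"date_id": [20210413, 20210414], "value": [10, 20]})
>
> # partition_filename_cb receives the tuple of partition key values, so each
> # date_id directory gets a single file named "<date_id>.parquet".
> pq.write_to_dataset(
>     table,
>     root_path="final/dataset",
>     partition_cols=["date_id"],
>     use_legacy_dataset=True,
>     partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
> )
> {code}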
> Ultimately, this guarantees that the final dataset has exactly one file per partition and that it contains only the latest version of each row, based on the maximum updated_at timestamp. My visualization tool can safely connect to and incrementally refresh from this dataset.
>  
>  
> An alternative solution would be to allow us to overwrite anything in an existing partition. I do not care about the file names so much as I want to ensure that I am fully replacing any data which might already exist in my partition, and I want to limit the number of physical files.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)