Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/16 18:23:17 UTC

[GitHub] [arrow] ldacey commented on pull request #7921: ARROW-9658: [Python] Python bindings for dataset writing

ldacey commented on pull request #7921:
URL: https://github.com/apache/arrow/pull/7921#issuecomment-693579613


   Do you think it would be possible to add support for repartitioning datasets? I am running into issues with many small files, simply because of how frequently I need to download data, and this is compounded by the partitioning.
   
   I asked this on Jira as well, but to summarize:
   
   1) I download data every 30 minutes from a source, saving each batch with a UUID parquet filename (each file just contains new or updated records since the last retrieval, so I could not think of a good filename callback to use instead). This comes to 48 parquet files per day.
   2) The data is then partitioned on created_date, which creates even more files (some of them quite small).
   3) When I query the dataset, I need to read in a lot of small files.
   
   I would then want to read the data back and repartition the files, ideally via a callback function, so that the dozens of files in the partition ("date", "==", "2020-09-15") become a single consolidated 2020-09-15.parquet file to keep things tidy. I know I can do this with Spark, but it would be nice to have a native pyarrow method.
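   
   To illustrate what I mean, here is a minimal sketch of how I consolidate one partition today with the existing dataset API (the paths and the "date" partition field are just placeholders for my layout, not anything from this PR):
   
   ```python
   import os
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Hypothetical paths / partition field -- adjust to the real dataset layout.
   src = "/data/source_dataset"
   out_dir = "/data/compacted/date=2020-09-15"
   os.makedirs(out_dir, exist_ok=True)
   
   # Read all of the small files for a single partition value into one table.
   dataset = ds.dataset(src, format="parquet", partitioning="hive")
   table = dataset.to_table(filter=ds.field("date") == "2020-09-15")
   
   # Write the partition back out as one consolidated file.
   pq.write_table(table, os.path.join(out_dir, "2020-09-15.parquet"))
   ```
   
   Having the dataset writer handle this kind of compaction natively (one output file per partition value) would remove the need for this per-partition loop on my side.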


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org