Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/21 11:51:22 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #34264: [Python] Control file size writing Parquet files

jorisvandenbossche commented on issue #34264:
URL: https://github.com/apache/arrow/issues/34264#issuecomment-1438345998

   Yes, the `ParquetWriter` interface is the low-level interface for writing _single_ files (so with it you need to handle this logic manually; see the sketch below), but the generic dataset writing functionality lets you control file size in _some_ way and thus automatically split your dataset into multiple files. However, this is based on the number of rows written, not the resulting file size. You can still use it if you can make a rough estimate of how many rows correspond to your target size.
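   
   For completeness, here is a minimal sketch of the manual approach with `ParquetWriter` (the output directory, file names and the 3000-row threshold are just placeholders picked for illustration):
   
   ```
   import os
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({"col": range(10000)})
   rows_per_file = 3000  # placeholder threshold; derive it from a size estimate
   os.makedirs("manual_split", exist_ok=True)
   
   # Slice the table into fixed-size chunks and write each chunk to its own file.
   for i, offset in enumerate(range(0, table.num_rows, rows_per_file)):
       chunk = table.slice(offset, rows_per_file)
       with pq.ParquetWriter(f"manual_split/part-{i}.parquet", table.schema) as writer:
           writer.write_table(chunk)
   ```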
   
   Here is what the dataset approach looks like:
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.dataset as ds
   >>> table = pa.table({"col": range(10000)})
   >>> ds.write_dataset(table, "test_split", format="parquet", max_rows_per_file=3000, max_rows_per_group=3000)
   ```
   
   ```
   $ ls test_split/
   part-0.parquet	part-1.parquet	part-2.parquet	part-3.parquet
   ```
   
   (I needed to specify `max_rows_per_group` as well, but only because I used a tiny example and that keyword's default is larger than 3000.)
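   
   If you want to target an approximate file _size_ rather than a row count, a rough way to get the estimate mentioned above is to write a small sample, measure its size on disk, and scale. A sketch (the 100 MB target and the 1000-row sample are arbitrary choices):
   
   ```
   import os
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   table = pa.table({"col": range(10000)})
   
   # Write a small sample and measure how many bytes a row takes on disk
   # (this includes the effect of encoding/compression).
   sample = table.slice(0, 1000)
   pq.write_table(sample, "sample.parquet")
   bytes_per_row = os.path.getsize("sample.parquet") / sample.num_rows
   
   # Translate an (arbitrary) 100 MB target into a row limit.
   target_bytes = 100 * 1024 * 1024
   max_rows = max(1, int(target_bytes / bytes_per_row))
   
   ds.write_dataset(
       table, "test_split_by_size", format="parquet",
       max_rows_per_file=max_rows,
       max_rows_per_group=min(max_rows, 1024 * 1024),
   )
   ```
   
   Keep in mind this is only approximate: encoding and compression ratios vary with the data, so the resulting files can deviate from the target.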
   

