Posted to issues@arrow.apache.org by "tooptoop4 (via GitHub)" <gi...@apache.org> on 2023/05/12 18:46:21 UTC

[GitHub] [arrow] tooptoop4 opened a new issue, #35572: python - write out single parquet file?

tooptoop4 opened a new issue, #35572:
URL: https://github.com/apache/arrow/issues/35572

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   If I have read many small parquet files from an S3 folder and want to write them out as a single large parquet file (i.e. 6 GB or 20 GB), is there any setting that guarantees a single output file? (I don't want it writing out multiple files, i.e. 6 x 1 GB; I need a single file.)
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tooptoop4 commented on issue #35572: python - write out single parquet file?

Posted by "tooptoop4 (via GitHub)" <gi...@apache.org>.
tooptoop4 commented on issue #35572:
URL: https://github.com/apache/arrow/issues/35572#issuecomment-1548761508

   I'm on v7; does `parquet.write_to_dataset` write a single file by default?




[GitHub] [arrow] westonpace commented on issue #35572: python - write out single parquet file?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35572:
URL: https://github.com/apache/arrow/issues/35572#issuecomment-1553002278

   I believe it will.  But you may need to set `use_legacy_dataset=False`
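
   A minimal sketch of what that looks like (assuming pyarrow is installed; the table contents and directory are made up for illustration). With no `partition_cols`, `write_to_dataset` emits a single part file into the target directory:

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": [1, 2, 3]})
   out_dir = tempfile.mkdtemp()

   # On pyarrow 7 you may also need to pass use_legacy_dataset=False to get
   # the new dataset-based writer; the flag was deprecated and removed in
   # later releases, where the new writer is the only path.
   pq.write_to_dataset(table, out_dir)

   # With no partitioning, a single .parquet part file is produced.
   parquet_files = [f for f in os.listdir(out_dir) if f.endswith(".parquet")]
   ```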




[GitHub] [arrow] westonpace commented on issue #35572: python - write out single parquet file?

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35572:
URL: https://github.com/apache/arrow/issues/35572#issuecomment-1548403517

   With the dataset writer you should be able to adjust `max_rows_per_file` and `max_rows_per_group` to get your desired behavior.  By default `max_rows_per_file=0` (no limit), so it should already behave as you expect:
   
   ```
   import pyarrow.dataset

   # Will create one file /tmp/my_dataset/part-0.parquet no matter how many
   # rows are in table1, table2, and table3.  Keep in mind that the first
   # argument can be any iterable of tables.
   pyarrow.dataset.write_dataset([table1, table2, table3], "/tmp/my_dataset", format="parquet")
   ```
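
   Applied to the original question, a hedged end-to-end sketch might look like this (local temporary directories stand in for the S3 folder, and the small input files are fabricated for illustration). Because `max_rows_per_file=0` means "no limit", everything lands in one `part-0.parquet` regardless of total size:

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   # Create a few small parquet files to act as the input folder.
   src = tempfile.mkdtemp()
   for i in range(3):
       pq.write_table(pa.table({"x": [i]}), os.path.join(src, f"small-{i}.parquet"))

   # Stream the whole folder through the dataset writer into one file.
   dest = tempfile.mkdtemp()
   ds.write_dataset(
       ds.dataset(src, format="parquet"),
       dest,
       format="parquet",
       max_rows_per_file=0,  # 0 = unlimited, so a single output file
   )

   out_files = os.listdir(dest)
   total_rows = pq.read_table(os.path.join(dest, out_files[0])).num_rows
   ```

   Passing a `Dataset` (rather than materialized tables) lets the writer stream batches, which matters when the combined output is in the multi-GB range.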

