Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/13 05:59:36 UTC

[GitHub] [arrow] westonpace commented on issue #13142: write_batch vs write_table of ParquetWriter

westonpace commented on issue #13142:
URL: https://github.com/apache/arrow/issues/13142#issuecomment-1125684483

   Use write_table if you have a table; use write_batch if you have a record batch.  They do the same thing.
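
   A minimal sketch of the two calls (the file names and data here are hypothetical, and this assumes a reasonably recent pyarrow):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Hypothetical data; any schema works the same way.
   table = pa.table({"x": [1, 2, 3]})
   batch = table.to_batches()[0]

   with pq.ParquetWriter("data_from_table.parquet", table.schema) as writer:
       writer.write_table(table)    # write a pyarrow.Table

   with pq.ParquetWriter("data_from_batch.parquet", batch.schema) as writer:
       writer.write_batch(batch)    # write a pyarrow.RecordBatch
   ```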
   
   Columns in a record batch are stored in contiguous buffers (e.g. an int32 array in a record batch will have one values buffer and one validity buffer).  Generally, when doing I/O, Arrow reads data one record batch at a time, and you often can't read anything smaller than a record batch.  So if you store one giant record batch you will need to read that entire batch back out all at once.
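
   To make the buffer layout concrete, here is a small illustration (the values are made up; Array.buffers() is only used here to peek at the underlying buffers):

   ```python
   import pyarrow as pa

   # A one-column record batch; the int32 column has a null, so it gets a validity bitmap.
   batch = pa.RecordBatch.from_pydict(
       {"x": pa.array([1, 2, None], type=pa.int32())})

   # The column is backed by two contiguous buffers:
   # a validity bitmap and a values buffer.
   validity, values = batch.column(0).buffers()
   print(validity.size, values.size)
   ```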
   
   A table contains multiple record batches.
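
   For example (a sketch with made-up data):

   ```python
   import pyarrow as pa

   b1 = pa.RecordBatch.from_pydict({"x": [1, 2]})
   b2 = pa.RecordBatch.from_pydict({"x": [3, 4]})

   table = pa.Table.from_batches([b1, b2])  # one table wrapping two batches
   print(table.to_batches())                # and you can get the batches back out
   ```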
   
   Both write_table and write_batch take an optional row_group_size which you can use to slice a large contiguous in-memory object into pieces as you write it.  Arrow's record batch is roughly analogous to parquet's "row group".  Arrow's table is roughly analogous to an entire parquet file.
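
   As a sketch (the size and file name are arbitrary), writing one large table with row_group_size produces a file with multiple row groups that can later be read back piece by piece:

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   big_table = pa.table({"x": list(range(1_000_000))})

   with pq.ParquetWriter("big.parquet", big_table.schema) as writer:
       # One contiguous in-memory table, sliced into ~100k-row parquet row groups on write.
       writer.write_table(big_table, row_group_size=100_000)

   print(pq.ParquetFile("big.parquet").num_row_groups)  # -> 10
   ```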
   
   > I would like to split the data and write to the same parquet file to save memory, should I use "write_table" or "write_batch"?
   > Also, what would be the best size after splitting?
   
   I'm not really sure what you mean by this.  Are you trying to write the data so it can be read back out a piece at a time?
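
   If the goal is just to keep memory bounded, one possible approach (the file name, chunk size, and the way the batches are produced are all hypothetical here) is to open a single ParquetWriter and feed it one batch at a time; each write then becomes a row group in the same file and can be read back independently:

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   schema = pa.schema([("x", pa.int64())])

   with pq.ParquetWriter("incremental.parquet", schema) as writer:
       for start in range(0, 1_000_000, 100_000):
           # In practice each batch would come from wherever the data is produced.
           batch = pa.RecordBatch.from_pydict(
               {"x": list(range(start, start + 100_000))}, schema=schema)
           writer.write_batch(batch)

   # Read it back a piece at a time instead of all at once.
   for batch in pq.ParquetFile("incremental.parquet").iter_batches():
       ...  # process each chunk
   ```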

