Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/15 17:38:00 UTC

[GitHub] [arrow] tachyonwill commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

tachyonwill commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r827248653



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       The default is 1024: https://github.com/apache/arrow/blob/3bf061783f4e1ab447d2eb0f487c0c4fce6d5b15/cpp/src/parquet/properties.h#L96
   
The way `data_page_size` and `write_batch_size` interact is: `write_batch_size` values are written, then the current page size is checked against `data_page_size`; if `data_page_size` is exceeded, a new page is started. Normally 1024 is fine for `write_batch_size`, but if the values are very large (e.g. big strings) or `data_page_size` is very small, a smaller `write_batch_size` is needed to keep the page sizes close to `data_page_size`.
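   
   For illustration, a minimal sketch of how the two settings could be used together, assuming the `write_batch_size` keyword is exposed on `pyarrow.parquet.write_table` as this PR proposes (`data_page_size` is an existing keyword; the file name and data sizes are made up for the example):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # 50,000 strings of ~1 KB each: with the default write_batch_size of 1024,
   # roughly 1 MB is written between page-size checks, so pages overshoot a
   # 64 KiB data_page_size target by a wide margin.
   table = pa.table({"blob": ["x" * 1_000] * 50_000})
   
   pq.write_table(
       table,
       "blobs.parquet",
       data_page_size=64 * 1024,  # target page size in bytes (~64 KiB)
       write_batch_size=64,       # check the page size every 64 values
   )
   ```
   
   Here each 64-value batch adds roughly 64 KB, so the size check fires about once per page and pages stay close to the `data_page_size` target instead of overshooting it by ~16x as they would with the 1024-value default.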



