Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/14 18:19:45 UTC

[GitHub] [beam] robertwb commented on a diff in pull request #23808: Args to set the max shard size in WriteToParquet

robertwb commented on code in PR #23808:
URL: https://github.com/apache/beam/pull/23808#discussion_r1021908251


##########
sdks/python/apache_beam/io/parquetio.py:
##########
@@ -448,6 +451,14 @@ def __init__(
         is '-SSSSS-of-NNNNN' if None is passed as the shard_name_template.
       mime_type: The MIME type to use for the produced files, if the filesystem
         supports specifying MIME types.
+      max_records_per_shard: Maximum number of records to write to any
+        individual shard.
+      max_bytes_per_shard: Target maximum number of bytes to write to any

Review Comment:
   It's impossible to always avoid exceeding the limit, as a single record may itself be larger than the bytes-per-shard limit. Even when that is not the case, some file formats have a footer/trailer (e.g. checksums, listings, indices, delimiters...) that would put a shard over the limit even if the written records were under it, so how closely this can be achieved really depends on the file format.
   
   The intent here is that a shard not be "too big", which is generally a flexible notion.
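
To make the two cases above concrete, here is a minimal sketch of a size-targeted shard writer. It is not Beam's actual implementation; `assign_shards`, `footer_bytes`, and the sizes used are hypothetical and chosen only for illustration.

```python
def assign_shards(record_sizes, max_bytes_per_shard, footer_bytes=0):
    """Group record sizes into shards, treating the byte limit as a target.

    Hypothetical helper for illustration only; returns the final size of
    each shard, including its footer.
    """
    shards = []
    current = 0
    has_records = False
    for size in record_sizes:
        # Roll over only if the shard already holds data; otherwise a single
        # oversized record could never be written anywhere.
        if has_records and current + size > max_bytes_per_shard:
            shards.append(current + footer_bytes)
            current = 0
            has_records = False
        current += size
        has_records = True
    if has_records:
        # The footer is added after the last record, on top of the records.
        shards.append(current + footer_bytes)
    return shards


sizes = assign_shards([40, 40, 150, 30, 65], max_bytes_per_shard=100, footer_bytes=8)
print(sizes)  # [88, 158, 103]
```

With a 100-byte target, the second shard ends up over the limit because one record is 150 bytes on its own, and the third ends up over it only because of the 8-byte footer, even though its records total 95 bytes.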


