Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/26 23:31:12 UTC

[GitHub] [iceberg] rdblue commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

rdblue commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-664054683


   We do this a bit differently: instead of rewriting everything that matches a filter, we configure the rewrite to primarily look for small files. Any file already near the target size is ignored, and then we bin pack the rest of the files in each partition and rewrite them. That ensures that we don't rewrite large amounts of data that are already reasonably sized. We only care about the small files; files that are too large are not a concern because they can be split.
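
   A rough sketch of the selection described above, not the actual RewriteDataFilesAction code. The `DataFile` record, the `plan_rewrites` function, and the `near_target_ratio` threshold are hypothetical names introduced for illustration, and first-fit decreasing is just one possible bin-packing strategy; the comment does not say which one is used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry; not the Iceberg DataFile API.
    path: str
    partition: str
    size_bytes: int


def plan_rewrites(files, target_size, near_target_ratio=0.75):
    """Return {partition: [bin, ...]}, each bin a list of small files to rewrite together.

    near_target_ratio is an assumed knob: files at or above
    target_size * near_target_ratio count as "already near the target size"
    and are left alone. Oversized files are skipped by the same check.
    """
    threshold = target_size * near_target_ratio

    # Keep only the small files, grouped by partition.
    small_by_partition = defaultdict(list)
    for f in files:
        if f.size_bytes < threshold:
            small_by_partition[f.partition].append(f)

    plan = {}
    for partition, small in small_by_partition.items():
        bins = []
        # First-fit decreasing: place each file in the first bin with room.
        for f in sorted(small, key=lambda x: x.size_bytes, reverse=True):
            for b in bins:
                if sum(x.size_bytes for x in b) + f.size_bytes <= target_size:
                    b.append(f)
                    break
            else:
                bins.append([f])
        # Rewriting a bin that holds a single file gains nothing, so drop it.
        bins = [b for b in bins if len(b) > 1]
        if bins:
            plan[partition] = bins
    return plan
```

   With a target of 100 bytes, files of 10 and 20 bytes in one partition would be grouped into a single rewrite, while a 100-byte file and a 500-byte file would be left untouched.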
   
   We also have an option to keep the files in order by file name for systems like Spark that sort the data. This is a hacky way to avoid ruining file pruning that takes advantage of clustered/sorted data.
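
   One way such an ordering option could work is sketched below; this is an assumption, since the comment does not describe the implementation. The function name `pack_in_name_order` is hypothetical. Files are sorted by name and packed consecutively, so each rewritten file covers a contiguous run of the sorted inputs instead of mixing files from different parts of the sort order.

```python
def pack_in_name_order(files, target_size):
    """Group (path, size_bytes) pairs into bins without reordering by size.

    Files are visited in lexicographic path order and a new bin is started
    whenever adding the next file would exceed target_size. This assumes
    file names sort in the same order as the data they contain
    (for example, zero-padded task numbers).
    """
    bins = []
    current, current_size = [], 0
    for path, size in sorted(files, key=lambda f: f[0]):
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bins.append(current)
    return bins
```

   The tradeoff is that the bins are usually less full than with a size-based packing, because a file is never moved out of sequence to fill a gap.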




