Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/04 15:20:29 UTC

[GitHub] [iceberg] davseitsev commented on issue #1159: Avoid rewriting big files in RewriteDataFilesAction

davseitsev commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-668660110


   @rdblue The logic you described doesn't work for me. I have a Spark Structured Streaming job that writes to a daily-partitioned table and produces files of different sizes, primarily small ones. I ran `RewriteDataFilesAction` a few times on a single partition.
   The first run produced 86 files of up to 128 MB each, as I configured. The second run took all of these files and compacted them again, merging them with very small files. The third run again took all the files produced by the previous run and compacted them again.
   
   I understand that eventually the compacted files will become so close to `targetSize` that they will be ignored, but until that happens we need to rewrite gigabytes of data again and again. It also doesn't work for me because the compaction process is relevant only for the current day. At the beginning of a new day we run a major compaction with deduplication, sorting, etc., and effectively "close" the partition for the previous day; it will not be modified anymore. Intermediate minor compaction is necessary only to prevent clients from reading thousands of small files.
   
   Filtering by row timestamp can improve the situation, but because we have many tables of completely different sizes, we would need to choose the right compaction period for each table to get output files of a reasonable size. That is really difficult to manage because table sizes can vary, new tables can be added, etc. We also have late records, which might not be considered for compaction if there are no fresh records in the file.
   
   It would be really nice to have a separate configuration option to limit file size, or a predicate like the one @JingsongLi suggested.
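
To make the request concrete, here is a minimal sketch of the selection logic I have in mind. It is plain Python rather than Iceberg code: the `DataFile` class is a stand-in for illustration, and `max_input_size_bytes` is a hypothetical option name, not an existing Iceberg setting. The idea is to exclude files at or above a size threshold before grouping the rest into rewrite tasks.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Stand-in for a data file entry (hypothetical, not the Iceberg interface)."""
    path: str
    size_bytes: int


def select_for_compaction(files: List[DataFile], max_input_size_bytes: int) -> List[DataFile]:
    """Keep only files strictly smaller than the threshold (hypothetical option)."""
    return [f for f in files if f.size_bytes < max_input_size_bytes]


def pack_into_bins(files: List[DataFile], target_size_bytes: int) -> List[List[DataFile]]:
    """Greedily group files, largest first, closing a bin before it would exceed the target."""
    bins: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for f in sorted(files, key=lambda f: f.size_bytes, reverse=True):
        if current and current_size + f.size_bytes > target_size_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        bins.append(current)
    return bins


# Two files left over from an earlier compaction plus three fresh small files.
files = [
    DataFile("a", 120 * MB),
    DataFile("b", 127 * MB),
    DataFile("c", 3 * MB),
    DataFile("d", 5 * MB),
    DataFile("e", 1 * MB),
]

# Assumed threshold of 100 MB: the already-compacted files "a" and "b" are skipped.
small = select_for_compaction(files, max_input_size_bytes=100 * MB)
bins = pack_into_bins(small, target_size_bytes=128 * MB)
```

With a threshold like this, a minor compaction would only touch the three small files (9 MB in total) instead of rewriting the two large files from the previous run as well.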
   

