Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/22 02:00:09 UTC

[GitHub] [iceberg] zhangjun0x01 edited a comment on pull request #1624: extract some common functions to iceberg-core

zhangjun0x01 edited a comment on pull request #1624:
URL: https://github.com/apache/iceberg/pull/1624#issuecomment-714168491


   Hi @openinx, I added a table property, 'REWRITE_SCAN_LIMIT', with a default value of 100M. In RewriteDataFilesAction, a file whose size is greater than this value is not scanned, which means it is not compacted.
   
   The reason is that targetSizeInBytes defaults to 128M. For example, suppose my Iceberg table has 3 data files of 120M each. When the rewrite runs, the program still scans these three files and regenerates 3 new files of the same size as the originals. If we periodically run a rewrite on an Iceberg table that is being written to in real time, these 120M files will be compacted over and over. I think this is unreasonable.
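   
   A minimal sketch of the filtering idea described above, in plain Java. The class and method names are hypothetical and are not the code in this PR or the Iceberg API; files are represented only by their sizes in bytes.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical illustration only; not the Iceberg API or the code in this PR.
   class RewriteScanLimitSketch {
     static final long MB = 1024L * 1024L;

     // Returns the sizes of the files that would still be scanned by a rewrite.
     // A file larger than scanLimitInBytes is skipped, as described in the comment.
     static List<Long> filesToRewrite(List<Long> fileSizesInBytes, long scanLimitInBytes) {
       List<Long> selected = new ArrayList<>();
       for (long size : fileSizesInBytes) {
         if (size <= scanLimitInBytes) {
           selected.add(size);
         }
       }
       return selected;
     }

     public static void main(String[] args) {
       // Three 120M files that are already close to the 128M target, plus one small file.
       List<Long> sizes = Arrays.asList(120 * MB, 120 * MB, 120 * MB, 10 * MB);
       List<Long> selected = filesToRewrite(sizes, 100 * MB);
       System.out.println("files selected for rewrite: " + selected.size());
     }
   }
   ```
   
   With a 100M limit, the three 120M files are left alone on every periodic run and only the small file is picked up.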
   
   In addition, when FileScanTasks are combined into a CombinedScanTask, it is difficult to ensure that the size of the scanned data is exactly targetSizeInBytes (default: 128M).
   
   See the org.apache.iceberg.util.BinPacking.Bin#canAdd method:
   ```
       boolean canAdd(long weight) {
         return binWeight + weight <= targetWeight;
       }
   ```
   
   In most cases, the actual scan size will be less than targetSizeInBytes, so I set the default for this limit to 100M rather than 128M.
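   
   To illustrate why packed bins usually end up below the target, here is a simplified sequential packer that applies the same condition as canAdd. It is only a sketch under that assumption: the class and method names are hypothetical, weights are given in M for readability, and the real BinPacking code may pack items differently.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical simplified packer; it only reuses the canAdd condition shown above.
   class BinPackingSketch {
     // Packs weights in order and returns the total weight of each resulting bin.
     // A new bin is started whenever binWeight + weight would exceed targetWeight.
     static List<Long> packedBinWeights(List<Long> weights, long targetWeight) {
       List<Long> bins = new ArrayList<>();
       long binWeight = 0;
       boolean binOpen = false;
       for (long weight : weights) {
         // Same test as !canAdd(weight): the item does not fit in the current bin.
         if (binOpen && binWeight + weight > targetWeight) {
           bins.add(binWeight);
           binWeight = 0;
         }
         // An item heavier than the target still gets a bin of its own.
         binWeight += weight;
         binOpen = true;
       }
       if (binOpen) {
         bins.add(binWeight);
       }
       return bins;
     }

     public static void main(String[] args) {
       // Five 50M files packed against a 128M target.
       List<Long> bins = packedBinWeights(Arrays.asList(50L, 50L, 50L, 50L, 50L), 128L);
       System.out.println(bins); // prints [100, 100, 50]
     }
   }
   ```
   
   Here every bin closes at 100M or less even though the target is 128M, because a third 50M file would push the bin over the target.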

