Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/14 12:29:10 UTC

[GitHub] [iceberg] openinx commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

openinx commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r708216986



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // Check the size of delete file as well to avoid unbalanced bin-packing
+    Function<FileScanTask, Long> weightFunc = file -> Math.max(
+        file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),

Review comment:
       Should we also account for the different join costs between delete files and data files?
   
   For equality delete files, the cost of joining against the data file is: `file.recordCount() * sum(eqDeleteFile.recordCount()) * avgRecordByteSize`.
   
   For positional delete files, the cost of joining against the data file is: `file.length() + sum(posFiles.fileSizeInBytes())`.
   
   The current approach seems to treat all delete files as positional delete files...
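
A rough sketch of how the two costs above could feed a single weight. Everything here is hypothetical: `TaskInfo`, `DeleteInfo` and `DeleteKind` are simplified stand-ins rather than Iceberg classes, `avgRecordByteSize` is assumed to be estimated by the caller, and adding the two costs together is only one possible way to combine them. Overflow of the equality term is not guarded.

```java
import java.util.Arrays;
import java.util.List;

class DeleteAwareWeight {
  // Hypothetical stand-ins for Iceberg's data/delete file metadata; not the real API.
  enum DeleteKind { POSITION, EQUALITY }

  static final class DeleteInfo {
    final DeleteKind kind;
    final long fileSizeInBytes;
    final long recordCount;

    DeleteInfo(DeleteKind kind, long fileSizeInBytes, long recordCount) {
      this.kind = kind;
      this.fileSizeInBytes = fileSizeInBytes;
      this.recordCount = recordCount;
    }
  }

  static final class TaskInfo {
    final long length;       // bytes of the data file split
    final long recordCount;  // records in the data file
    final List<DeleteInfo> deletes;

    TaskInfo(long length, long recordCount, List<DeleteInfo> deletes) {
      this.length = length;
      this.recordCount = recordCount;
      this.deletes = deletes;
    }
  }

  // Assumption: the positional and equality costs are simply added, then floored at openFileCost.
  static long weight(TaskInfo task, long avgRecordByteSize, long openFileCost) {
    long posDeleteBytes = 0L;
    long eqDeleteRecords = 0L;
    for (DeleteInfo delete : task.deletes) {
      if (delete.kind == DeleteKind.POSITION) {
        posDeleteBytes += delete.fileSizeInBytes;
      } else {
        eqDeleteRecords += delete.recordCount;
      }
    }
    // Positional deletes: file.length() + sum(posFiles.fileSizeInBytes())
    long posCost = task.length + posDeleteBytes;
    // Equality deletes: file.recordCount() * sum(eqDeleteFile.recordCount()) * avgRecordByteSize
    long eqCost = task.recordCount * eqDeleteRecords * avgRecordByteSize;
    return Math.max(posCost + eqCost, openFileCost);
  }

  public static void main(String[] args) {
    TaskInfo task = new TaskInfo(1000L, 10L, Arrays.asList(
        new DeleteInfo(DeleteKind.POSITION, 200L, 5L),
        new DeleteInfo(DeleteKind.EQUALITY, 50L, 3L)));
    // (1000 + 200) + (10 * 3 * 100) = 4200
    System.out.println(weight(task, 100L, 4L));
  }
}
```

With no equality deletes the equality term is zero, so the weight reduces to the data file length plus the positional delete sizes, which is what the diff currently computes for every delete file.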




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


