You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/04 19:12:45 UTC

[GitHub] [iceberg] WinkerDu opened a new pull request #3073: Core: Support items size setting in BinPacking iterator

WinkerDu opened a new pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073


   In our scene, We use V2Format to support streaming CDC row-level insert / delete. There could be tons of delete file for scan task of small file rewrite action or query scan, such as Spark Batch scan typically.
   
   The existing scan task bin-packing logic is only based on data file size, means a scan task contains data files whose total file size satisfies a given target size, this logic work fine for V1Format since the task only deals with data file.
   
   But for V2Format, scan task performance also consists of delete file applying cost. Suppose that bin-packing target size is 128MB, the data file could be small due to streaming CDC update, such as 1 MB, each data file tries to apply 128 valid eq-delete / pos-delete files. the total delete file applied number of 1 task could reach to 128 * 128 = 16,384, even we have enough CPU core to run these tasks, we could not boost scan performance for 1 task.
   
   This PR introduce a new configuration to specified items size of 1 bin during bin-packing iterating (default as Integer.MAX_VALUE). We could set this to control task scale to boost global scan performance to fully utilize computing resource.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706873908



##########
File path: core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseCombinedScanTask;
+import org.apache.iceberg.CombinedScanTask;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.junit.Assert;
+import org.junit.Test;
+import org.mockito.Mockito;
+
+public class TestTableScanUtil {
+
+  private List<FileScanTask> tasksWithDataAndDeleteSizes(List<Pair<Long, Long[]>> sizePairs) {
+    return sizePairs.stream().map(sizePair -> {
+      DataFile dataFile = dataFileWithSize(sizePair.first());
+      DeleteFile[] deleteFiles = deleteFilesWithSizes(
+          Arrays.stream(sizePair.second()).mapToLong(Long::longValue).toArray());
+      return new MockFileScanTask(dataFile, deleteFiles);
+    }).collect(Collectors.toList());
+  }
+
+  private DataFile dataFileWithSize(long size) {
+    DataFile mockFile = Mockito.mock(DataFile.class);
+    Mockito.when(mockFile.fileSizeInBytes()).thenReturn(size);
+    return mockFile;
+  }
+
+  private DeleteFile[] deleteFilesWithSizes(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  @Test
+  public void testPlanTaskWithDeleteFiles() {
+    List<FileScanTask> testFiles = tasksWithDataAndDeleteSizes(
+        Arrays.asList(
+            Pair.of(100L, new Long[] {20L, 20L}),
+            Pair.of(250L, new Long[0]),
+            Pair.of(100L, new Long[] {40L, 40L, 40L}),
+            Pair.of(510L, new Long[] {1L, 1L}),

Review comment:
       I don't think that these weights test what you want. Won't the results be the same either way since this weighs 510 without delete files and the next file weighs 4 without delete files? If you want to the new feature, I'd set this to 500 and add 3 delete files to the next file scan task. Then it weighs 16 with delete files and 4 without so you know that delete files are taken into account.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703629582



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -65,6 +65,7 @@
   private Expression filter;
   private long targetSizeInBytes;
   private int splitLookback;
+  private int itemsPerBin;

Review comment:
       Looks like this is only modifying the old rewrite path and not the new path. We should probably fix it in both places




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706489136



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -62,11 +62,6 @@ public ListPacker(long targetWeight, int lookback, boolean largestBinFirst) {
     private final Function<T, Long> weightFunc;
     private final boolean largestBinFirst;
 
-    public PackingIterable(Iterable<T> iterable, long targetWeight, int lookback,

Review comment:
       Please revert this change. Even if this isn't used, the class is public and we don't need to make unnecessary breaking changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706488102



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -234,5 +249,8 @@ private void validateOptions() {
     Preconditions.checkArgument(minInputFiles > 0,
         "Cannot set %s is less than 1. All values less than 1 have the same effect as 1. %d < 1",
         MIN_INPUT_FILES, minInputFiles);
+
+    Preconditions.checkArgument(openFileCost > 0,
+        "Cannot set %s is less than 1, %d < 1", RewriteDataFiles.OPEN_FILE_COST, openFileCost);

Review comment:
       This should be "Invalid file open cost: %d (not positive)"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913786246


   ![image](https://user-images.githubusercontent.com/11454734/132249412-6174fc2f-ae0d-47c3-96f6-f85ec7197b92.png)
   checked the ci log, not sure why it was cancelled. Can I retry this ci? @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703638638



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -169,19 +179,21 @@ public boolean hasNext() {
 
   private static class Bin<T> {
     private final long targetWeight;
+    private final int itemsPerBin;
     private final List<T> items = Lists.newArrayList();
     private long binWeight = 0L;
 
-    Bin(long targetWeight) {
+    Bin(long targetWeight, int itemsPerBin) {
       this.targetWeight = targetWeight;
+      this.itemsPerBin = itemsPerBin;
     }
 
     List<T> items() {
       return items;
     }
 
     boolean canAdd(long weight) {
-      return binWeight + weight <= targetWeight;
+      return binWeight + weight <= targetWeight && items.size() <= itemsPerBin;

Review comment:
       I think there is a slight issue with this implementation and how it interacts with lookback.
   
   Consider we have a lookback of 3 a max number of items per bin of 2 and a max value of bin of 10
   ```
   // Add Item (6)
   [ 8 ], [3, 4], [4, 3]
   // Even if remove largest is set largest bin is [8] but we actually have 2 bins we know we can never add another item to which we should be dropping first.
   
   I think to properly do this we need to also finish off any bins which have reached their max item size
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706490176



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -69,6 +73,26 @@ public Table table() {
     return Arrays.stream(sizes).mapToObj(size -> new MockFileScanTask(size * MB)).collect(Collectors.toList());
   }
 
+  private FileScanTask fileOfDataAndDeleteSize(long dataSize, long[] deleteSizes) {
+    DataFile dataFile = mockDataFile(dataSize);
+    DeleteFile[] deleteFiles = deleteFilesOfSize(deleteSizes);
+    return new MockFileScanTask(dataFile, deleteFiles);
+  }
+
+  private DeleteFile[] deleteFilesOfSize(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size * MB);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  private DataFile mockDataFile(long size) {

Review comment:
       For consistency, what about `dataFileWithSize`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706490049



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -69,6 +73,26 @@ public Table table() {
     return Arrays.stream(sizes).mapToObj(size -> new MockFileScanTask(size * MB)).collect(Collectors.toList());
   }
 
+  private FileScanTask fileOfDataAndDeleteSize(long dataSize, long[] deleteSizes) {
+    DataFile dataFile = mockDataFile(dataSize);
+    DeleteFile[] deleteFiles = deleteFilesOfSize(deleteSizes);
+    return new MockFileScanTask(dataFile, deleteFiles);
+  }
+
+  private DeleteFile[] deleteFilesOfSize(long... sizes) {

Review comment:
       What about `deleteFilesWithSizes`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915646119


   This doesn't seem like the right way to solve the problem that some splits are too expensive because there are too many delete files. A symptom of the problem is that it isn't obvious how to set the maximum number of items per bin.
   
   I think that the right approach to solve the problem is to modify the weight function that we use for splits, _not_ to modify the bin packing algorithm itself.
   
   A straightforward way to fix this problem is to make the weight function take into account how much of a penalty a delete file is. Right now, the weight function is this, in `TableScanUtil`:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
   ```
   
   That could easily be updated so that each delete file that must be opened is added to the open cost:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), (1 + file.deletes().size()) * openFileCost);
   ```
   
   Another option is to check the size of the delete data as well:
   
   ```java
       Function<FileScanTask, Long> weightFunc = file -> Math.max(
           file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
           (1 + file.deletes().size()) * openFileCost);
   ```
   
   Let's try those options instead of modifying bin packing. I think modifying the packing algorithm is the wrong direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703739079



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -65,6 +65,7 @@
   private Expression filter;
   private long targetSizeInBytes;
   private int splitLookback;
+  private int itemsPerBin;

Review comment:
       Yeah the new api extends off the interface rather than this base class. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706489927



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -69,6 +73,26 @@ public Table table() {
     return Arrays.stream(sizes).mapToObj(size -> new MockFileScanTask(size * MB)).collect(Collectors.toList());
   }
 
+  private FileScanTask fileOfDataAndDeleteSize(long dataSize, long[] deleteSizes) {

Review comment:
       Nit: I think this should be `taskWithDataAndDeleteSizes`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706489518



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // For V2Format, we should check the size of delete file as well to avoid unbalanced bin-packing

Review comment:
       No need to use personal pronouns. This should be simply "check the size ... to avoid ..."




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706546034



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -211,7 +220,13 @@ protected long writeMaxFileSize() {
   }
 
   private long sizeOfInputFiles(List<FileScanTask> group) {
-    return group.stream().mapToLong(FileScanTask::length).sum();
+    return group.stream().mapToLong(this::sizeOfInputFile).sum();
+  }
+
+  private long sizeOfInputFile(FileScanTask file) {

Review comment:
       I think the `file groups` here depends the rewrite job scale, it should take into account how much delete files cost in each rewrite job, otherwise there could be 'job-level' unbalanced issue.
   @rdblue @RussellSpitzer what do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r711825612



##########
File path: core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseCombinedScanTask;
+import org.apache.iceberg.CombinedScanTask;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.junit.Assert;
+import org.junit.Test;
+import org.mockito.Mockito;
+
+public class TestTableScanUtil {
+
+  private List<FileScanTask> tasksWithDataAndDeleteSizes(List<Pair<Long, Long[]>> sizePairs) {
+    return sizePairs.stream().map(sizePair -> {
+      DataFile dataFile = dataFileWithSize(sizePair.first());
+      DeleteFile[] deleteFiles = deleteFilesWithSizes(
+          Arrays.stream(sizePair.second()).mapToLong(Long::longValue).toArray());
+      return new MockFileScanTask(dataFile, deleteFiles);
+    }).collect(Collectors.toList());
+  }
+
+  private DataFile dataFileWithSize(long size) {
+    DataFile mockFile = Mockito.mock(DataFile.class);
+    Mockito.when(mockFile.fileSizeInBytes()).thenReturn(size);
+    return mockFile;
+  }
+
+  private DeleteFile[] deleteFilesWithSizes(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  @Test
+  public void testPlanTaskWithDeleteFiles() {
+    List<FileScanTask> testFiles = tasksWithDataAndDeleteSizes(
+        Arrays.asList(
+            Pair.of(100L, new Long[] {20L, 20L}),
+            Pair.of(250L, new Long[0]),
+            Pair.of(100L, new Long[] {40L, 40L, 40L}),

Review comment:
       This isn't a blocker, so I'll merge this when tests pass.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915355537


   I agree with @openinx that ideally we really should try to get a more V2-style algorithm here that is able to take into account deletes and such. But I don't have an issue with adding more parameters to bin-pack.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx closed pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
openinx closed pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706861105



##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -76,6 +76,12 @@
    */
   String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";
 
+  /**
+   * The estimated cost to open a file, used as a minimum weight when combining splits. By default this
+   * will use the "read.split.open-file-cost" value in the table properties of the table being updated.
+   */
+  String OPEN_FILE_COST = "open-file-cost";

Review comment:
       The other properties are `file-open-cost`, not `open-file-cost`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706874065



##########
File path: core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseCombinedScanTask;
+import org.apache.iceberg.CombinedScanTask;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.junit.Assert;
+import org.junit.Test;
+import org.mockito.Mockito;
+
+public class TestTableScanUtil {
+
+  private List<FileScanTask> tasksWithDataAndDeleteSizes(List<Pair<Long, Long[]>> sizePairs) {
+    return sizePairs.stream().map(sizePair -> {
+      DataFile dataFile = dataFileWithSize(sizePair.first());
+      DeleteFile[] deleteFiles = deleteFilesWithSizes(
+          Arrays.stream(sizePair.second()).mapToLong(Long::longValue).toArray());
+      return new MockFileScanTask(dataFile, deleteFiles);
+    }).collect(Collectors.toList());
+  }
+
+  private DataFile dataFileWithSize(long size) {
+    DataFile mockFile = Mockito.mock(DataFile.class);
+    Mockito.when(mockFile.fileSizeInBytes()).thenReturn(size);
+    return mockFile;
+  }
+
+  private DeleteFile[] deleteFilesWithSizes(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  @Test
+  public void testPlanTaskWithDeleteFiles() {
+    List<FileScanTask> testFiles = tasksWithDataAndDeleteSizes(
+        Arrays.asList(
+            Pair.of(100L, new Long[] {20L, 20L}),
+            Pair.of(250L, new Long[0]),
+            Pair.of(100L, new Long[] {40L, 40L, 40L}),

Review comment:
       I think you want at least one file scan task that weighs over 512 with delete files alone. One with 250 data and 3 delete files of 100 each would do it. Then that shouldn't get combined with anything if weighed properly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items size setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913158740


   @RussellSpitzer thanks for your replay. The name has changed to `items-per-bin`
   
   I think the number of data files and delete referenced per bin should be specified separately, `insert` operation only add data files for example. I am glad to deal with delete-files situation in following PR
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913786532


   I think this is the timeout issue me and @kbendick were discussing yesterday


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-923052661


   Merged. Thanks for updating this, @WinkerDu!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r710377459



##########
File path: core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseCombinedScanTask;
+import org.apache.iceberg.CombinedScanTask;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.junit.Assert;
+import org.junit.Test;
+import org.mockito.Mockito;
+
+public class TestTableScanUtil {
+
+  private List<FileScanTask> tasksWithDataAndDeleteSizes(List<Pair<Long, Long[]>> sizePairs) {
+    return sizePairs.stream().map(sizePair -> {
+      DataFile dataFile = dataFileWithSize(sizePair.first());
+      DeleteFile[] deleteFiles = deleteFilesWithSizes(
+          Arrays.stream(sizePair.second()).mapToLong(Long::longValue).toArray());
+      return new MockFileScanTask(dataFile, deleteFiles);
+    }).collect(Collectors.toList());
+  }
+
+  private DataFile dataFileWithSize(long size) {
+    DataFile mockFile = Mockito.mock(DataFile.class);
+    Mockito.when(mockFile.fileSizeInBytes()).thenReturn(size);
+    return mockFile;
+  }
+
+  private DeleteFile[] deleteFilesWithSizes(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  @Test
+  public void testPlanTaskWithDeleteFiles() {
+    List<FileScanTask> testFiles = tasksWithDataAndDeleteSizes(
+        Arrays.asList(
+            Pair.of(100L, new Long[] {20L, 20L}),
+            Pair.of(250L, new Long[0]),
+            Pair.of(100L, new Long[] {40L, 40L, 40L}),

Review comment:
       @rdblue 
   Sorry for late reply since I am busy these day.
   I've change the unit test case as following with open cost 50, lookback 3 and target weight 300: 
   ```
   group 1. data: 100, delete: 50, 100
   group 2. data: 50, delete: 1, 50
   group 3. data: 50, delete: 100
   group 4. data: 1, delete: 1, 1
   group 5. data: 75, delete 75
   ```
   plan tasks after bin-packing would be `{group 1}, {group 2, group 3}, {group 4, group 5}`
   1. group 1 total weight : 100 + 50 + 100 = 300
   2. group 2 and group 3 total weight: `(50 + 1 + 50) -> (50 + 50 + 50)` + 50 + 100 = 300
   3. group 3 and group 4 total weight: `(1 + 1 + 1) -> (50 + 50 + 50)` + 75 + 75 = 300




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items size setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913025303


   @rdblue @openinx could you please review this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913789529


   @RussellSpitzer OK, let's get the review done first :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706601353



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // For V2Format, we should check the size of delete file as well to avoid unbalanced bin-packing
+    Function<FileScanTask, Long> weightFunc = file -> Math.max(
+        file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
+        (1 + file.deletes().size()) * openFileCost);

Review comment:
       A new commit with related UT has pushed, please check @rdblue 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu edited a comment on pull request #3073: Core: Support items size setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu edited a comment on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913025303


   @rdblue @openinx @RussellSpitzer could you please review this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703685409



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -65,6 +65,7 @@
   private Expression filter;
   private long targetSizeInBytes;
   private int splitLookback;
+  private int itemsPerBin;

Review comment:
       Well, what new path you mean, `RewriteDataFiles` and related implements?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913321192


   I'll take a full pass on tuesday, I'm just getting settled back home but we have power and interenet so should be plenty of time then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706870079



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -211,7 +220,13 @@ protected long writeMaxFileSize() {
   }
 
   private long sizeOfInputFiles(List<FileScanTask> group) {
-    return group.stream().mapToLong(FileScanTask::length).sum();
+    return group.stream().mapToLong(this::sizeOfInputFile).sum();
+  }
+
+  private long sizeOfInputFile(FileScanTask file) {

Review comment:
       @rdblue OK, I revert the code change of `BinPackStrategy`, just introduce deletes based `weightFunc` in `TableScanUtil`, please have a further review, thank you




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706861023



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -211,7 +220,13 @@ protected long writeMaxFileSize() {
   }
 
   private long sizeOfInputFiles(List<FileScanTask> group) {
-    return group.stream().mapToLong(FileScanTask::length).sum();
+    return group.stream().mapToLong(this::sizeOfInputFile).sum();
+  }
+
+  private long sizeOfInputFile(FileScanTask file) {

Review comment:
       I don't think this should be based on deletes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-914997015


   @WinkerDu I definitely agreed the v2 bin-pack algorithm should be improved for v2 to consider the total size of insert & delete files.  I think the `iterms-per-bin` proposed from you team is trying to resolve the unbalanced issue,  but I'm concerning it's hard to set the correct `iterms-per-bin` value for a given table in real production environment,  because the `iterms-per-bin` is still controlling the data file's count.  We actually don't have a real suitable approach to evaluate the cost about joining the data file size & its delete records.  I think we need more accurate approach to decide which scan tasks should be dispatched to different tasks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703695049



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -169,19 +179,21 @@ public boolean hasNext() {
 
   private static class Bin<T> {
     private final long targetWeight;
+    private final int itemsPerBin;
     private final List<T> items = Lists.newArrayList();
     private long binWeight = 0L;
 
-    Bin(long targetWeight) {
+    Bin(long targetWeight, int itemsPerBin) {
       this.targetWeight = targetWeight;
+      this.itemsPerBin = itemsPerBin;
     }
 
     List<T> items() {
       return items;
     }
 
     boolean canAdd(long weight) {
-      return binWeight + weight <= targetWeight;
+      return binWeight + weight <= targetWeight && items.size() <= itemsPerBin;

Review comment:
       OK, I will push a new commit to do this with UT case




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706488919



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -211,7 +220,13 @@ protected long writeMaxFileSize() {
   }
 
   private long sizeOfInputFiles(List<FileScanTask> group) {
-    return group.stream().mapToLong(FileScanTask::length).sum();
+    return group.stream().mapToLong(this::sizeOfInputFile).sum();
+  }
+
+  private long sizeOfInputFile(FileScanTask file) {

Review comment:
       @RussellSpitzer, what do you think about this change to `BinPackStrategy`?
   
   I'm not sure that we want to make this change to the file rewrite. While delete files are relevant to how this will run, the bin packing here is not to balance work in this job, it is to pack data into files so that work is balanced in future jobs.
   
   I think it is probably a mistake to change this class.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706487848



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -211,7 +220,13 @@ protected long writeMaxFileSize() {
   }
 
   private long sizeOfInputFiles(List<FileScanTask> group) {
-    return group.stream().mapToLong(FileScanTask::length).sum();
+    return group.stream().mapToLong(this::sizeOfInputFile).sum();
+  }
+
+  private long sizeOfInputFile(FileScanTask file) {
+    // For V2Format, we should check the size of delete file as well to avoid unbalanced bin-packing
+    return Math.max(file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),

Review comment:
       Can you wrap this so that both arguments to `max` start on a new line?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703695049



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -169,19 +179,21 @@ public boolean hasNext() {
 
   private static class Bin<T> {
     private final long targetWeight;
+    private final int itemsPerBin;
     private final List<T> items = Lists.newArrayList();
     private long binWeight = 0L;
 
-    Bin(long targetWeight) {
+    Bin(long targetWeight, int itemsPerBin) {
       this.targetWeight = targetWeight;
+      this.itemsPerBin = itemsPerBin;
     }
 
     List<T> items() {
       return items;
     }
 
     boolean canAdd(long weight) {
-      return binWeight + weight <= targetWeight;
+      return binWeight + weight <= targetWeight && items.size() <= itemsPerBin;

Review comment:
       I think maybe it's reasonable that lookback is the only condition to trigger bin removal.
   If the bins [3, 4], [4, 3] finish off before lookback limit reached, the 'lookback' setting will be confused.
   On the other hand, if lookback is the only condition to remove bin, both 'lookback' and 'itemsPerBin' settings will work fine literally.
   what do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-915492512


   > @WinkerDu I definitely agreed the v2 bin-pack algorithm should be improved for v2 to consider the total size of insert & delete files. I think the `iterms-per-bin` proposed from you team is trying to resolve the unbalanced issue, but I'm concerning it's hard to set the correct `iterms-per-bin` value for a given table in real production environment, because the `iterms-per-bin` is still controlling the data file's count. We actually don't have a real suitable approach to evaluate the cost about joining the data file size & its delete records. I think we need more accurate approach to decide which scan tasks should be dispatched to different tasks.
   
   I need to spend some time looking closer at the test cases (and probably try this out on some V2 tables), but I share the concern that this config value might be really hard to determine in a production env. Especially for example Flink users who have CDC streams, often times databases will experience a burst of deletes / updates due to some cron schedule and will then have an outsized number of delete files for a period of time (assuming partitioned by time as well).
   
   Wondering how we would go about picking a good number (or how often one would need to set a non-standard number, or change the number for individual sections of the table).
   
   That said, I'm also not adverse to adding another argument to make bin packing more useful in the near-term while we figure out the best way to have a more "V2 native" algorithm / parameter set.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r704658095



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -36,51 +36,59 @@
   public static class ListPacker<T> {
     private final long targetWeight;
     private final int lookback;
+    private final int itemsPerBin;
     private final boolean largestBinFirst;
 
     public ListPacker(long targetWeight, int lookback, boolean largestBinFirst) {
+      this(targetWeight, lookback, largestBinFirst, Integer.MAX_VALUE);
+    }
+
+    public ListPacker(long targetWeight, int lookback, boolean largestBinFirst, int itemsPerBin) {
       this.targetWeight = targetWeight;
       this.lookback = lookback;
+      this.itemsPerBin = itemsPerBin;

Review comment:
       Nit: might want to move this one below `largestBinFirst` so they're in the same order as the parameters list.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-914950939


   Let's retry the travis CI ..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r708216986



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // Check the size of delete file as well to avoid unbalanced bin-packing
+    Function<FileScanTask, Long> weightFunc = file -> Math.max(
+        file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),

Review comment:
       Should also consider the cost when using different join algorithm between delete files and data files ? 
   
   For the equality files, the cost to join data file is :  `file.recordCount() * sum(eqDeleteFile.recordCount()) *  avgRecordByteSize)` .
   
   For the positional delete files, the cost to join data file is: `file.length() + sum(posFiles.fileSizeInBytes())`.
   
   The current approach sounds like we are treating all the delete files are positional delete files...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r711825378



##########
File path: core/src/test/java/org/apache/iceberg/util/TestTableScanUtil.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.util;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.BaseCombinedScanTask;
+import org.apache.iceberg.CombinedScanTask;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.junit.Assert;
+import org.junit.Test;
+import org.mockito.Mockito;
+
+public class TestTableScanUtil {
+
+  private List<FileScanTask> tasksWithDataAndDeleteSizes(List<Pair<Long, Long[]>> sizePairs) {
+    return sizePairs.stream().map(sizePair -> {
+      DataFile dataFile = dataFileWithSize(sizePair.first());
+      DeleteFile[] deleteFiles = deleteFilesWithSizes(
+          Arrays.stream(sizePair.second()).mapToLong(Long::longValue).toArray());
+      return new MockFileScanTask(dataFile, deleteFiles);
+    }).collect(Collectors.toList());
+  }
+
+  private DataFile dataFileWithSize(long size) {
+    DataFile mockFile = Mockito.mock(DataFile.class);
+    Mockito.when(mockFile.fileSizeInBytes()).thenReturn(size);
+    return mockFile;
+  }
+
+  private DeleteFile[] deleteFilesWithSizes(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> {
+      DeleteFile mockDeleteFile = Mockito.mock(DeleteFile.class);
+      Mockito.when(mockDeleteFile.fileSizeInBytes()).thenReturn(size);
+      return mockDeleteFile;
+    }).toArray(DeleteFile[]::new);
+  }
+
+  @Test
+  public void testPlanTaskWithDeleteFiles() {
+    List<FileScanTask> testFiles = tasksWithDataAndDeleteSizes(
+        Arrays.asList(
+            Pair.of(100L, new Long[] {20L, 20L}),
+            Pair.of(250L, new Long[0]),
+            Pair.of(100L, new Long[] {40L, 40L, 40L}),

Review comment:
       It is easier to read if you don't have everything add up to exactly the target weight. That way the reader doesn't need to think about edge cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r703638638



##########
File path: core/src/main/java/org/apache/iceberg/util/BinPacking.java
##########
@@ -169,19 +179,21 @@ public boolean hasNext() {
 
   private static class Bin<T> {
     private final long targetWeight;
+    private final int itemsPerBin;
     private final List<T> items = Lists.newArrayList();
     private long binWeight = 0L;
 
-    Bin(long targetWeight) {
+    Bin(long targetWeight, int itemsPerBin) {
       this.targetWeight = targetWeight;
+      this.itemsPerBin = itemsPerBin;
     }
 
     List<T> items() {
       return items;
     }
 
     boolean canAdd(long weight) {
-      return binWeight + weight <= targetWeight;
+      return binWeight + weight <= targetWeight && items.size() <= itemsPerBin;

Review comment:
       I think there is a slight issue with this implementation and how it interacts with lookback.
   
   Consider we have a lookback of 3 a max number of items per bin of 2 and a max value of bin of 10
   ```
   // Add Item (6)
   [ 8 ], [3, 4], [4, 3]
   // Neither of our completed bins [3, 4], and [4,3] will be removed regardless of whether removeLargest is set
   
   I think to properly do this we need to also finish off any bins which have reached their max item size
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-917192335


   @rdblue @RussellSpitzer @openinx @kbendick A new commit has been pushed, please have a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913298581


   @aokolnychyi @kbendick @jackye1995 please have a further review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r710365448



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // Check the size of delete file as well to avoid unbalanced bin-packing
+    Function<FileScanTask, Long> weightFunc = file -> Math.max(
+        file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),

Review comment:
       Thanks for your reply @openinx . Actually I don't think it's necessary to separate delete file to eq-delete and pos-delete when calculating cost. besides the algorithm here would be complicated, the top-level logic to apply delete files is 
   `applyEqDeletes(applyPosDeletes(records))`
   Generally I think the base logic above for eq-delete and pos-delete is the same.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913786742


   I wouldn't worry about doing a retry till we get the reviews done, if it passed java 11 (and it did) it just means it is an unrelated issue I think


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3073: Core: Enhance weightFunc of bin-packing to adapt to V2Format

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#discussion_r706490523



##########
File path: core/src/main/java/org/apache/iceberg/util/TableScanUtil.java
##########
@@ -56,7 +57,10 @@ public static boolean hasDeletes(FileScanTask task) {
     Preconditions.checkArgument(lookback > 0, "Invalid split planning lookback (negative or 0): %s", lookback);
     Preconditions.checkArgument(openFileCost >= 0, "Invalid file open cost (negative): %s", openFileCost);
 
-    Function<FileScanTask, Long> weightFunc = file -> Math.max(file.length(), openFileCost);
+    // For V2Format, we should check the size of delete file as well to avoid unbalanced bin-packing
+    Function<FileScanTask, Long> weightFunc = file -> Math.max(
+        file.length() + file.deletes().stream().mapToLong(ContentFile::fileSizeInBytes).sum(),
+        (1 + file.deletes().size()) * openFileCost);

Review comment:
       Looks like there is no test for this change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3073: Core: Support items size setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913054535


   I think this is a good feature to add but perhaps the name should be something along the lines of 
   `items-per-bin` or `max-items-per-bin` ? 
   
   Or maybe we could hit this use case directly and set a max number of `delete files` referenced per bin?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-913321192


   I'll take a full pass on tuesday, I'm just getting settled back home but we have power and internet so should be plenty of time then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] WinkerDu commented on pull request #3073: Core: Support items per bin setting in BinPacking iterator

Posted by GitBox <gi...@apache.org>.
WinkerDu commented on pull request #3073:
URL: https://github.com/apache/iceberg/pull/3073#issuecomment-916132094


   @openinx @kbendick @rdblue Thanks for your reply.
   I think `weightFunc` @rdblue suggests is a good option to solve the problem you mentioned.
   Will push a new commit later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org