You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/06 08:15:07 UTC

[GitHub] [iceberg] singhpk234 opened a new pull request #3676: Core : Dont consider groups for compaction if they have a single file…

singhpk234 opened a new pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676


   Closes #3236 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827315183



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed

Review comment:
       I would drop the comment and move the message into the assert like above




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826230903



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       I am thinking of a case like Group with (1 Data file with 1 delete) with deleteFileThreshold = 2
   3 as per my understanding would return false I beleive as deletes  = 1 which is < 2
   ```java
     private boolean taskHasTooManyDeletes(FileScanTask task) {
       return task.deletes() != null && task.deletes().size() >= deleteFileThreshold;
     }
   ```
   
   4 as per my understanding would return return true  as delete = 1 which >= 1
   ```java
     private boolean taskHasDeletes(FileScanTask task) {
       return task.deletes() != null && task.deletes().size() >= 1;
     }
   ```
   
   Appologies seems like I am missing something here, can you please elaborate more
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826093117



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,38 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // single file with size > targetSize should be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(4 * MB);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped2));
+
+    // single file with a delete should be planned, even when deletedFileThreshold is 2

Review comment:
       This is the only behavior I'm not sure of




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827313594



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.

Review comment:
       I think we are aiming to not add concluding punctuation on comments. So I would keep it consistent with your other comments in the file and drop the final period.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826292931



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       Why is this test changed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826319197



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       Hmmm this makes me think that maybe I was wrong to change that parameter. I think we want groups in in any of the following conditions
   
   1. Any group that has more than min_input_files
   2. Any group which will successfully write a brand new file closer to target file size
   3. Any group which contains a file larger than max_target file size
   4. Any group which has more deletes than the dele threshold
   
   I think 2 is sizeOfInput(files) > target file size and numFiles > 1 
   I think 3 is sizeOfInput(files) > maxTargetFileSize




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826302669



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       when I updated the sizeOfInputFiles to be >  `maxFileSize`, it started breaking as default value of `maxFileSize` is 1.8 * (default value of targetFileSize), where `targetFileSize` defaults to `512 MB` which makes maxFileSize now to be `921.6 MB`.
   
   We were planning 2 groups with 900 MB sizeOfInputFiles size which now started to fail.. hence made groups of 1200 MB now to restore the assertions of `testMaxGroupSize`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826302669



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       when I updated the sizeOfInputFiles to be >  `maxFileSize`, it started breaking as default value of `maxFileSize` is 1.8 * (default value of targetFileSize) which makes maxFileSize now to be `921.6 MB`.
   
   We were planning 2 groups with 900 MB sizeOfInputFiles size which now started to fail.. hence made groups of 1200 MB now to restore the assertions of `testMaxGroupSize`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826302669



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       when I updated the sizeOfInputFiles to be >  `maxFileSize`, it started breaking as default value is 1.8 * (default value of targetFileSize) which makes maxFileSize now to be `921.6 MB`.
   
   We were planning 2 groups with 900 MB sizeOfInputFiles size which now started to fail.. hence made groups of 1200 MB now to restore the assertions of `testMaxGroupSize`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#issuecomment-1068500603


   Thanks @singhpk234!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826089667



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       Why do we need this check? Shouldn't "taskHasToManyDeletes" be the catch here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826230903



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       I am thinking of a case like Group with (1 Data file with 1 delete) with deleteFileThreshold = 2
   3 as per my understanding would return false I beleive as deletes  = 1 which is < 2
   ```java
     private boolean taskHasTooManyDeletes(FileScanTask task) {
       return task.deletes() != null && task.deletes().size() >= deleteFileThreshold;
     }
   ```
   
   4 as per my understanding would return return true  as delete = 1 which >= 1
   ```java
     private boolean taskHasDeletes(FileScanTask task) {
       return task.deletes() != null && task.deletes().size() >= 1;
     }
   ```
   
   Apologies seems like I am missing something here, can you please elaborate more
   
   Are you suggesting we don't require 4 ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826352994



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       Yep for 1) we can add in files > 1, 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826178425



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,38 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // single file with size > targetSize should be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(4 * MB);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped2));
+
+    // single file with a delete should be planned, even when deletedFileThreshold is 2

Review comment:
       was trying to handle this :
   > A group with a single file that is modified by deletes should be rewritten (if the rewrite deletes filter is on)
   
   here in previous scenario if minInputFiles was 1, even it didn't pass the deleteFileThreshold we would have planned it, I thought by the line above if there are groups with 1 file effected by deletes we should attempt to re-write it.. 
   
   Apologies, Am I missing something here ?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827316179



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);

Review comment:
       I'm not sure how this is different than the first test? Shouldn't this be multiple files less than target file size and less than minInputFiles?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827356760



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);

Review comment:
       Thanks, you are correct these two are exactly the same .
   
   Added test case where 
   - multiple files less than target file size and less than minInputFiles will not be planned 
   - multiple files greater than target file size and less than minInputFiles will be planned 

##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);

Review comment:
       Thanks, you are correct these two are exactly the same
   
   Added test case where 
   - multiple files less than target file size and less than minInputFiles will not be planned 
   - multiple files greater than target file size and less than minInputFiles will be planned 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827314942



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +

Review comment:
       Is the comment above needed if we have this message? I think we can drop the comment since you explain the test here in the same words. Minor suggestion on message:
   
   "Should plan 0 groups, 1 file is too small but no deletes are present so rewriting is a NOOP"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827316644



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 0 group", 0, Iterables.size(grouped2));
+
+    // group with single file with size > maxFileSize should be planed

Review comment:
       Move comment message into the assert




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827317198



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 0 group", 0, Iterables.size(grouped2));
+
+    // group with single file with size > maxFileSize should be planed
+    Iterable<FileScanTask> testFiles3 = filesOfSize(4);
+    Iterable<List<FileScanTask>> grouped3 = strategy.planFileGroups(testFiles3);
+
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped3));

Review comment:
       "Should plan 1 group because the file present is larger than maxFileSize and can be split"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826089667



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       Why do we need this check? Shouldn't "taskHasToManyDeletes" be sufficent?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826194666



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       But you've changed the code here, we have
   1. More than one file in the group and minInputFiles statisfied
   2. A group is larger than the maxFile size (needs to be split)
   3. A group (regardless of number of files) has too many deletes
   4. A group (if it has 1 file) has deletes
   
   Doesn't 3 also cover 4?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826324631



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       For 1 if we just allow `Any group that has more than min_input_files` then when min_input_files = 1 we will plan it irrespective if the sizes were already within threshold, is it ok doing I believe this was the situation we were trying to avoid.
   
   +1 on 2, 3 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827318058



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 0 group", 0, Iterables.size(grouped2));
+
+    // group with single file with size > maxFileSize should be planed
+    Iterable<FileScanTask> testFiles3 = filesOfSize(4);
+    Iterable<List<FileScanTask>> grouped3 = strategy.planFileGroups(testFiles3);
+
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped3));
+
+    // group with single file and with deletes >= deleteThreshold should be planned
+    List<FileScanTask> testFiles4 = Lists.newArrayList();
+    testFiles4.add(MockFileScanTask.mockTaskWithDeletes(MB, 2));
+    Iterable<List<FileScanTask>> grouped4 = strategy.planFileGroups(testFiles4);
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped4));

Review comment:
       Should plan 1 group, the data file has delete files and can be re-written without deleted rows.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer merged pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer merged pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#issuecomment-1068704598


   Thanks @RussellSpitzer @rajarshisarkar :) !!!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rajarshisarkar commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
rajarshisarkar commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r825888334



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       In a file group, shouldn't we rewrite a single file if it is very large?

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -245,6 +248,10 @@ private boolean taskHasTooManyDeletes(FileScanTask task) {
     return task.deletes() != null && task.deletes().size() >= deleteFileThreshold;
   }
 
+  private boolean taskHasDeletes(FileScanTask task) {
+    return task.deletes() != null && task.deletes().size() >= 1;

Review comment:
       You can use `return task.deletes() != null && !task.deletes().isEmpty();`

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -245,6 +248,10 @@ private boolean taskHasTooManyDeletes(FileScanTask task) {
     return task.deletes() != null && task.deletes().size() >= deleteFileThreshold;
   }
 
+  private boolean taskHasDeletes(FileScanTask task) {
+    return task.deletes() != null && task.deletes().size() >= 1;

Review comment:
       You can use `return task.deletes() != null && !task.deletes().isEmpty();`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826185680



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       was trying to handle this in [comment](https://github.com/apache/iceberg/issues/3236#issuecomment-989504872):
   > A group with a single file that is modified by deletes should be rewritten (if the rewrite deletes filter is on)
   
   here in previous scenario if minInputFiles was 1 and group.size() = 1, even if the group didn't pass the deleteFileThreshold (checked in taskHasToManyDeletes) we would have planned it, I thought by the line above if there are groups with 1 file effected by deletes we should attempt to plan it, hence skipped delete threshold check.
   Am I missing something here ?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826264586



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,38 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // single file with size > targetSize should be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(4 * MB);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 1 group", 1, Iterables.size(grouped2));
+
+    // single file with a delete should be planned, even when deletedFileThreshold is 2

Review comment:
       removed based on discussion above.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826324631



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -187,10 +212,10 @@ public void testGroupingWithDeletes() {
   @Test
   public void testMaxGroupSize() {
     RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
-        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1000 * MB)
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, Long.toString(1300 * MB)

Review comment:
       For 1 if we just allow `Any group that has more than min_input_files` then when min_input_files = 1 we will plan it irrespective if the sizes were already within threshold .
   
   +1 on 2, 3 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826248294



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       Yes I don't see how 4 is different than "3" where deleteFileThreshold is 1




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r827316387



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestBinPackStrategy.java
##########
@@ -138,6 +138,43 @@ public void testGroupingMinInputFilesInvalid() {
         0, Iterables.size(grouped));
   }
 
+  @Test
+  public void testGroupingMinInputFilesAsOne() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+            BinPackStrategy.MIN_INPUT_FILES, Integer.toString(1),
+            BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(3 * MB),
+            RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(2 * MB),
+            BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(MB),
+            BinPackStrategy.DELETE_FILE_THRESHOLD, Integer.toString(2)
+    ));
+
+    // single file within a group, with size within threshold and no deletes should be considered as NOOP.
+    Iterable<FileScanTask> testFiles1 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped1 = strategy.planFileGroups(testFiles1);
+
+    Assert.assertEquals("Should plan 0 group, as 1 minInputFile with size within threshold and no deletes should be " +
+            "treated as NOOP",
+            0, Iterables.size(grouped1));
+
+    // group with single file with size > targetFileSize should not be planed
+    Iterable<FileScanTask> testFiles2 = filesOfSize(1);
+    Iterable<List<FileScanTask>> grouped2 = strategy.planFileGroups(testFiles2);
+
+    Assert.assertEquals("Should plan 0 group", 0, Iterables.size(grouped2));

Review comment:
       Should plan 0 groups with "the test explanation"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rajarshisarkar commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
rajarshisarkar commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r825888334



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||
+                group.stream().anyMatch(this::taskHasTooManyDeletes) ||
+                (group.stream().anyMatch(this::taskHasDeletes) && group.size() == 1)

Review comment:
       In a file group, shouldn't we rewrite a single file if it is very large?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#discussion_r826087288



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -155,9 +155,12 @@ public RewriteStrategy options(Map<String, String> options) {
   public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
     ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
     List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+
     return potentialGroups.stream().filter(group ->
-      group.size() >= minInputFiles || sizeOfInputFiles(group) > targetFileSize ||
-              group.stream().anyMatch(this::taskHasTooManyDeletes)
+            (group.size() >= minInputFiles && group.size() > 1) ||
+                sizeOfInputFiles(group) > targetFileSize ||

Review comment:
       I think this would be easier to understand if was > maxFileSize although realistically at this point in the code it could only get to this point if the size was larger than maxFileSize.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] singhpk234 commented on pull request #3676: Core : Dont consider groups for compaction if they have a single file…

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on pull request #3676:
URL: https://github.com/apache/iceberg/pull/3676#issuecomment-1065852178


   cc @RussellSpitzer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org