You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/06/29 21:36:02 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #2760: Core: Refactors to BinPack and Rewrite APIs

RussellSpitzer opened a new pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760


   Changes to support the implementation of the Spark RewriteDatafileAciton. Fixes JavaDoc links,
   adds in BinPackStrategy comments to better binpack real world usecases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661652100



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {
+      // Round down and distribute remainder amongst other files
+      return fileCountWithoutRemainder;
+    } else {
+      // Keep the remainder file
+      return fileCountWithRemainder;
+    }
+  }
+
+  /**
+   * Returns the smaller of our max write file threshold, and our estimated split size based on
+   * the number of output files we want to generate.
+   */
+  protected long splitSize(long totalSizeInBytes) {
+    long estimatedSplitSize = totalSizeInBytes / numOutputFiles(totalSizeInBytes);
+    return Math.min(estimatedSplitSize, writeMaxFileSize());
+  }
+
+  protected static long inputFileSize(List<FileScanTask> fileToRewrite) {

Review comment:
       No reason, i'll remove it




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661033048



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {

Review comment:
       This logic looks correct to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661034280



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {
+      // Round down and distribute remainder amongst other files
+      return fileCountWithoutRemainder;
+    } else {
+      // Keep the remainder file
+      return fileCountWithRemainder;
+    }
+  }
+
+  /**
+   * Returns the smaller of our max write file threshold, and our estimated split size based on
+   * the number of output files we want to generate.
+   */
+  protected long splitSize(long totalSizeInBytes) {
+    long estimatedSplitSize = totalSizeInBytes / numOutputFiles(totalSizeInBytes);
+    return Math.min(estimatedSplitSize, writeMaxFileSize());
+  }
+
+  protected static long inputFileSize(List<FileScanTask> fileToRewrite) {

Review comment:
       This seems the only one static method. Is that on purpose?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661791251



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {

Review comment:
       `Math.min(1.1 * targetFileSize, maxWriteSize)` seems reasonable to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661034085



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {
+      // Round down and distribute remainder amongst other files
+      return fileCountWithoutRemainder;
+    } else {
+      // Keep the remainder file
+      return fileCountWithRemainder;
+    }
+  }
+
+  /**
+   * Returns the smaller of our max write file threshold, and our estimated split size based on
+   * the number of output files we want to generate.
+   */
+  protected long splitSize(long totalSizeInBytes) {
+    long estimatedSplitSize = totalSizeInBytes / numOutputFiles(totalSizeInBytes);
+    return Math.min(estimatedSplitSize, writeMaxFileSize());
+  }
+
+  protected static long inputFileSize(List<FileScanTask> fileToRewrite) {
+    return fileToRewrite.stream().mapToLong(FileScanTask::length).sum();
+  }
+

Review comment:
       nit: extra line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661657295



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {

Review comment:
       it isn't here, i'm not sure how we want to settle this. We could do say that if the avgFileSizeWithoutRemainder is < maxFileSize? I think we discussed that previously




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661661131



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {

Review comment:
       ```java
       long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
       long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
       if (avgFileSizeWithoutRemainder < Math.min(1.1 * targetFileSize, writeMaxFileSize())) {
         // Round down and distribute remainder amongst other files
         return fileCountWithoutRemainder;
       } else {
         // Keep the remainder file
         return fileCountWithRemainder;
       }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661034018



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {

Review comment:
       Am I correct that `numOutputFiles` handles the case when `1.1 * targetFileSize` is larger that the max value?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#discussion_r661660204



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -137,6 +139,78 @@ public RewriteStrategy options(Map<String, String> options) {
     ).collect(Collectors.toList());
   }
 
+  protected long targetFileSize() {
+    return this.targetFileSize;
+  }
+
+  /**
+   * Determine how many output files to create when rewriting. We use this to determine the split-size
+   * we want to use when actually writing files to avoid the following situation.
+   * <p>
+   * If we are writing 10.1 G of data with a target file size of 1G we would end up with
+   * 11 files, one of which would only have 0.1g. This would most likely be less preferable to
+   * 10 files each of which was 1.01g. So here we decide whether to round up or round down
+   * based on what the estimated average file size will be if we ignore the remainder (0.1g). If
+   * the new file size is less than 10% greater than the target file size then we will round down
+   * when determining the number of output files.
+   * @param totalSizeInBytes total data size for a file group
+   * @return the number of files this strategy should create
+   */
+  protected long numOutputFiles(long totalSizeInBytes) {
+    if (totalSizeInBytes < targetFileSize) {
+      return 1;
+    }
+
+    long fileCountWithRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.CEILING);
+    if (LongMath.mod(totalSizeInBytes, targetFileSize) > minFileSize) {
+      // Our Remainder file is of valid size for this compaction so keep it
+      return fileCountWithRemainder;
+    }
+
+    long fileCountWithoutRemainder = LongMath.divide(totalSizeInBytes, targetFileSize, RoundingMode.FLOOR);
+    long avgFileSizeWithoutRemainder = totalSizeInBytes / fileCountWithoutRemainder;
+    if (avgFileSizeWithoutRemainder < 1.1 * targetFileSize) {

Review comment:
       I think we are getting into edge cases here, but maybe we could do Math.min(1.1 * targetFileSize, maxWriteSize)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2760: Core: Refactors to BinPack and Rewrite APIs

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2760:
URL: https://github.com/apache/iceberg/pull/2760#issuecomment-871713991


   https://github.com/apache/iceberg/pull/2770 - Moving to this PR because I accidentally made this branch inside the apache repo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org