You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/12 02:33:01 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #2585: Adds abstract BinPackStrategyClass

RussellSpitzer opened a new pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585


   Adds a Rewrite Strategy for bin-packing. This strategy uses size
   to determine whether or not files should be included to be rewritten
   and groups files based on a bin-packing alorithm.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631350919



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       Is `num` implied? Files is the unit here, so I think we can probably drop it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631867707



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",
+        MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize);
+
+    Preconditions.checkArgument(targetFileSize > minFileSize,
+        "Cannot set %s greater than %s, all files written will be smaller than the threshold, %d >= %d",
+        MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize);
+
+    Preconditions.checkArgument(targetFileSize < maxFileSize,
+        "Cannot set %s is greater than %s, all files written will be larger than the threshold, %d >= %d",
+        MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize);
+
+    Preconditions.checkArgument(minInputFiles > 0,
+        "Cannot set %s is less than 1. All values less than 1 have the same effect as 1. %d < 1",
+        MIN_INPUT_FILES, minInputFiles);
+
+    Preconditions.checkArgument(minOutputFiles > 0,
+        "Cannot set %s to less than 1. All values less than 1 have the same effect as 1. %d < 1",
+        MIN_OUTPUT_FILES, minOutputFiles);
+
+    Preconditions.checkArgument(targetFileSize > 0,

Review comment:
       Sure, I dropped this condition and added in a requirement that minFile size be greater than or equal to 0




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631516013



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",

Review comment:
       I think these were confusing because they were all inverted, I switched them all to precondition(positive predicate) instead of if (negative predicate)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631420237



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;

Review comment:
       nit: I think we should group `public static final` variables together and then all the private variables, based on what the other files are doing right now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r632000569



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite

Review comment:
       Thanks for this description. I think it clarifies the behavior.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631365077



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       I think we should be fine with absolute values as long as we validate they make sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631428598



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       Actually, I did not notice `org.apache.iceberg.actions.rewrite` is used as the package path for `RewriteStrategy` in the API module. We should probably keep those things consistent, if we think placing `BinPackStrategy` in `actions` is preferred, then I think we should also change the path for the API.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631419472



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       +1 for placing it in `actions`, currently it only has the action and result classes, but I think it is reasonable enough to have this under that path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631494413



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);

Review comment:
       Question. Do we see any possibility for the task size to be different than the underlying data file size? Like including the size of the delete files or something? Is it too paranoid? We could unwrap the underlying file.

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use a non-positive number for %s. %d <= 0",
+              RewriteDataFiles.TARGET_FILE_SIZE_BYTES, targetFileSize)

Review comment:
       nit: just `TARGET_FILE_SIZE_BYTES`?

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite

Review comment:
       nit: `datafiles` -> `data files`

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",

Review comment:
       Greater than or equal? Or simply invert arguments.
   
   

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(

Review comment:
       nit: `Preconditions.checkArgument`?

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {

Review comment:
       Is 0 allowed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631997865



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",

Review comment:
       Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631034143



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       These are constants for end users to use, so they should be public. In the discussion on the RewriteStrategy we decided they were bin-pack specific since the other "sort" strategy will probably be filtering files based on clustering rather than the number of files that are a specific size




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r630690164



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       why do these constants need to be public? alternatively, should these options be defined in the `RewrtieStrategy`? are they specific to `BinPackStrategy`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631423118



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+

Review comment:
       Based on the conversations and clarifications needed below, I think we need a section of documentation here to summarize how the configurations work together with all the ANDs and ORs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631403880



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       Target is used for 2 purposes, defaults and the target size for rewriting files




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631754224



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",
+        MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize);
+
+    Preconditions.checkArgument(targetFileSize > minFileSize,
+        "Cannot set %s greater than %s, all files written will be smaller than the threshold, %d >= %d",
+        MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize);
+
+    Preconditions.checkArgument(targetFileSize < maxFileSize,
+        "Cannot set %s is greater than %s, all files written will be larger than the threshold, %d >= %d",
+        MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize);
+
+    Preconditions.checkArgument(minInputFiles > 0,
+        "Cannot set %s is less than 1. All values less than 1 have the same effect as 1. %d < 1",
+        MIN_INPUT_FILES, minInputFiles);
+
+    Preconditions.checkArgument(minOutputFiles > 0,
+        "Cannot set %s to less than 1. All values less than 1 have the same effect as 1. %d < 1",
+        MIN_OUTPUT_FILES, minOutputFiles);
+
+    Preconditions.checkArgument(targetFileSize > 0,

Review comment:
       Can we allow setting `minFileSize` to less than zero? We could save this check if not. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#issuecomment-840772473


   Thanks, @RussellSpitzer! Thanks for reviewing, @jackye1995 @rdblue @stevenzwu @chenjunjiedada!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631998199



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use a non-positive number for %s. %d <= 0",
+              RewriteDataFiles.TARGET_FILE_SIZE_BYTES, targetFileSize)

Review comment:
       I missed it is from another class. That's fine then.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631514299



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(

Review comment:
       I knew I was forgetting something




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631394983



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       sure, I think it's fine to keep them in the same package if you like, I think it just kept things a little neater since these aren't actions themselves




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631955958



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(minFileSize >= 0,
+        "Cannot set %s to a negative number, %d < 0",
+        MIN_FILE_SIZE_BYTES, minFileSize);
+
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",

Review comment:
       Wait, can they be equal actually?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631427798



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,

Review comment:
       I think all values are ok here, although anything less than 0 would behave the same as 1, I can prohibit negatives?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631996219



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       Resolving this as it has been discussed and addressed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#issuecomment-840025913


   cc @yyanyy @jackye1995 too who reviewed the original change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#issuecomment-839391150


   @chenjunjiedada + @openinx Could you please take a look as well? And of course anyone else who would like to contribute, I really value the comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631996616



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       The new validation makes sense to me. I am going to close this one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631996780



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,

Review comment:
       Makes sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#issuecomment-840735796


   Looks ready to go. Let's wait for tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631385801



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       I also think we can drop `num`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631354527



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       Ratios are harder for people to reason about so min/max are generally better. If we are using min/max, then are we ignoring target?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#issuecomment-840772481


   Thanks everybody!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r632002157



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;

Review comment:
       This has been resolved.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631958270



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(minFileSize >= 0,
+        "Cannot set %s to a negative number, %d < 0",
+        MIN_FILE_SIZE_BYTES, minFileSize);
+
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",

Review comment:
       I think technically there really isn't any value in allowing minfile == maxfile == targetfile




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631401387



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;

Review comment:
       They are currently AND I can add a note




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] stevenzwu commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
stevenzwu commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r630694444



##########
File path: core/src/test/java/org/apache/iceberg/actions/rewrite/TestBinPackStrategy.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableTestBase;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestBinPackStrategy extends TableTestBase {
+
+  @Parameterized.Parameters(name = "formatVersion = {0}")
+  public static Object[] parameters() {
+    return new Object[] {2}; // We don't actually use the format version since everything is mock
+  }
+
+  private static final long MB = 1024 * 1024;
+
+  public TestBinPackStrategy(int formatVersion) {
+    super(formatVersion);
+  }
+
+  class TestBinPackStrategyImpl extends BinPackStrategy {
+
+    @Override
+    public Table table() {
+      return table;
+    }
+
+    @Override
+    public List<DataFile> rewriteFiles(List<FileScanTask> filesToRewrite) {
+      throw new UnsupportedOperationException();
+    }
+  }
+
+  private List<FileScanTask> filesOfSize(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> new MockFileScanTask(size * MB)).collect(Collectors.toList());
+  }
+
+  private RewriteStrategy defaultBinPack() {
+    return new TestBinPackStrategyImpl().options(Collections.emptyMap());
+  }
+
+  @Test
+  public void testFilteringAllValid() {
+    RewriteStrategy strategy = defaultBinPack();
+    Iterable<FileScanTask> testFiles = filesOfSize(100, 100, 100, 100, 1000);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("No files should be removed from the set", testFiles, filtered);
+  }
+
+  @Test
+  public void testFilteringRemoveInvalid() {
+    RewriteStrategy strategy = defaultBinPack();
+    Iterable<FileScanTask> testFiles = filesOfSize(500, 500, 500, 600, 600);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("All files should be removed from the set", Collections.emptyList(), filtered);
+  }
+
+  @Test
+  public void testFilteringCustomMinMaxFileSize() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(550 * MB),
+        BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(490 * MB)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(500, 500, 480, 480, 560, 520);
+    Iterable<FileScanTask> expectedFiles = filesOfSize(480, 480, 560);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("Should remove files that exceed or are smaller than new bounds", expectedFiles, filtered);
+  }
+
+  @Test
+  public void testGroupingMinInputFilesInvalid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_INPUT_FILES, Integer.toString(5)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(1, 1, 1, 1);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 0 groups, not enough input files",
+        0, Iterables.size(grouped));
+  }
+
+  @Test
+  public void testGroupingMinInputFilesValid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_INPUT_FILES, Integer.toString(5)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(1, 1, 1, 1, 1);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 1 groups since there are enough input files",
+        ImmutableList.of(testFiles), grouped);
+  }
+
+  @Test
+  public void testGroupingMinOutputFilesInvalid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_OUTPUT_FILES, Integer.toString(3)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(200, 200, 200, 200, 200);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 1 groups since there would be 2 output files",

Review comment:
       is the error message correct?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631323302



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       I am not sure `rewrite` package is very descriptive. I'd either put directly into `actions` as we don't have separate packages for other actions or try to come up with a more descriptive name as we have rewrite manifests, rewrite data files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631364322



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       Yes, we are ignoring the target size while picking files to compact and use min/max sizes instead. However, we still have to validate the passed min/max values make sense as we will use the target file size for writing new files. This validation will prevent weird behavior when users update the target file size but forget to update the optimize jobs with new min/max values.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631944821



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";

Review comment:
       nit: same here, `{@link MIN_OUTPUT_FILES}`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r632003090



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {

Review comment:
       Don't think so. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631427297



##########
File path: core/src/test/java/org/apache/iceberg/MockFileScanTask.java
##########
@@ -32,4 +32,28 @@ public MockFileScanTask(long length) {
   public long length() {
     return length;
   }
+
+  @Override
+  public String toString() {
+    return "Mock Scan Task Size: " + length;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    }
+    if (o == null || getClass() != o.getClass()) {
+      return false;
+    }
+
+    MockFileScanTask that = (MockFileScanTask) o;
+

Review comment:
       nit: extra space




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631504964



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {

Review comment:
       no? Should it be?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631944303



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both

Review comment:
       nit: I think the doc wasn't updated. You could refer to it as `{@link #MIN_OUTPUT_FILES}` to make sure we catch all changes. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631514155



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    if (minFileSize >= maxFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, %d >= %d",
+              MIN_FILE_SIZE_BYTES, MAX_FILE_SIZE_BYTES, minFileSize, maxFileSize));
+    }
+    if (minFileSize > targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be smaller than the threshold, %d >= %d",
+              MIN_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, minFileSize, targetFileSize));
+    }
+    if (maxFileSize < targetFileSize) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is greater than %s, " +
+                  "all files written will be larger than the threshold, %d >= %d",
+              MAX_FILE_SIZE_BYTES, RewriteDataFiles.TARGET_FILE_SIZE_BYTES, maxFileSize, targetFileSize));
+    }
+    if (minInputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_INPUT_FILES, minInputFiles)
+      );
+    }
+    if (minOutputFiles < 1) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use bin pack when %s is less than 1. All values less than 1" +
+                  "have the same effect as 1. %d < 1",
+              MIN_OUTPUT_FILES, minOutputFiles)
+      );
+    }
+    if (targetFileSize <= 0) {
+      throw new IllegalArgumentException(
+          String.format("Cannot use a non-positive number for %s. %d <= 0",
+              RewriteDataFiles.TARGET_FILE_SIZE_BYTES, targetFileSize)

Review comment:
       It's from the Action not the Strategy, I can do the static import?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r632002268



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+

Review comment:
       This has been resolved.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631364322



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       Yes, we are ignoring the target size while picking files to compact and use min/max sizes instead. However, we still have to validate the passed min/max values make sense as we will use the target file size for writing new files. This validation will prevent weird behavior when users update the target file size but does not update the optimize jobs.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631957997



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(minFileSize >= 0,
+        "Cannot set %s to a negative number, %d < 0",
+        MIN_FILE_SIZE_BYTES, minFileSize);
+
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",

Review comment:
       This is the "error" condition which is the inverse of the "valid condition"
   Min File is greater than or equal to Max file size 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631951293



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);

Review comment:
       I was worried we could add the size of deletes files. I checked the doc for this method and I don't think that's going to happen. I think we are good here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631944821



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";

Review comment:
       nit: same here, `{@link MIN_INPUT_FILES}`.

##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";

Review comment:
       nit: same here, `{@link MIN_INPUT_FILES}` in the doc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631955640



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(minFileSize >= 0,
+        "Cannot set %s to a negative number, %d < 0",
+        MIN_FILE_SIZE_BYTES, minFileSize);
+
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",

Review comment:
       nit: `greater than` -> `greater than or equal to`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631301202



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,

Review comment:
       These could be validated as well.

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,

Review comment:
       Should we add validation that min/max sizes make sense compared to the target file size?

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;

Review comment:
       It is a bit unclear how min and max files options work together. Given the names, I'd interpret that them as AND. If I have 10 small files to compact files but don't have 2 output files (suppose the user configures 2), the compaction won't start?
   
   Is it AND or OR?

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return StreamSupport.stream(dataFiles.spliterator(), false)

Review comment:
       Will it be helpful to expose a predicate here instead that would be run by the action? My worry is that this creates an intermediate filtered list. We could use `FluentIterable` and `filter` or something in the action to avoid creating another list before the tasks are passed to `planFileGroups`.

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";

Review comment:
       typo: `...-file` -> `...-files`

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+

Review comment:
       nit: extra line

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       I think it should be helpful to expose these. The order of fields is a bit unusual, though. I'd probably move non-static vars below and group them together. 

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return StreamSupport.stream(dataFiles.spliterator(), false)
+        .filter(scanTask ->
+          scanTask.length() < minFileSize || scanTask.length() > maxFileSize
+        ).collect(Collectors.toList());
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minNumOutputFiles && group.size() >= minNumInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(scanTask -> scanTask.length()).sum();
+    long numOutputFiles = (long) Math.ceil((double) groupSize / targetFileSize);

Review comment:
       nit: just return?

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return StreamSupport.stream(dataFiles.spliterator(), false)
+        .filter(scanTask ->
+          scanTask.length() < minFileSize || scanTask.length() > maxFileSize
+        ).collect(Collectors.toList());
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minNumOutputFiles && group.size() >= minNumInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(scanTask -> scanTask.length()).sum();

Review comment:
       nit: `FileScanTask::length`

##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";

Review comment:
       Having absolute values may be easier in some cases but we have to be really careful with validation. As the target file size is subject to change, people will have to update their optimize jobs too. I think folks will frequently forget to do so and we have to prevent invalid configs. Having ratios here would be a bit more portable but absolute values are probably easier to interpret. I am 50/50 here.
   
   @rdblue, I remember you had some concerns about using ratios. Could you elaborate a bit?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r632003402



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for data files which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);
+  }
+
+  @Override
+  public Iterable<List<FileScanTask>> planFileGroups(Iterable<FileScanTask> dataFiles) {
+    ListPacker<FileScanTask> packer = new BinPacking.ListPacker<>(maxGroupSize, 1, false);
+    List<List<FileScanTask>> potentialGroups = packer.pack(dataFiles, FileScanTask::length);
+    return potentialGroups.stream().filter(group ->
+        numOutputFiles(group) >= minOutputFiles && group.size() >= minInputFiles
+    ).collect(Collectors.toList());
+  }
+
+  private long numOutputFiles(List<FileScanTask> group) {
+    long groupSize = group.stream().mapToLong(FileScanTask::length).sum();
+    return (long) Math.ceil((double) groupSize / targetFileSize);
+  }
+
+  private void validateOptions() {
+    Preconditions.checkArgument(minFileSize >= 0,
+        "Cannot set %s to a negative number, %d < 0",
+        MIN_FILE_SIZE_BYTES, minFileSize);
+
+    Preconditions.checkArgument(maxFileSize > minFileSize,
+        "Cannot set %s greater than %s, %d >= %d",

Review comment:
       Agree




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631035003



##########
File path: core/src/test/java/org/apache/iceberg/actions/rewrite/TestBinPackStrategy.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableTestBase;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestBinPackStrategy extends TableTestBase {
+
+  @Parameterized.Parameters(name = "formatVersion = {0}")
+  public static Object[] parameters() {
+    return new Object[] {2}; // We don't actually use the format version since everything is mock
+  }
+
+  private static final long MB = 1024 * 1024;
+
+  public TestBinPackStrategy(int formatVersion) {
+    super(formatVersion);
+  }
+
+  class TestBinPackStrategyImpl extends BinPackStrategy {
+
+    @Override
+    public Table table() {
+      return table;
+    }
+
+    @Override
+    public List<DataFile> rewriteFiles(List<FileScanTask> filesToRewrite) {
+      throw new UnsupportedOperationException();
+    }
+  }
+
+  private List<FileScanTask> filesOfSize(long... sizes) {
+    return Arrays.stream(sizes).mapToObj(size -> new MockFileScanTask(size * MB)).collect(Collectors.toList());
+  }
+
+  private RewriteStrategy defaultBinPack() {
+    return new TestBinPackStrategyImpl().options(Collections.emptyMap());
+  }
+
+  @Test
+  public void testFilteringAllValid() {
+    RewriteStrategy strategy = defaultBinPack();
+    Iterable<FileScanTask> testFiles = filesOfSize(100, 100, 100, 100, 1000);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("No files should be removed from the set", testFiles, filtered);
+  }
+
+  @Test
+  public void testFilteringRemoveInvalid() {
+    RewriteStrategy strategy = defaultBinPack();
+    Iterable<FileScanTask> testFiles = filesOfSize(500, 500, 500, 600, 600);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("All files should be removed from the set", Collections.emptyList(), filtered);
+  }
+
+  @Test
+  public void testFilteringCustomMinMaxFileSize() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MAX_FILE_SIZE_BYTES, Long.toString(550 * MB),
+        BinPackStrategy.MIN_FILE_SIZE_BYTES, Long.toString(490 * MB)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(500, 500, 480, 480, 560, 520);
+    Iterable<FileScanTask> expectedFiles = filesOfSize(480, 480, 560);
+
+    Iterable<FileScanTask> filtered = strategy.selectFilesToRewrite(testFiles);
+
+    Assert.assertEquals("Should remove files that exceed or are smaller than new bounds", expectedFiles, filtered);
+  }
+
+  @Test
+  public void testGroupingMinInputFilesInvalid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_INPUT_FILES, Integer.toString(5)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(1, 1, 1, 1);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 0 groups, not enough input files",
+        0, Iterables.size(grouped));
+  }
+
+  @Test
+  public void testGroupingMinInputFilesValid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_INPUT_FILES, Integer.toString(5)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(1, 1, 1, 1, 1);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 1 groups since there are enough input files",
+        ImmutableList.of(testFiles), grouped);
+  }
+
+  @Test
+  public void testGroupingMinOutputFilesInvalid() {
+    RewriteStrategy strategy = defaultBinPack().options(ImmutableMap.of(
+        BinPackStrategy.MIN_NUM_OUTPUT_FILES, Integer.toString(3)
+    ));
+
+    Iterable<FileScanTask> testFiles = filesOfSize(200, 200, 200, 200, 200);
+
+    Iterable<List<FileScanTask>> grouped = strategy.planFileGroups(testFiles);
+
+    Assert.assertEquals("Should plan 1 groups since there would be 2 output files",

Review comment:
       Nope, I copy pasted that, should be empty in this message.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631997643



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       This has been fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631396732



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";

Review comment:
       Sure, I was using this ordering because it was easier for me to see the properties and related variables but I can swap it to all the statics first




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631515637



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+/**
+ * A rewrite strategy for datafiles which determines which files to rewrite
+ * based on their size. If files are either smaller than the min-file-size-bytes threshold or
+ * larger than the max-file-size-bytes threshold, they are considered targets for being rewritten.
+ * <p>
+ * Once filtered files are grouped based on a {@link BinPacking} into groups defined
+ * by max group size. Groups will be considered for rewriting if they contain more files
+ * than min-input-files and would produce more files than min-output-files.
+ */
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting. This is considered in conjunction with min-num-output-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_INPUT_FILES = "min-input-files";
+  public static final int MIN_INPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten. This is considered in conjunction with min-num-input-files, both
+   * conditions must pass to consider a group of files to be rewritten.
+   */
+  public static final String MIN_OUTPUT_FILES = "min-output-files";
+  public static final int MIN_OUTPUT_FILES_DEFAULT = 1;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * min-file-size-bytes will be considered for rewriting. This functions independently
+   * of max-file-size-bytes.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * max-file-size-bytes will be considered for rewriting. This functions independently
+   * of min-file-size-bytes.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+
+  private int minInputFiles;
+  private int minOutputFiles;
+  private long minFileSize;
+  private long maxFileSize;
+  private long targetFileSize;
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_INPUT_FILES,
+        MIN_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_OUTPUT_FILES,
+        MIN_OUTPUT_FILES_DEFAULT);
+
+    minInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_INPUT_FILES,
+        MIN_INPUT_FILES_DEFAULT);
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return FluentIterable.from(dataFiles)
+        .filter(scanTask -> scanTask.length() < minFileSize || scanTask.length() > maxFileSize);

Review comment:
       Currently they are synonymous, BaseFileScanTask just calls file.length. I'm not super worried about this just yet. I think this won't be an issue in the future issue since you can't determine how many records are deleted and the new scan task size without actually reading the file so I don't think that info will be there?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631429814



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;

Review comment:
       Yep moving both files




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631434185



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return StreamSupport.stream(dataFiles.spliterator(), false)

Review comment:
       Not sure if there is any use case that only uses the predicate, but agree that even if we don't change the interface, at least in this implementation we can just use `FluentIterable.from(dataFiles).filter(..)` instead of creating a list.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2585: Adds abstract BinPackStrategyClass

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2585:
URL: https://github.com/apache/iceberg/pull/2585#discussion_r631997437



##########
File path: core/src/main/java/org/apache/iceberg/actions/rewrite/BinPackStrategy.java
##########
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions.rewrite;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.StreamSupport;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Strategy;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.BinPacking;
+import org.apache.iceberg.util.BinPacking.ListPacker;
+import org.apache.iceberg.util.PropertyUtil;
+
+abstract class BinPackStrategy implements RewriteStrategy {
+
+  /**
+   * Minimum number of files that need to be in a file group to be considered
+   * for rewriting.
+   */
+  public static final String MIN_NUM_INPUT_FILES = "min-num-input-files";
+  public static final int MIN_NUM_INPUT_FILES_DEFAULT = 1;
+  private int minNumInputFiles;
+
+  /**
+   * Minimum number of files we want to be created by file group when being
+   * rewritten.
+   */
+  public static final String MIN_NUM_OUTPUT_FILES = "min-num-output-file";
+  public static final int MIN_NUM_OUTPUT_FILES_DEFAULT = 1;
+  private int minNumOutputFiles;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files smaller than
+   * MIN_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 75% of the target file size
+   */
+  public static final String MIN_FILE_SIZE_BYTES = "min-file-size-bytes";
+  public static final double MIN_FILE_SIZE_DEFAULT_RATIO = 0.75d;
+  private long minFileSize;
+
+  /**
+   * Adjusts files which will be considered for rewriting. Files larger than
+   * MAX_FILE_SIZE_BYTES will be considered for rewriting.
+   * <p>
+   * Defaults to 180% of the target file size
+   */
+  public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes";
+  public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d;
+  private long maxFileSize;
+
+  private long targetFileSize;
+
+  private long maxGroupSize;
+
+  @Override
+  public String name() {
+    return Strategy.BINPACK.name();
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.of(
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_FILE_SIZE_BYTES,
+        MAX_FILE_SIZE_BYTES
+    );
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    targetFileSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
+        PropertyUtil.propertyAsLong(
+            table().properties(),
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES,
+            TableProperties.WRITE_TARGET_FILE_SIZE_BYTES_DEFAULT));
+
+    minFileSize = PropertyUtil.propertyAsLong(options,
+        MIN_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MIN_FILE_SIZE_DEFAULT_RATIO));
+
+    maxFileSize = PropertyUtil.propertyAsLong(options,
+        MAX_FILE_SIZE_BYTES,
+        (long) (targetFileSize * MAX_FILE_SIZE_DEFAULT_RATIO));
+
+    maxGroupSize = PropertyUtil.propertyAsLong(options,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES,
+        RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES_DEFAULT);
+
+    minNumOutputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_OUTPUT_FILES,
+        MIN_NUM_OUTPUT_FILES_DEFAULT);
+
+    minNumInputFiles = PropertyUtil.propertyAsInt(options,
+        MIN_NUM_INPUT_FILES,
+        MIN_NUM_INPUT_FILES_DEFAULT);
+
+    return this;
+  }
+
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    return StreamSupport.stream(dataFiles.spliterator(), false)

Review comment:
       Using `FluentIterable` here makes sense to me. We can rethink it later since we don't expose the strategy.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org