You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/30 03:01:15 UTC

[GitHub] [iceberg] yyanyy commented on a change in pull request #3207: Spark: extend RewriteDataFiles action to merge deletes

yyanyy commented on a change in pull request #3207:
URL: https://github.com/apache/iceberg/pull/3207#discussion_r719020923



##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -60,7 +60,9 @@
    * {@link #MIN_FILE_SIZE_BYTES} will be considered for rewriting. This functions independently
    * of {@link #MAX_FILE_SIZE_BYTES}.
    * <p>
-   * Defaults to 75% of the target file size
+   * Defaults to 75% of the target file size when {@link RewriteDataFiles#REMOVE_PARTITION_DELETES} and

Review comment:
       I wonder if it would be possible to still respect the threshold when taking deletes into consideration, that if we have delete files that can apply to this file, even if its size is within the threshold and does not need to be compacted, we will still rewrite this file. This requires us to know the relationship between data and delete files beforehand though. 
   

##########
File path: spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java
##########
@@ -326,13 +333,35 @@ private void validateAndInitOptions() {
         PARTIAL_PROGRESS_ENABLED,
         PARTIAL_PROGRESS_ENABLED_DEFAULT);
 
+    removePartitionDeletes = PropertyUtil.propertyAsBoolean(options(),
+        REMOVE_PARTITION_DELETES,
+        REMOVE_PARTITION_DELETES_DEFAULT);
+
+    removeGlobalDeletes = PropertyUtil.propertyAsBoolean(options(),
+        REMOVE_GLOBAL_DELETES,
+        REMOVE_GLOBAL_DELETES_DEFAULT);
+
     Preconditions.checkArgument(maxConcurrentFileGroupRewrites >= 1,
         "Cannot set %s to %s, the value must be positive.",
         MAX_CONCURRENT_FILE_GROUP_REWRITES, maxConcurrentFileGroupRewrites);
 
     Preconditions.checkArgument(!partialProgressEnabled || partialProgressEnabled && maxCommits > 0,
         "Cannot set %s to %s, the value must be positive when %s is true",
         PARTIAL_PROGRESS_MAX_COMMITS, maxCommits, PARTIAL_PROGRESS_ENABLED);
+
+    Preconditions.checkArgument(removePartitionDeletes || !removeGlobalDeletes,

Review comment:
       Ideally it would be great if we can have a global delete file -> partition delete files compaction mechanism for partitioned tables on this so that in here we can direct people to do that first. Also it might be possible for people to set it like this if the table is unpartitioned? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org