You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "szehon-ho (via GitHub)" <gi...@apache.org> on 2023/05/19 00:27:29 UTC

[GitHub] [iceberg] szehon-ho opened a new pull request, #7651: Core: Compacted position delete files should use the max data sequence number of source files

szehon-ho opened a new pull request, #7651:
URL: https://github.com/apache/iceberg/pull/7651

   Fixes: #7624


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho merged pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho merged PR #7651:
URL: https://github.com/apache/iceberg/pull/7651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199492293


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   Okay, let's stick with a safer option. Let me review the rest of the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199416919


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   Right now, we don't have a way to set the sequence number in the public API. I think that's a good thing because we don't want people to think that this is something they can generally control. So I do prefer setting this here.
   
   However, for the implementation in `BaseRewriteDataFiles` I think it would make sense to set the sequence number and avoid needing the `DeleteFileHolder`. Not a big deal either way, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199495995


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.

Review Comment:
   Can we add a note explaining that when rewriting position delete files, the sequence number of the new file must be the max sequence number of the original files? We may want to note that rewriting equality deletes that belong to different data sequence numbers is not allowed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho closed pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho closed pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files
URL: https://github.com/apache/iceberg/pull/7651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199490813


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   > Right now, we don't have a way to set the sequence number in the public API.
   
   That's can be easily changed by extending `TableMetadata$Builder`. I am not a big fan of `DeleteFileHolder` and I do think places that actually set data sequence numbers (like rewrite position deletes) would benefit from passing it as part of delete file object. That said, I am concerned about edge cases when the passed object may contain a wrong data sequence number. Are there use cases we can think of? Like cherry-picks maybe?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199219656


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   If we consume it directly from the file, it would mean this would apply for appends too. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199491515


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   Yeah, cherry picking seems like a case where we'd have the wrong sequence number. The good news is that we don't allow cherry picking unless the commit is an append or a dynamic partition overwrite. Otherwise we lose too much information.
   
   I'm also a bit concerned about letting sequence number on the file be used. I'd probably opt to ignore it and have an explicit sequence number in the API like this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199501306


##########
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesGroup.java:
##########
@@ -38,12 +38,18 @@
 public class RewritePositionDeletesGroup {
   private final FileGroupInfo info;
   private final List<PositionDeletesScanTask> tasks;
+  private final long maxRewrittenDataSequenceNumber;
 
   private Set<DeleteFile> addedDeleteFiles = Collections.emptySet();
 
   public RewritePositionDeletesGroup(FileGroupInfo info, List<PositionDeletesScanTask> tasks) {
     this.info = info;
     this.tasks = tasks;
+    this.maxRewrittenDataSequenceNumber =
+        tasks.stream()
+            .map(t -> t.file().dataSequenceNumber())

Review Comment:
   Optional: You could use `mapToLong(...).max().getAsLong()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199219656


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   If we consume it directly from the file, it would mean this would apply for appends too, which I am not sure a bad thing. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199500163


##########
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesCommitManager.java:
##########
@@ -55,12 +55,13 @@ public void commit(Set<RewritePositionDeletesGroup> fileGroups) {
     RewriteFiles rewriteFiles = table.newRewrite().validateFromSnapshot(startingSnapshotId);
 
     for (RewritePositionDeletesGroup group : fileGroups) {
+      long maxSequenceNumber = group.maxRewrittenDataSequenceNumber();

Review Comment:
   What about either calling this `maxRewrittenDataSequenceNumber` and adding an empty line before the for loop below or using `group.maxRewrittenDataSequenceNumber()` directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#issuecomment-1555860100

   Test fail looks unrelated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199232563


##########
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesGroup.java:
##########
@@ -38,12 +38,18 @@
 public class RewritePositionDeletesGroup {
   private final FileGroupInfo info;
   private final List<PositionDeletesScanTask> tasks;
+  private final long maxRewrittenDataSequenceNumber;
 
   private Set<DeleteFile> addedDeleteFiles = Collections.emptySet();
 
   public RewritePositionDeletesGroup(FileGroupInfo info, List<PositionDeletesScanTask> tasks) {
     this.info = info;
     this.tasks = tasks;
+    this.maxRewrittenDataSequenceNumber =
+        tasks.stream()
+            .map(t -> t.file().dataSequenceNumber())
+            .max(Long::compare)
+            .orElseThrow(() -> new IllegalArgumentException("Empty file group"));

Review Comment:
   Nit: Could we just Preconditions.checkArgument the passed in task list?



##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java:
##########
@@ -739,4 +789,43 @@ private void checkResult(
         size(newDeletes),
         result.rewriteResults().stream().mapToLong(FileGroupRewriteResult::addedBytesCount).sum());
   }
+
+  private void checkSequenceNumbers(
+      Table table, List<DeleteFile> rewrittenDeletes, List<DeleteFile> addedDeletes) {
+    StructLikeMap<List<DeleteFile>> rewrittenFilesPerPartition =
+        groupPerPartition(table, rewrittenDeletes);
+    StructLikeMap<List<DeleteFile>> addedFilesPerPartition = groupPerPartition(table, addedDeletes);
+    for (StructLike partition : rewrittenFilesPerPartition.keySet()) {
+      Long maxRewrittenSeq =
+          rewrittenFilesPerPartition.get(partition).stream()
+              .map(ContentFile::dataSequenceNumber)
+              .max(Long::compare)
+              .get();
+      List<DeleteFile> addedPartitionFiles = addedFilesPerPartition.get(partition);
+      if (addedPartitionFiles != null) {
+        addedPartitionFiles.forEach(
+            d ->
+                Assert.assertEquals(
+                    "Sequence number should be max of rewritten set",
+                    d.dataSequenceNumber(),
+                    maxRewrittenSeq));
+      }
+    }
+  }
+
+  private StructLikeMap<List<DeleteFile>> groupPerPartition(
+      Table table, List<DeleteFile> deleteFiles) {
+    StructLikeMap<List<DeleteFile>> result =
+        StructLikeMap.create(Partitioning.partitionType(table));
+    for (DeleteFile deleteFile : deleteFiles) {
+      StructLike partition = deleteFile.partition();
+      List<DeleteFile> partitionFiles = result.get(partition);
+      if (partitionFiles == null) {
+        partitionFiles = Lists.newArrayList();
+      }

Review Comment:
   Nit: newline after the if block



##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1129,4 +1145,39 @@ protected ManifestReader<DeleteFile> newManifestReader(ManifestFile manifest) {
       return MergingSnapshotProducer.this.newDeleteManifestReader(manifest);
     }
   }
+
+  private static class DeleteFileHolder {
+    private final DeleteFile deleteFile;
+    private final Long dataSequenceNumber;
+
+    /**
+     * Queue a delete file for commit with a given data sequence number

Review Comment:
   Nit: Not sure if `queue` is the right word here, it's a bit redundant considering the class name but could we just use the word `Holds` or `Contains`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199499158


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -246,12 +246,21 @@ protected void add(DataFile file) {
 
   /** Add a delete file to the new snapshot. */
   protected void add(DeleteFile file) {
-    Preconditions.checkNotNull(file, "Invalid delete file: null");

Review Comment:
   The check moved to the `DeleteFileHolder` constructor, but is present in only one of two constructors now. It would be more explicit in `add`, I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199508053


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1129,4 +1145,39 @@ protected ManifestReader<DeleteFile> newManifestReader(ManifestFile manifest) {
       return MergingSnapshotProducer.this.newDeleteManifestReader(manifest);
     }
   }
+
+  private static class DeleteFileHolder {
+    private final DeleteFile deleteFile;
+    private final Long dataSequenceNumber;
+
+    /**
+     * Queue a delete file for commit with a given data sequence number

Review Comment:
   Yep, changed the name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#issuecomment-1556085041

   Thanks @rdblue @aokolnychyi @amogh-jahagirdar for reviews!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199214912


##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.
+   *
+   * @param deleteFile a new delete file
+   * @param dataSequenceNumber data sequence number to append on the file
+   * @return this for method chaining
+   */
+  default RewriteFiles addFile(DeleteFile deleteFile, long dataSequenceNumber) {

Review Comment:
   My original thought was to use `DeleteFile.dataSequenceNumber()` if set (could be done by extending the builder) but this approach can be a bit more reliable as it is explicit. Question: is there any case when the *added* new delete file data sequence number can be wrong?
   
   Any thoughts, @rdblue @RussellSpitzer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199507845


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -246,12 +246,21 @@ protected void add(DataFile file) {
 
   /** Add a delete file to the new snapshot. */
   protected void add(DeleteFile file) {
-    Preconditions.checkNotNull(file, "Invalid delete file: null");

Review Comment:
   Yea initially tried to move it , but did it incomplete.  Added back to original location.



##########
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesCommitManager.java:
##########
@@ -55,12 +55,13 @@ public void commit(Set<RewritePositionDeletesGroup> fileGroups) {
     RewriteFiles rewriteFiles = table.newRewrite().validateFromSnapshot(startingSnapshotId);
 
     for (RewritePositionDeletesGroup group : fileGroups) {
+      long maxSequenceNumber = group.maxRewrittenDataSequenceNumber();

Review Comment:
   Done, use the call directly



##########
api/src/main/java/org/apache/iceberg/RewriteFiles.java:
##########
@@ -97,6 +97,22 @@ default RewriteFiles addFile(DeleteFile deleteFile) {
         this.getClass().getName() + " does not implement addFile");
   }
 
+  /**
+   * Add a new delete file with the given data sequence number.
+   *
+   * <p>This rewrite operation may change the size or layout of the delete files. When applicable,
+   * it is also recommended to discard delete records for files that are no longer part of the table
+   * state. However, the set of applicable delete records must never change.

Review Comment:
   Added a paragraph below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "rdblue (via GitHub)" <gi...@apache.org>.
rdblue commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199417330


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -246,12 +246,21 @@ protected void add(DataFile file) {
 
   /** Add a delete file to the new snapshot. */
   protected void add(DeleteFile file) {
-    Preconditions.checkNotNull(file, "Invalid delete file: null");

Review Comment:
   Why remove this check? Should this add it back in `add(DeleteFileHolder)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199508255


##########
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java:
##########
@@ -739,4 +789,43 @@ private void checkResult(
         size(newDeletes),
         result.rewriteResults().stream().mapToLong(FileGroupRewriteResult::addedBytesCount).sum());
   }
+
+  private void checkSequenceNumbers(
+      Table table, List<DeleteFile> rewrittenDeletes, List<DeleteFile> addedDeletes) {
+    StructLikeMap<List<DeleteFile>> rewrittenFilesPerPartition =
+        groupPerPartition(table, rewrittenDeletes);
+    StructLikeMap<List<DeleteFile>> addedFilesPerPartition = groupPerPartition(table, addedDeletes);
+    for (StructLike partition : rewrittenFilesPerPartition.keySet()) {
+      Long maxRewrittenSeq =
+          rewrittenFilesPerPartition.get(partition).stream()
+              .map(ContentFile::dataSequenceNumber)
+              .max(Long::compare)
+              .get();
+      List<DeleteFile> addedPartitionFiles = addedFilesPerPartition.get(partition);
+      if (addedPartitionFiles != null) {
+        addedPartitionFiles.forEach(
+            d ->
+                Assert.assertEquals(
+                    "Sequence number should be max of rewritten set",
+                    d.dataSequenceNumber(),
+                    maxRewrittenSeq));
+      }
+    }
+  }
+
+  private StructLikeMap<List<DeleteFile>> groupPerPartition(
+      Table table, List<DeleteFile> deleteFiles) {
+    StructLikeMap<List<DeleteFile>> result =
+        StructLikeMap.create(Partitioning.partitionType(table));
+    for (DeleteFile deleteFile : deleteFiles) {
+      StructLike partition = deleteFile.partition();
+      List<DeleteFile> partitionFiles = result.get(partition);
+      if (partitionFiles == null) {
+        partitionFiles = Lists.newArrayList();
+      }

Review Comment:
   Normally agree, but this is one of those create if not exist for the map key, so logically looked like it fit together here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199508053


##########
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##########
@@ -1129,4 +1145,39 @@ protected ManifestReader<DeleteFile> newManifestReader(ManifestFile manifest) {
       return MergingSnapshotProducer.this.newDeleteManifestReader(manifest);
     }
   }
+
+  private static class DeleteFileHolder {
+    private final DeleteFile deleteFile;
+    private final Long dataSequenceNumber;
+
+    /**
+     * Queue a delete file for commit with a given data sequence number

Review Comment:
   Yep, changed the word.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7651: Core: Compacted position delete files should use the max data sequence number of source files

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #7651:
URL: https://github.com/apache/iceberg/pull/7651#discussion_r1199232563


##########
core/src/main/java/org/apache/iceberg/actions/RewritePositionDeletesGroup.java:
##########
@@ -38,12 +38,18 @@
 public class RewritePositionDeletesGroup {
   private final FileGroupInfo info;
   private final List<PositionDeletesScanTask> tasks;
+  private final long maxRewrittenDataSequenceNumber;
 
   private Set<DeleteFile> addedDeleteFiles = Collections.emptySet();
 
   public RewritePositionDeletesGroup(FileGroupInfo info, List<PositionDeletesScanTask> tasks) {
     this.info = info;
     this.tasks = tasks;
+    this.maxRewrittenDataSequenceNumber =
+        tasks.stream()
+            .map(t -> t.file().dataSequenceNumber())
+            .max(Long::compare)
+            .orElseThrow(() -> new IllegalArgumentException("Empty file group"));

Review Comment:
   Nit: Could we just Preconditions.checkArgument the passed in task list to make sure it's not empty?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org