You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/04 08:17:37 UTC

[GitHub] [iceberg] openinx opened a new pull request #2294: Core: Support rewriting delete files.

openinx opened a new pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294


   In iceberg format v2,  we've introduced the row-level delete feature,  which would introduce many delete files ( include positional delete files and insert data files).   Currently the **batch** RewriteAction only works for iceberg format v2, say compacting data files only.  Before expose the format v2 to end users, we need to check whether the common features are OK,   we should provide the ability to rewrite both data files and delete files. 
   
   In this PR,  I'm trying to extend the **RewriteFiles** API to work for both data files and delete files, so that we could implement the format v2's  RewriteFiles Action.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r601861488



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,55 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                         Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+
+    if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
+      // When there is no delete files in the rewrite action, data files to add cannot be null or empty.
+      Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
+          "Data files to add can not be null or empty because there's no delete file to rewrite");
+      Preconditions.checkArgument(deleteFilesToAdd == null || deleteFilesToAdd.isEmpty(),
+          "Delete files to add must be null or empty because there's no delete file to rewrite");
+    }
+  }
+
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
-    Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
-        "Files to delete cannot be null or empty");
-    Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),

Review comment:
       I agree that they can be empty, I guess my major point is that since it is the new API we introduced here that allows empty input, we can enforce inputs to be not null by adding a precondition check at the beginning of the method to fail if null is passed in, so that we don't have to do all the `!= null` everywhere to make the code a bit more clean, while still allowing `ImmutableSet.of()` which should logically be equivalent to null.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#issuecomment-792543432


   Ping @rdblue @aokolnychyi @chenjunjiedada for reviewing,  extending the RewriteFiles API is the first step for further format v2's rewrite actions (See the linked issues), thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r601994135



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,55 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                         Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+
+    if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
+      // When there is no delete files in the rewrite action, data files to add cannot be null or empty.
+      Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
+          "Data files to add can not be null or empty because there's no delete file to rewrite");
+      Preconditions.checkArgument(deleteFilesToAdd == null || deleteFilesToAdd.isEmpty(),
+          "Delete files to add must be null or empty because there's no delete file to rewrite");
+    }
+  }
+
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
-    Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
-        "Files to delete cannot be null or empty");
-    Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),

Review comment:
       OK, I think it's good to simplify this null check, just updated this patch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx merged pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx merged pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r590008800



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -38,8 +39,27 @@
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
    * @param filesToDelete files that will be replaced (deleted), cannot be null or empty.
-   * @param filesToAdd files that will be added, cannot be null or empty.
+   * @param filesToAdd    files that will be added, cannot be null or empty.
    * @return this for method chaining
    */
-  RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
+  default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {

Review comment:
       I think you want to expose this API for replacing equality deletions with positional deletions,  for me that seems like an internal usage we may don't have to define such a specific API for end users.  I prefer to expose the following common API to end users,  for our internal usage we could rewrite files based on that one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r596629785



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,55 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                         Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+
+    if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
+      // When there is no delete files in the rewrite action, data files to add cannot be null or empty.
+      Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
+          "Data files to add can not be null or empty because there's no delete file to rewrite");
+      Preconditions.checkArgument(deleteFilesToAdd == null || deleteFilesToAdd.isEmpty(),
+          "Delete files to add must be null or empty because there's no delete file to rewrite");
+    }
+  }
+
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
-    Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
-        "Files to delete cannot be null or empty");
-    Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),

Review comment:
       > Looks like we are changing the logic now to not enforce input sets to be non-nullable?
   
   Yes, in current version we are required to pass  non-empty and non-null `Set` because we must ensure the input files and output files have some rows to rewrite & generate.  After introducing the format v2,  the data input files can be empty (if we plan to convert eq-deletes to pos-deletes),  then I think we also don't have to require the user to pass a non-null empty set. Here passing a `null` or `ImmutableMap.of()` , both of them looks good to me.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r589272454



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -38,8 +39,27 @@
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
    * @param filesToDelete files that will be replaced (deleted), cannot be null or empty.
-   * @param filesToAdd files that will be added, cannot be null or empty.
+   * @param filesToAdd    files that will be added, cannot be null or empty.
    * @return this for method chaining
    */
-  RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
+  default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {

Review comment:
       How about adding `RewriteFiles rewriteDeletes(Set<DeleteFile> fileToDelete, Set<DeleteFile> filesToAdd` for rewrite deletes? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#issuecomment-790442945


   > I wrote an API in #2216 you may be interested.
   
   Thanks for the great work,  I'd like to review that patch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#issuecomment-790427191


   I wrote an API in https://github.com/apache/iceberg/pull/2216 you may be interested.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#issuecomment-808002861


   Filed an issue to address the flakey unit test https://github.com/apache/iceberg/issues/2384, will retry the travis CI. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r590016384



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,62 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void checkFilesToDelete(Set<DataFile> dataFilesToDelete,
+                                  Set<DeleteFile> deleteFilesToDelete) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+  }
+
+  private void checkFilesToAdd(Set<DataFile> dataFilesToAdd,
+                               Set<DeleteFile> deleteFilesToAdd) {
+    int filesToAdd = 0;
+    if (dataFilesToAdd != null) {
+      filesToAdd += dataFilesToAdd.size();
+    }
+
+    if (deleteFilesToAdd != null) {
+      filesToAdd += deleteFilesToAdd.size();
+    }
+
+    Preconditions.checkArgument(filesToAdd > 0, "Files to add can not be null or empty");
+  }
+
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
-    Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
-        "Files to delete cannot be null or empty");
-    Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),
-        "Files to add can not be null or empty");
+  public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                   Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    checkFilesToDelete(dataFilesToDelete, deleteFilesToDelete);
+    checkFilesToAdd(dataFilesToAdd, deleteFilesToAdd);

Review comment:
       The files to add seems could be empty since we have introduced the row-level deletes.   For example,  we insert <1, 'aaa'>. then delete <1, 'aaa'>, finally do the rewrite files action. In the end, there should be no file to add when rewriting.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx closed pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx closed pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r596637870



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -228,6 +383,163 @@ public void testRecovery() {
     Assert.assertEquals("Only 3 manifests should exist", 3, listManifestFiles().size());
   }
 
+  @Test
+  public void testRecoverWhenRewriteBothDataAndDeleteFiles() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A)
+        .addRows(FILE_B)
+        .addRows(FILE_C)
+        .addDeletes(FILE_A_DELETES)
+        .addDeletes(FILE_B_DELETES)
+        .commit();
+
+    long baseSnapshotId = readMetadata().currentSnapshot().snapshotId();
+    table.ops().failCommits(3);
+
+    RewriteFiles rewrite = table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_A_DELETES, FILE_B_DELETES),
+            ImmutableSet.of(FILE_D), ImmutableSet.of());
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(pending.snapshotId()),
+        files(FILE_D),
+        statuses(ADDED));
+
+    validateManifestEntries(manifest2,
+        ids(pending.snapshotId(), baseSnapshotId, baseSnapshotId),
+        files(FILE_A, FILE_B, FILE_C),
+        statuses(DELETED, EXISTING, EXISTING));
+
+    validateDeleteManifest(manifest3,
+        seqs(2, 2),
+        ids(pending.snapshotId(), pending.snapshotId()),
+        files(FILE_A_DELETES, FILE_B_DELETES),
+        statuses(DELETED, DELETED));
+
+    rewrite.commit();
+
+    Assert.assertTrue("Should reuse new manifest", new File(manifest1.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest2.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest3.path()).exists());
+
+    TableMetadata metadata = readMetadata();
+    List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
+    Assert.assertTrue("Should committed the manifests",
+        metadata.currentSnapshot().allManifests().containsAll(committedManifests));
+
+    // As commit success all the manifests added with rewrite should be available.
+    Assert.assertEquals("Only 5 manifest should exist", 5, listManifestFiles().size());
+  }
+
+  @Test
+  public void testReplaceEqualityDeletesWithPositionDeletes() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A2)
+        .addDeletes(FILE_A2_DELETES)
+        .commit();
+
+    TableMetadata metadata = readMetadata();
+    long baseSnapshotId = metadata.currentSnapshot().snapshotId();
+
+    // Apply and commit the rewrite transaction.
+    RewriteFiles rewrite = table.newRewrite().rewriteFiles(
+        ImmutableSet.of(), ImmutableSet.of(FILE_A2_DELETES),
+        ImmutableSet.of(), ImmutableSet.of(FILE_B_DELETES)
+    );
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(baseSnapshotId),

Review comment:
       I think we'd better to add the seq number verification after we have reached agreement on this issue: https://github.com/apache/iceberg/issues/2308. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r596622281



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -35,11 +36,30 @@
  */
 public interface RewriteFiles extends SnapshotUpdate<RewriteFiles> {
   /**
-   * Add a rewrite that replaces one set of files with another set that contains the same data.
+   * Add a rewrite that replaces one set of data files with another set that contains the same data.
    *
    * @param filesToDelete files that will be replaced (deleted), cannot be null or empty.
-   * @param filesToAdd files that will be added, cannot be null or empty.
+   * @param filesToAdd    files that will be added, cannot be null or empty.
    * @return this for method chaining
    */
-  RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
+  default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
+    return rewriteFiles(
+        filesToDelete,
+        ImmutableSet.of(),
+        filesToAdd,
+        ImmutableSet.of()
+    );
+  }
+
+  /**
+   * Add a rewrite that replaces one set of files with another set that contains the same data (format v2).

Review comment:
       Agreed !




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#discussion_r596509652



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -228,6 +383,163 @@ public void testRecovery() {
     Assert.assertEquals("Only 3 manifests should exist", 3, listManifestFiles().size());
   }
 
+  @Test
+  public void testRecoverWhenRewriteBothDataAndDeleteFiles() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A)
+        .addRows(FILE_B)
+        .addRows(FILE_C)
+        .addDeletes(FILE_A_DELETES)
+        .addDeletes(FILE_B_DELETES)
+        .commit();
+
+    long baseSnapshotId = readMetadata().currentSnapshot().snapshotId();
+    table.ops().failCommits(3);
+
+    RewriteFiles rewrite = table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_A_DELETES, FILE_B_DELETES),
+            ImmutableSet.of(FILE_D), ImmutableSet.of());
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(pending.snapshotId()),
+        files(FILE_D),
+        statuses(ADDED));
+
+    validateManifestEntries(manifest2,
+        ids(pending.snapshotId(), baseSnapshotId, baseSnapshotId),
+        files(FILE_A, FILE_B, FILE_C),
+        statuses(DELETED, EXISTING, EXISTING));
+
+    validateDeleteManifest(manifest3,
+        seqs(2, 2),
+        ids(pending.snapshotId(), pending.snapshotId()),
+        files(FILE_A_DELETES, FILE_B_DELETES),
+        statuses(DELETED, DELETED));
+
+    rewrite.commit();
+
+    Assert.assertTrue("Should reuse new manifest", new File(manifest1.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest2.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest3.path()).exists());
+
+    TableMetadata metadata = readMetadata();
+    List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
+    Assert.assertTrue("Should committed the manifests",
+        metadata.currentSnapshot().allManifests().containsAll(committedManifests));

Review comment:
       nit: could do `metadata.currentSnapshot().allManifests().equals(committedManifests)`

##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -228,6 +383,163 @@ public void testRecovery() {
     Assert.assertEquals("Only 3 manifests should exist", 3, listManifestFiles().size());
   }
 
+  @Test
+  public void testRecoverWhenRewriteBothDataAndDeleteFiles() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A)
+        .addRows(FILE_B)
+        .addRows(FILE_C)
+        .addDeletes(FILE_A_DELETES)
+        .addDeletes(FILE_B_DELETES)
+        .commit();
+
+    long baseSnapshotId = readMetadata().currentSnapshot().snapshotId();
+    table.ops().failCommits(3);
+
+    RewriteFiles rewrite = table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_A_DELETES, FILE_B_DELETES),
+            ImmutableSet.of(FILE_D), ImmutableSet.of());
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(pending.snapshotId()),
+        files(FILE_D),
+        statuses(ADDED));
+
+    validateManifestEntries(manifest2,
+        ids(pending.snapshotId(), baseSnapshotId, baseSnapshotId),
+        files(FILE_A, FILE_B, FILE_C),
+        statuses(DELETED, EXISTING, EXISTING));
+
+    validateDeleteManifest(manifest3,
+        seqs(2, 2),
+        ids(pending.snapshotId(), pending.snapshotId()),
+        files(FILE_A_DELETES, FILE_B_DELETES),
+        statuses(DELETED, DELETED));
+
+    rewrite.commit();
+
+    Assert.assertTrue("Should reuse new manifest", new File(manifest1.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest2.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest3.path()).exists());
+
+    TableMetadata metadata = readMetadata();
+    List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
+    Assert.assertTrue("Should committed the manifests",
+        metadata.currentSnapshot().allManifests().containsAll(committedManifests));
+
+    // As commit success all the manifests added with rewrite should be available.
+    Assert.assertEquals("Only 5 manifest should exist", 5, listManifestFiles().size());
+  }
+
+  @Test
+  public void testReplaceEqualityDeletesWithPositionDeletes() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A2)
+        .addDeletes(FILE_A2_DELETES)
+        .commit();
+
+    TableMetadata metadata = readMetadata();
+    long baseSnapshotId = metadata.currentSnapshot().snapshotId();
+
+    // Apply and commit the rewrite transaction.
+    RewriteFiles rewrite = table.newRewrite().rewriteFiles(
+        ImmutableSet.of(), ImmutableSet.of(FILE_A2_DELETES),
+        ImmutableSet.of(), ImmutableSet.of(FILE_B_DELETES)
+    );
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(baseSnapshotId),
+        files(FILE_A2),
+        statuses(ADDED));
+
+    validateDeleteManifest(manifest2,
+        seqs(2),
+        ids(pending.snapshotId()),
+        files(FILE_B_DELETES),
+        statuses(ADDED));
+
+    validateDeleteManifest(manifest3,
+        seqs(2),
+        ids(pending.snapshotId()),
+        files(FILE_A2_DELETES),
+        statuses(DELETED));
+
+    rewrite.commit();
+
+    Assert.assertTrue("Should reuse new manifest", new File(manifest1.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest2.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest3.path()).exists());
+
+    metadata = readMetadata();
+    List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
+    Assert.assertTrue("Should committed the manifests",
+        metadata.currentSnapshot().allManifests().containsAll(committedManifests));

Review comment:
       nit: also `equals` here

##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -35,11 +36,30 @@
  */
 public interface RewriteFiles extends SnapshotUpdate<RewriteFiles> {
   /**
-   * Add a rewrite that replaces one set of files with another set that contains the same data.
+   * Add a rewrite that replaces one set of data files with another set that contains the same data.
    *
    * @param filesToDelete files that will be replaced (deleted), cannot be null or empty.
-   * @param filesToAdd files that will be added, cannot be null or empty.
+   * @param filesToAdd    files that will be added, cannot be null or empty.
    * @return this for method chaining
    */
-  RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd);
+  default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
+    return rewriteFiles(
+        filesToDelete,
+        ImmutableSet.of(),
+        filesToAdd,
+        ImmutableSet.of()
+    );
+  }
+
+  /**
+   * Add a rewrite that replaces one set of files with another set that contains the same data (format v2).

Review comment:
       nit: probably don't need to mention format v2 specifically

##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -164,6 +207,66 @@ public void testAddAndDelete() {
     Assert.assertEquals("Only 3 manifests should exist", 3, listManifestFiles().size());
   }
 
+  @Test
+  public void testRewriteDataAndDeleteFiles() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+    Assert.assertEquals("Table should start empty", 0, listManifestFiles().size());
+
+    table.newRowDelta()
+        .addRows(FILE_A)
+        .addRows(FILE_B)
+        .addRows(FILE_C)
+        .addDeletes(FILE_A_DELETES)
+        .addDeletes(FILE_B_DELETES)
+        .commit();
+
+    TableMetadata base = readMetadata();
+    Snapshot baseSnap = base.currentSnapshot();
+    long baseSnapshotId = baseSnap.snapshotId();
+    Assert.assertEquals("Should create 2 manifests for initial write", 2, baseSnap.allManifests().size());
+    List<ManifestFile> initialManifests = baseSnap.allManifests();
+
+    validateManifestEntries(initialManifests.get(0),
+        ids(baseSnapshotId, baseSnapshotId, baseSnapshotId),
+        files(FILE_A, FILE_B, FILE_C),
+        statuses(ADDED, ADDED, ADDED));
+    validateDeleteManifest(initialManifests.get(1),
+        seqs(1, 1),
+        ids(baseSnapshotId, baseSnapshotId),
+        files(FILE_A_DELETES, FILE_B_DELETES),
+        statuses(ADDED, ADDED));
+
+    // Rewrite the files.
+    Snapshot pending = table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_A_DELETES),
+            ImmutableSet.of(FILE_D), ImmutableSet.of())
+        .apply();
+
+    Assert.assertEquals("Should contain 3 manifest", 3, pending.allManifests().size());
+    Assert.assertFalse("Should not contain manifest from initial write",
+        pending.allManifests().containsAll(initialManifests));

Review comment:
       I think we want to check if any manifest matches initial manifests by `.allManifests.stream().anyMatch(initialManifests::contains)` ?

##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,55 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                         Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+
+    if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
+      // When there is no delete files in the rewrite action, data files to add cannot be null or empty.
+      Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
+          "Data files to add can not be null or empty because there's no delete file to rewrite");
+      Preconditions.checkArgument(deleteFilesToAdd == null || deleteFilesToAdd.isEmpty(),
+          "Delete files to add must be null or empty because there's no delete file to rewrite");
+    }
+  }
+
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> filesToAdd) {
-    Preconditions.checkArgument(filesToDelete != null && !filesToDelete.isEmpty(),
-        "Files to delete cannot be null or empty");
-    Preconditions.checkArgument(filesToAdd != null && !filesToAdd.isEmpty(),

Review comment:
       Looks like we are changing the logic now to not enforce input sets to be non-nullable? I think for the new code we can do a precondition check on the four input sets to ensure they are all non-null, to save all the null check everywhere

##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -228,6 +383,163 @@ public void testRecovery() {
     Assert.assertEquals("Only 3 manifests should exist", 3, listManifestFiles().size());
   }
 
+  @Test
+  public void testRecoverWhenRewriteBothDataAndDeleteFiles() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A)
+        .addRows(FILE_B)
+        .addRows(FILE_C)
+        .addDeletes(FILE_A_DELETES)
+        .addDeletes(FILE_B_DELETES)
+        .commit();
+
+    long baseSnapshotId = readMetadata().currentSnapshot().snapshotId();
+    table.ops().failCommits(3);
+
+    RewriteFiles rewrite = table.newRewrite()
+        .rewriteFiles(ImmutableSet.of(FILE_A), ImmutableSet.of(FILE_A_DELETES, FILE_B_DELETES),
+            ImmutableSet.of(FILE_D), ImmutableSet.of());
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(pending.snapshotId()),
+        files(FILE_D),
+        statuses(ADDED));
+
+    validateManifestEntries(manifest2,
+        ids(pending.snapshotId(), baseSnapshotId, baseSnapshotId),
+        files(FILE_A, FILE_B, FILE_C),
+        statuses(DELETED, EXISTING, EXISTING));
+
+    validateDeleteManifest(manifest3,
+        seqs(2, 2),
+        ids(pending.snapshotId(), pending.snapshotId()),
+        files(FILE_A_DELETES, FILE_B_DELETES),
+        statuses(DELETED, DELETED));
+
+    rewrite.commit();
+
+    Assert.assertTrue("Should reuse new manifest", new File(manifest1.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest2.path()).exists());
+    Assert.assertTrue("Should reuse new manifest", new File(manifest3.path()).exists());
+
+    TableMetadata metadata = readMetadata();
+    List<ManifestFile> committedManifests = Lists.newArrayList(manifest1, manifest2, manifest3);
+    Assert.assertTrue("Should committed the manifests",
+        metadata.currentSnapshot().allManifests().containsAll(committedManifests));
+
+    // As commit success all the manifests added with rewrite should be available.
+    Assert.assertEquals("Only 5 manifest should exist", 5, listManifestFiles().size());
+  }
+
+  @Test
+  public void testReplaceEqualityDeletesWithPositionDeletes() {
+    Assume.assumeTrue("Rewriting delete files is only supported in iceberg format v2. ", formatVersion > 1);
+
+    table.newRowDelta()
+        .addRows(FILE_A2)
+        .addDeletes(FILE_A2_DELETES)
+        .commit();
+
+    TableMetadata metadata = readMetadata();
+    long baseSnapshotId = metadata.currentSnapshot().snapshotId();
+
+    // Apply and commit the rewrite transaction.
+    RewriteFiles rewrite = table.newRewrite().rewriteFiles(
+        ImmutableSet.of(), ImmutableSet.of(FILE_A2_DELETES),
+        ImmutableSet.of(), ImmutableSet.of(FILE_B_DELETES)
+    );
+    Snapshot pending = rewrite.apply();
+
+    Assert.assertEquals("Should produce 3 manifests", 3, pending.allManifests().size());
+    ManifestFile manifest1 = pending.allManifests().get(0);
+    ManifestFile manifest2 = pending.allManifests().get(1);
+    ManifestFile manifest3 = pending.allManifests().get(2);
+
+    validateManifestEntries(manifest1,
+        ids(baseSnapshotId),

Review comment:
       nit: should we verify that sequence number for this is 1? 

##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -40,19 +40,55 @@ protected String operation() {
     return DataOperations.REPLACE;
   }
 
+  private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+                                         Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
+    int filesToDelete = 0;
+    if (dataFilesToDelete != null) {
+      filesToDelete += dataFilesToDelete.size();
+    }
+
+    if (deleteFilesToDelete != null) {
+      filesToDelete += deleteFilesToDelete.size();
+    }
+
+    Preconditions.checkArgument(filesToDelete > 0, "Files to delete cannot be null or empty");
+
+    if (deleteFilesToDelete == null || deleteFilesToDelete.isEmpty()) {
+      // When there is no delete files in the rewrite action, data files to add cannot be null or empty.
+      Preconditions.checkArgument(dataFilesToAdd != null && dataFilesToAdd.size() > 0,
+          "Data files to add can not be null or empty because there's no delete file to rewrite");

Review comment:
       Nit: might want to update to something like "Data files to add can not be null or empty when there is no delete file to be rewritten", similar applies to L61




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2294: Core: Support rewriting delete files.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2294:
URL: https://github.com/apache/iceberg/pull/2294#issuecomment-809034581


   Got this merged,  thanks @yyanyy and @chenjunjiedada for reviewing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org