You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/30 23:12:14 UTC

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2892: API: Add validation that delete file referenced files are not rewritten

aokolnychyi commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r680262290



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       One of the assumptions we made while implementing the new data compaction is that we will always apply deletes while compacting data and will use a new sequence number for compacted files to make sure no deletes apply to them. Is there a good use case not to do that?
   
    > But assuming that you wanted to, this may not even be possible if the files that are rewritten are from different sequence numbers, with different sets of equality delete files that must be applied.
   
   Well, if we wanted to compact data files without applying deletes, the implementation would have to be really complicated to address the point above.
   
   If there is a use case (even though there is no implementation yet), I think we should add this validation. If not, I'd probably skip it. I don't have a use case in mind right now.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org