You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/30 00:03:00 UTC

[GitHub] [iceberg] rdblue opened a new pull request #2892: API: Add validation that delete file referenced files are not rewritten

rdblue opened a new pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892


   This is a follow-up to [OpenInx's comment](https://github.com/apache/iceberg/pull/2865#discussion_r677072072) on #2865 that adds a validation to check that referenced data files are not concurrently rewritten when rewriting delete files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r680262344



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       cc @RussellSpitzer, any use case you have?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#issuecomment-890568188


   @openinx, if you want to add this, please merge it. Otherwise we'll move forward with the release without it. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r679657531



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       Reconsidered the [comment](https://github.com/apache/iceberg/pull/2865#discussion_r677076862),  I agree that conflicts between REPLACE and OVERWRITE/INSERT (say conflict-1) is another different story compared to conflicts between REPLACE and REPLACE (say conflict-2).
   
   For this case `The REPLACE operation remove the data files that was relied by other committed APPEND/OVERWRITE/DELTE operations`,  both conflict-1 and conflict-2 should be avoided because both of them will lead to incorrect data set.
   
   For the other case `The APPEND/OVERWRITE/DELETE operations removed the data files that was relied by other committing REPLACE operation`,   conflict-1 won't lead to incorrect data set although there will be some remaining dangling positional deletes (as you said in this [comment](https://github.com/apache/iceberg/pull/2865#discussion_r679563193)).  but it's possible to lead to incorrect data set when considering conflict-2: 
   
   Step.1  :  Table has FILE_A and EQ_DELETE_FILE_A;
   Step.2  :  RewriteAction_1  rewrite the FILE_A to another FILE_B - not commit; 
   Step.3  :  RewriteAction_2 rewrite the EQ_DELETE_FILE_A to POS_DELETE_FILE_C which reference to FILE_A  - not commit. 
   Step.4. :   Committed RewriteAction_1 ; 
   Step.5  :   Committed RewriteAction_2.
   
   In the end, the POS_DELETE_FILE_C won't be able to apply to the newly rewritten FILE_B, which create the incorrect data set.  Using older sequence number for RewriteAction also cannot fix this bug.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#issuecomment-889538973


   I wanted to post this in case anyone finds it useful, but I don't think this is actually needed. That's why it is a draft. See https://github.com/apache/iceberg/pull/2865#discussion_r679563193 for more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r680262290



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       One of the assumptions we made while implementing the new data compaction is that we will always apply deletes while compacting data and will use a new sequence number for compacted files to make sure no deletes apply to them. Is there a good use case not to do that?
   
    > But assuming that you wanted to, this may not even be possible if the files that are rewritten are from different sequence numbers, with different sets of equality delete files that must be applied.
   
   Well, if we wanted to compact data files without applying deletes, the implementation would have to be really complicated to address the point above.
   
   If there is a use case (even though there is no implementation yet), I think we should add this validation. If not, I'd probably skip it. I don't have a use case in mind right now.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2892: API: Add validation that delete file referenced files are not rewritten

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r680235789



##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
         .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
         .apply();
   }
+
+  @Test
+  public void testRewriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+    // rewrite FILE_A as FILE_A2
+    table.newRewrite()
+        .validateFromSnapshot(table.currentSnapshot().snapshotId())
+        .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+        .commit();
+
+    AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+        ValidationException.class, "Cannot commit, missing data files",
+        () -> table.newRewrite()
+            .validateFromSnapshot(snapshotBeforeRewriteFileA)
+            .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+            .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+            .apply());
+  }
+
+  @Test
+  public void testOverwriteReferencedDataFile() {
+    Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+    table.newAppend()
+        .appendFile(FILE_A)
+        .commit();
+
+    table.newRowDelta()
+        .addDeletes(FILE_A_DELETES)
+        .commit();
+
+    long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+    // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+        .commit();
+
+    long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+    // overwrite FILE_A with FILE_A2
+    table.newOverwrite()
+        .deleteFile(FILE_A)
+        .addFile(FILE_A2)
+        .commit();
+
+    // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+    table.newRewrite()
+        .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+        .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+        .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))

Review comment:
       This is the case I was thinking about in my last comment on the other PR. In order for `RewriteAction_1` to be valid, it must reuse the sequence from `FILE_A` for `FILE_B`. Otherwise, the rewrite on its own would have un-deleted a row because `EQ_DELETE_FILE_A` would no longer apply.
   
   The validation in this PR can catch this case because the files referenced by `POS_DELETE_FILE_C` would be passed to the validation. That's `FILE_A` in this case and the commit for `RewriteAction_2` would check that `FILE_A` still exists and would fail. I'm fine merging this PR if you think that this is something that may happen.
   
   But, I think that it is really unlikely that rewrites will alter sequence numbers to avoid applying deletes. That makes little sense because you may as well apply deletes as long as you're rewriting the data. But assuming that you wanted to, this may not even be possible if the files that are rewritten are from different sequence numbers, with different sets of equality delete files that must be applied. I think a far better option is to apply the deletes when rewriting.
   
   I'm fine merging this if you think we need it. I'll remove the draft status so that we can.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org