You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/25 22:13:38 UTC

[GitHub] [iceberg] rdblue opened a new pull request #2865: Core: Add validation for row-level deletes with rewrites

rdblue opened a new pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865


   This adds a new validation for `RewriteFiles` that checks whether delete files have been added for any of the files that are being rewritten. The check constructs a delete file index for all the delete files in snapshots written since a starting snapshot and filters the delete file entries using the sequence number of the starting snapshot to ensure that only delete files that were added after the starting snapshot are indexed. Next, if any data file being replaced matches a delete file in the index, a validation exception is thrown.
   
   This also updates actions to correctly set the starting snapshot ID for validation.
   
   This fixes #2308.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677027772



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,38 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files have been added to the table since a starting snapshot.

Review comment:
       I updated the wording here to mention that it tests the data files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676775636



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations

Review comment:
       No. A rewrite may not change the data in the table. As a result, delete rewrites may rewrite an equality delete file to position deletes, or may rewrite position deletes in a more efficient file layout. That cannot change the actual deletes. A compaction operation must apply or carry forward any deletes to the data files that are rewritten, so it must be aware of any existing deletes at the start of the operation. That means a compaction already knows about all of the deletes rewritten by a concurrent delete rewrite, and so it is safe for the two to happen at the same time.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677072072



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       I think we should also handle the case : 
   
   * t1:  table has FILE_A and EQ_DELETE_FILE_B;
   * t2:  RewriteFiles action to convert the EQ_DELETE_FILE_B to POS_DELETE_FILE_C which hold the file_id and offset from FILE_A,  just started the txn but not commit;
   * t3:  DeleteFiles action committed the txn which deletes  the FILE_A;
   * t4:  Try to commit the RewriteFiles action. 
   
   At timestamp `t4`,  the RewriteFiles txn should be aborted because the data file FILE_A which is relied by POS_DELETE_FILE_C has been deleted. 
   
   Do we plan to add this validation in the following PR, or did not consider this case because we currently don't have introduced any RewriteFiles action that converting equality delete files into positional delete files ? 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676557241



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -63,15 +67,16 @@ private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<Dele
   }
 
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                                    Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
-    verifyInputAndOutputFiles(dataFilesToDelete, deleteFilesToDelete, dataFilesToAdd, deleteFilesToAdd);
+    verifyInputAndOutputFiles(dataFilesToReplace, deleteFilesToReplace, dataFilesToAdd, deleteFilesToAdd);

Review comment:
       As we've already changed the `dataFilesToDelete` and `deleteFilesToDelete` to `dataFilesToReplace` & `deleteFilesToReplace`,  then I think we should also change the all the variables & arguments in `verifyInputAndOutputFiles` for consistence ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676569195



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -322,31 +365,25 @@ protected void validateDataFilesExist(TableMetadata base, Long startingSnapshotI
 
       if (matchingOperations.contains(currentSnapshot.operation())) {
         newSnapshots.add(currentSnapshotId);
-        for (ManifestFile manifest : currentSnapshot.dataManifests()) {
-          if (manifest.snapshotId() == (long) currentSnapshotId) {
-            manifests.add(manifest);
+        if (content == ManifestContent.DATA) {
+          for (ManifestFile manifest : currentSnapshot.dataManifests()) {
+            if (manifest.snapshotId() == (long) currentSnapshotId) {

Review comment:
       Is it possible that the `manifest.snapshotId()` returns a `null` value  ? If so, then we'd better to change this line to `Objects.equals(manifest.snapshotId(), currentSnapshotId)` ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676708611



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -309,6 +327,31 @@ protected void validateDataFilesExist(TableMetadata base, Long startingSnapshotI
         VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS :
         VALIDATE_DATA_FILES_EXIST_OPERATIONS;
 
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, matchingOperations, ManifestContent.DATA);
+    List<ManifestFile> manifests = history.first();
+    Set<Long> newSnapshots = history.second();
+
+    ManifestGroup matchingDeletesGroup = new ManifestGroup(ops.io(), manifests, ImmutableList.of())
+        .filterManifestEntries(entry -> entry.status() != ManifestEntry.Status.ADDED &&
+            newSnapshots.contains(entry.snapshotId()) && requiredDataFiles.contains(entry.file().path()))
+        .specsById(base.specsById())
+        .ignoreExisting();
+
+    try (CloseableIterator<ManifestEntry<DataFile>> deletes = matchingDeletesGroup.entries().iterator()) {
+      if (deletes.hasNext()) {
+        throw new ValidationException("Cannot commit, missing data files: %s",

Review comment:
       "Cannot commit, deletes have been added to the table which refer to data files which no longer exist"?
   
   I think I go that right here ... It would help me a lot if the message here was a bit more explanatory about what went wrong and why.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#issuecomment-886264910


   @cwsteinbach FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677027062



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =

Review comment:
       I had the same thought about using the DELETE operation type when we just add delete files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676779653



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots
+   * through the table's initial snapshot are validated.
+   *
+   * @param snapshotId a snapshot ID
+   * @return this for method chaining
+   */
+  RewriteFiles validateFromSnapshot(long snapshotId);

Review comment:
       You can also see this in the test case, where we actually create commits and then run the rewrite differently to simulate concurrent operations: https://github.com/apache/iceberg/pull/2865/files#diff-b914d99ace31a09150792c10e3b0f1e40331edb1c24233ee6912d4608cfde0c4R615-R643




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676552928



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots

Review comment:
       Nit:  `If the from snapshot is not set`,   do we miss a `ID` in this line ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676759679



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -63,15 +67,16 @@ private void verifyInputAndOutputFiles(Set<DataFile> dataFilesToDelete, Set<Dele
   }
 
   @Override
-  public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                                    Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd) {
-    verifyInputAndOutputFiles(dataFilesToDelete, deleteFilesToDelete, dataFilesToAdd, deleteFilesToAdd);
+    verifyInputAndOutputFiles(dataFilesToReplace, deleteFilesToReplace, dataFilesToAdd, deleteFilesToAdd);

Review comment:
       @openinx, I had a version where I did that and fixed the error messages, but it ended up being too many unrelated test changes so I'll do that in a follow-up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677076862



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       In my thought,   the data conflicts between REPLACE operation and APPEND/OVERWRITE/DELETE operations is because of the dependency relationship from positional delete files to data files. Then there will be two cases:
   
   1.  The REPLACE operation remove the data files that was relied by other committed APPEND/OVERWRITE/DELTE  operations ( among the three operations,  currently only the RowDelta will introduce new positional delete files which reference to committed data files).  I think we handle this case perfectly in this PR.
   2. The APPEND/OVERWRITE/DELETE operations removed the data files that was relied by other committing REPLACE operation.  The above [comment](https://github.com/apache/iceberg/pull/2865/files#r677072072) is an example. 
   
   As a final solution,  I think we should handle both the cases perfectly, so that we won't encounter any unexpected interruption when reading the iceberg table.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677076862



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       In my thought,   the data conflicts between REPLACE operation and APPEND/OVERWRITE/DELETE operations is because of the dependency relationship from positional delete files to data files. Then there will be two cases:
   
   1.  The REPLACE operation remove the data files that was relied by other committed APPEND/OVERWRITE/DELTE  operations ( among the three operations,  currently only the RowDelta will introduce new positional delete files which reference to committed data files).  I think we handle this case perfectly in this PR.
   2. The APPEND/OVERWRITE/DELETE operations remove the data files that was relied by other committed REPLACE operation.  The above [comment](https://github.com/apache/iceberg/pull/2865/files#r677072072) is an example. 
   
   As a final solution,  I think we should handle both the cases perfectly, so that we won't encounter any unexpected interruption when reading the iceberg table.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676742860



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots
+   * through the table's initial snapshot are validated.
+   *
+   * @param snapshotId a snapshot ID
+   * @return this for method chaining
+   */
+  RewriteFiles validateFromSnapshot(long snapshotId);

Review comment:
       I was wondering if this needs to be an open api? Seems like we always want to set it to table.currentSnapshot.snapshotID?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677026422



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =

Review comment:
       I'll update this. And I'll also add "delete" as an operation. Just because we only create "overwrite" commits for row-level deletes today doesn't mean that we won't create "delete" commits later. I considered updating the `RowDelta` operation so that it created delete commits if only delete files were added, but didn't end up doing it because we don't do that for overwrites that only remove data files.
   
   We can discuss whether to convert overwrite commits to delete commits that later, but for now I think that we should add "delete" to this list because it would be a valid commit. Just because the reference implementation doesn't produce them doesn't mean it isn't valid.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677083552



##########
File path: core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
##########
@@ -312,6 +312,7 @@ static Builder builderFor(FileIO io, Iterable<ManifestFile> deleteManifests) {
   static class Builder {
     private final FileIO io;
     private final Set<ManifestFile> deleteManifests;
+    private long minSequenceNumber = 0L;

Review comment:
       I remember the sequence number 0 is kept for the data files for iceberg v1, so in theory the sequence number from delete files should start from 1.   So setting it to 0 as the default value sounds correct.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677025572



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+      ImmutableSet.of(DataOperations.OVERWRITE);

Review comment:
       I've considered this before and keep coming back to not splitting the existing operations. Right now, the operations convey what's happening to the table at a logical level. An append never removes rows, a delete never adds rows, and a replace never changes rows. Then overwrite is used for anything that adds and removes rows in the same commit. I think those definitions are easy to understand and reason about.
   
   If we added an operation for deltas, then we would either add a single "delta" operation or "delta-delete", "delta-overwrite", and "delta-replace" operations. If we use just "delta" then it wouldn't be clear what was happening logically, so we'd lose the ability to ignore a row-level deletes in a logical replace/rewrite. So I would conclude that we can either introduce 3 new operations or no new operations.
   
   So far, I don't think there's a strong justification for adding the 3 new operations because the existing ones already cover what you're doing to the data logically. We would only be adding more information about how that logical operation is carried out with files. There may be some value there, but I'd probably instead add a new property to signal that row-level deletes were used so that we don't alter the contract of `DataOperation`.
   
   Then the question becomes whether it is useful to know that row-level deletes are present. If so, we can add a property for it. In this case, it would save us some time reading the manifest list for an overwrite or delete. Do you think we should add something to signal this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r679563193



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       @openinx, I opened a draft PR, https://github.com/apache/iceberg/pull/2892, for to fix this, but I'm not sure that we actually want to merge it.
   
   Originally, I copied the `validateDataFilesExist` implementation from `RowDelta` and added tests. That works great and is a fairly simple commit if we want it. But as I was testing it, I added cases for delete, overwrite, rewrite, and a transactional delete and append to overwrite.
   
   Then I realized that not all of those cases apply. For example, if I delete a file that has row deltas, that's perfectly okay. The operation actually deletes all rows that weren't already deleted and any dangling positional deletes will just never get applied. Similarly, an overwrite from, for example, a MERGE INTO operation that replaces a file will need to first read that file and then build its replacement. The read must merge the existing deletes, so the rows will already be removed from the replacement file. Then the deletes in that case are also fine to leave dangling even if they are rewritten into a new file.
   
   I ended up leaving a validation that the data files are not rewritten, but I also think that may not be necessary because I don't think it would be reasonable to rewrite a data file without also applying the deletes. If we build a compaction operation we will probably apply deletes during compaction. We could also build one that compacts just data files, but in order for that operation to be correct it would also need to rewrite delete files (that are pointing to the compacted data files) or use an older sequence number with equality deletes. If it rewrote delete files, then the delete rewrite would fail. If it changed sequence numbers, then we would have a lot more validation to do when building that plan.
   
   After thinking about it, I don't think that we actually need to handle any cases where a compaction and delete file rewrite are concurrent, at least not from the delete file rewrite side. Any compaction should handle deletes as part of the compaction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677027062



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =

Review comment:
       I had the same thought about using DELETE operation types when we just add delete files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677009679



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+      ImmutableSet.of(DataOperations.OVERWRITE);

Review comment:
       I wonder whether we should introduce a new `DataOperation` for row deltas before adopting v2. Right now, we use `OVERWRITE` for deltas as well as other operations such copy-on-write MERGE and replace partitions. This means the new validation logic will apply to operations that cannot produce delete files.
   
   It probably does not matter much in this particular use case as the delete index will be empty but it is something we should do now or never.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677010425



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =

Review comment:
       nit: I am not sure the var name conveys its purpose and does not follow the convention compared used in 2 vars above. If I understand correctly, this is a set of operations that may produce delete files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676778611



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -309,6 +327,31 @@ protected void validateDataFilesExist(TableMetadata base, Long startingSnapshotI
         VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS :
         VALIDATE_DATA_FILES_EXIST_OPERATIONS;
 
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, matchingOperations, ManifestContent.DATA);
+    List<ManifestFile> manifests = history.first();
+    Set<Long> newSnapshots = history.second();
+
+    ManifestGroup matchingDeletesGroup = new ManifestGroup(ops.io(), manifests, ImmutableList.of())
+        .filterManifestEntries(entry -> entry.status() != ManifestEntry.Status.ADDED &&
+            newSnapshots.contains(entry.snapshotId()) && requiredDataFiles.contains(entry.file().path()))
+        .specsById(base.specsById())
+        .ignoreExisting();
+
+    try (CloseableIterator<ManifestEntry<DataFile>> deletes = matchingDeletesGroup.entries().iterator()) {
+      if (deletes.hasNext()) {
+        throw new ValidationException("Cannot commit, missing data files: %s",

Review comment:
       This is an existing error message. We can fix it, but I'd prefer to do it in a separate commit so that we don't have changes to more test cases than necessary. I also plan to fix the initial validation in `RewriteFiles` later for the same reason. Is it okay to leave this as-is?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676770243



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots
+   * through the table's initial snapshot are validated.
+   *
+   * @param snapshotId a snapshot ID
+   * @return this for method chaining
+   */
+  RewriteFiles validateFromSnapshot(long snapshotId);

Review comment:
       This is necessary because the current snapshot's ID may not be the one that was used for the operation. This `RewriteFiles` operation may be created significantly later than the original planning that is done for compaction. For example, in the new Spark compaction code, the rewrite is created when a group is ready to commit. The table state may have changed from that time so using the current snapshot ID could easily skip delta commits that were interleaved, leaving the same issue that we are fixing here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677010865



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {

Review comment:
       nit: Do we have to iterate through data files if the index is empty and we did not find any delete manifests? This logic will be triggered during various operations including copy-on-write MERGE.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677027300



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =

Review comment:
       Actually, I think the variable name does follow the convention. If we are validating replaced data files then those are the types of commits that are needed for validation.
   
   I'm going to update this by adding `delete`, but not renaming.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677020238



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {

Review comment:
       The data files here are the ones that are in memory because they are being replaced. I don't think that this is going to be a significant slow-down since we're just doing an index check, but we can follow up with a couple improvements to make it faster.
   
   One improvement I'd opt for before avoiding this loop is to only read manifest files that were created in the new snapshots. That is, when the snapshot ID of the delete file is one of the snapshots newer than the starting snapshot. We don't currently do that because the delete file index builder doesn't support it and it looked more invasive to update the index builder (and we're trying to get 0.12.0 out).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r680229193



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       > You mean people will need to know whether the data-files-only compaction will break the existing data set ? This puts too much burden on users ? I think we still need to provide the validation because we developers are sure that those cases will break the data correctness ...
   
   The contract of the `RewriteFiles` operation is that it does not change the table data. That is explicitly stated in the `rewriteFiles` configuration method's javadoc:
   
   > Add a rewrite that replaces one set of files with another set that **contains the same data**.
   
   I think that is a reasonable requirement for the operation. Compacting only data files without accounting for deletes would change the table data, even without a concurrent delete file rewrite. For example, if I have
   
   | `file-a.parquet` |
   | -- |
   | `1, 'x'` |
   | `2, 'y'` |
   
   | `file-b.parquet` |
   | -- |
   | `3, 'z'` |
   
   | `file-a-deletes.parquet` |
   | -- |
   | `'file-a.parquet', 0` |
   
   Compacting `file-a.parquet` and `file-b.parquet` together would undelete row `id=1`. There is no way for Iceberg to check whether the rewrite ignored `file-a-deletes.parquet` or applied the deletes. The correctness problem is entirely in the compaction operation.
   
   We have to trust the `RewriteFiles` caller to do the right thing and preserve correctness.
   
   That means either the caller must do one of the following:
   1. Remove the deleted row
   2. Rewrite `file-a-deletes.parquet` to point the delete at the new data file
   3. Write the new data file with a compatible sequence number (only valid if the deletes are equality deletes so they would still apply)
   
   If the caller uses option 1, then it doesn't matter to concurrent operations that are rewriting the delete. It becomes a dangling delete that doesn't need to be applied. If the caller uses option 2, then there is a new file tracking the delete that would not be part of a concurrent rewrite operation; a concurrent rewrite may contain a dangling delete.
   
   The only case we would need to worry about is option 3, and only if a concurrent delete rewrites an equality delete as a positional delete. To be careful, we could merge #2892 to be able to validate this case. But I'm skeptical that altering the sequence number of the compacted file is a good idea. It may not even be possible if the sequence number of a delete that must be applied is between the sequence numbers of the starting data files.
   
   As I said, I'm okay not merging the validation for option 3 because I think it is unlikely that this is going to be a viable strategy. But I'm fine merging #2892 if you think it is needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676235707



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations

Review comment:
       The rewrite deletes action will replace old deletes with new ones. Does that count add deletes operation?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676768147



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -322,31 +365,25 @@ protected void validateDataFilesExist(TableMetadata base, Long startingSnapshotI
 
       if (matchingOperations.contains(currentSnapshot.operation())) {
         newSnapshots.add(currentSnapshotId);
-        for (ManifestFile manifest : currentSnapshot.dataManifests()) {
-          if (manifest.snapshotId() == (long) currentSnapshotId) {
-            manifests.add(manifest);
+        if (content == ManifestContent.DATA) {
+          for (ManifestFile manifest : currentSnapshot.dataManifests()) {
+            if (manifest.snapshotId() == (long) currentSnapshotId) {

Review comment:
       No, there should never be a null snapshot ID returned from `ManifestFile`. `null` was allowed because manifest lists were added later, but manifest metadata is read any time a table is written so that the manifest data is reliable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677010865



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {

Review comment:
       nit: Do we have to iterate through data files if the index is empty and we did not find any delete manifests? This logic will be triggered on top of various operations including copy-on-write MERGE.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676586299



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots
+   * through the table's initial snapshot are validated.
+   *
+   * @param snapshotId a snapshot ID
+   * @return this for method chaining
+   */
+  RewriteFiles validateFromSnapshot(long snapshotId);

Review comment:
       I see all the `snapshotId` are set to be `ops.current().snapshotId()`  except the test cases ?   I'm thinking is there neccesary to introduce a method that validating since a older history snaphsot id.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677065617



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots
+   * through the table's initial snapshot are validated.
+   *
+   * @param snapshotId a snapshot ID
+   * @return this for method chaining
+   */
+  RewriteFiles validateFromSnapshot(long snapshotId);

Review comment:
       OK, that make sense ! 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#issuecomment-887635306


   Thanks for the reviews, everyone!
   
   @openinx, I agree that we should handle the other case you pointed out. I'm going to open a new PR for that case later today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677033560



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,39 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files that must be applied to the given data files have been added to the table since
+   * a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFiles data files to validate have no new row deletes
+   */
+  protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startingSnapshotId,
+                                                  Iterable<DataFile> dataFiles) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_REPLACED_DATA_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSnapshotId == null ? 0 : base.snapshot(startingSnapshotId).sequenceNumber();
+    DeleteFileIndex deletes = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .afterSequenceNumber(startingSequenceNumber)
+        .specsById(ops.current().specsById())
+        .build();
+
+    for (DataFile dataFile : dataFiles) {

Review comment:
       Oh, that would be way more important than skipping this loop.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676771407



##########
File path: api/src/main/java/org/apache/iceberg/RewriteFiles.java
##########
@@ -54,12 +54,23 @@ default RewriteFiles rewriteFiles(Set<DataFile> filesToDelete, Set<DataFile> fil
   /**
    * Add a rewrite that replaces one set of files with another set that contains the same data.
    *
-   * @param dataFilesToDelete   data files that will be replaced (deleted).
-   * @param deleteFilesToDelete delete files that will be replaced (deleted).
+   * @param dataFilesToReplace   data files that will be replaced (deleted).
+   * @param deleteFilesToReplace delete files that will be replaced (deleted).
    * @param dataFilesToAdd      data files that will be added.
    * @param deleteFilesToAdd    delete files that will be added.
    * @return this for method chaining.
    */
-  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile> deleteFilesToDelete,
+  RewriteFiles rewriteFiles(Set<DataFile> dataFilesToReplace, Set<DeleteFile> deleteFilesToReplace,
                             Set<DataFile> dataFilesToAdd, Set<DeleteFile> deleteFilesToAdd);
+
+  /**
+   * Set the snapshot ID used in any reads for this operation.
+   * <p>
+   * Validations will check changes after this snapshot ID. If the from snapshot is not set, all ancestor snapshots

Review comment:
       I think it is clear from context that the snapshot this refers to is he one identified by the snapshot ID. It may be more clear to change it to "If this method is not called" though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r679711940



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       > We could also build one that compacts just data files, but in order for that operation to be correct it would also need to rewrite delete files (that are pointing to the compacted data files) or use an older sequence number with equality deletes
   
   You mean people will need to know whether the  data-files-only compaction will break the existing data set ? This puts too much burden on users ?  I think we still need to provide the validation because we developers are sure that those cases will break the data correctness ...  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677032700



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, DataOperations.DELETE);
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+      ImmutableSet.of(DataOperations.OVERWRITE);

Review comment:
       I agree with your interpretation of `DataOperations`. If it describes what happens to the table from a logical perspective, then whether we use delete files or not is indeed an implementation detail. I think this detail may be important as we move forward as there will be a bigger focus on the performance. I'd opt for having that flag and being able to distinguish such cases. If we had that, we would be able to skip more snapshots during the validation added in this PR.
   
   But I am sometimes too paranoid about extra work during commits so it may be just me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r676239283



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -297,6 +283,38 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no new delete files have been added to the table since a starting snapshot.

Review comment:
       I think this is no new delete files of _the specified data files_ have been added to the table. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2865: Core: Add validation for row-level deletes with rewrites

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677072072



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       I think we should also handle the case : 
   
   * t1:  table has FILE_A and EQ_DELETE_FILE_B;
   * t2:  RewriteFiles action to convert the EQ_DELETE_FILE_B to POS_DELETE_FILE_C which hold the file_id and offset from FILE_A,  just start the txn but not commit;
   * t3:  DeleteFiles action committed the txn which deletes  the FILE_A;
   * t4:  Try to commit the RewriteFiles action. 
   
   At timestamp `t4`,  the RewriteFiles txn should be aborted because the data file FILE_A which is relied by POS_DELETE_FILE_C has been deleted. 
   
   Do we plan to add this validation in the following PR, or did not consider this case because we currently don't have introduced any RewriteFiles action that converting equality delete files into positional delete files ? 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org