You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/03 18:40:23 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

aokolnychyi opened a new pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069


   This PR adds validation for concurrently added delete files in `RowDelta` and `OverwriteFiles`. Previously, if we had a conflicting row delta operation, we would ignore it and corrupt the table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702198317



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Couldn't this use the existing validation that passes a list of data files that were removed? Why not do that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706545508



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Well, I think we could approach it differently. Let me update and then we can discuss more.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r717035292



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +140,20 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+    if (appendConflictDetectionFilter != null && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+    }
+
+    boolean validateNewDeletes = deleteConflictDetectionFilter != null && base.currentSnapshot() != null;

Review comment:
       > Basing `validateNewDeletes` on whether the conflict detection filter was set doesn't seem correct to me.
   
   If I got you correctly, you are proposing that `validateFromSnapshot` will now indicate whether we should validate delete files. I think that is different compared to how `RowDelta` and `OverwriteFiles` work right now. I'd actually say calling `validateFromSnapshot` is an optimization that tells us from which snapshot to start looking. We never validate new appends if the append conflict detection filter is null. Moreover, it is not always possible to set the starting snapshot. If we start on an empty table, we must validate all snapshots. Here is our copy-on-write commit logic.
   
   ```
   Long scanSnapshotId = scan.snapshotId();
   if (scanSnapshotId != null) {
     overwriteFiles.validateFromSnapshot(scanSnapshotId);
   }
   
   Expression conflictDetectionFilter = conflictDetectionFilter();
   overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
   ``` 
   
   Also, in your snippet, why call `validateNoNewDeletesForDataFiles` if we already know the overwrite filter is set? I think `validateNoNewDeletesForDataFiles` is simply a more efficient version of `validateNoNewDeletes` that can open delete files that match the filter to check their content for conflicts. The problem is that we can use `validateNoNewDeletesForDataFiles` only if we overwrite specific files.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702098625



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       I removed the old version that we said we would remove in 0.11.0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r717035292



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +140,20 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+    if (appendConflictDetectionFilter != null && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+    }
+
+    boolean validateNewDeletes = deleteConflictDetectionFilter != null && base.currentSnapshot() != null;

Review comment:
       > Basing `validateNewDeletes` on whether the conflict detection filter was set doesn't seem correct to me.
   
   If I got you correctly, you are proposing that `validateFromSnapshot` will now indicate whether we should validate delete files. I think that is different compared to how `RowDelta` and `OverwriteFiles` work right now. I'd actually say calling `validateFromSnapshot` is an optimization that tells us from which snapshot to start looking. We never validate new appends if the append conflict detection filter is null. Moreover, it is not always possible to set the starting snapshot. If we start on an empty table, we must validate all snapshots. Here is our copy-on-write commit logic.
   
   ```
   Long scanSnapshotId = scan.snapshotId();
   if (scanSnapshotId != null) {
     overwriteFiles.validateFromSnapshot(scanSnapshotId);
   }
   
   Expression conflictDetectionFilter = conflictDetectionFilter();
   overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
   ``` 
   
   Also, in your snippet, why call `validateNoNewDeletesForDataFiles` if we already know the overwrite filter is set? I think `validateNoNewDeletesForDataFiles` is simply a more efficient version of `validateNoNewDeletes` that can open delete files that match the filter to check their content for conflicts. The problem is that we can use `validateNoNewDeletesForDataFiles` only if we overwrite specific files.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702571487



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       I am not sure we would do anything differently in `ReplacePartitions` if anybody commits new delete files.
   Any particular thoughts you had in mind, @rdblue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r711802574



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently

Review comment:
       Nit: it _may_ undelete rows that were removed concurrently. We don't really know because the overwrite might not.
   
   (Actually, we could check this. If there are no added data files then it was a pure delete)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716264821



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -316,6 +329,57 @@ protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startin
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case-sensitive
+   */
+  protected void validateNoNewDeletes(TableMetadata base, Long startingSnapshotId,

Review comment:
       I think a slightly more accurate name would be `validateNoNewDeleteFiles` since this checks that there aren't any new delete files, but data files could have been deleted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r711803863



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently
+    if (deletedDataFiles.size() > 0) {
+      validateNoNewDeletesForDataFiles(
+          base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);

Review comment:
       By the way, this is a way to add serializable isolation for replacing partitions as well.
   
   @szehon-ho ^^




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702571487



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       I am not sure we would do anything differently in `ReplacePartitions` if anyone commits delete files concurrently. Any particular thoughts you had in mind, @rdblue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705788222



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)
+        .caseSensitive(caseSensitive)
+        .specsById(ops.current().specsById());
+
+    // the starting snapshot could have been expired concurrently
+    if (startingSnapshotId != null && base.snapshot(startingSnapshotId) != null) {
+      Snapshot startingSnapshot = base.snapshot(startingSnapshotId);
+      long startingSequenceNumber = startingSnapshot.sequenceNumber();
+      deleteIndexBuilder.afterSequenceNumber(startingSequenceNumber);
+    }
+
+    DeleteFileIndex deletes = deleteIndexBuilder.build();
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s",

Review comment:
       Yeah, I agree here. It would be great to get the list.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705789335



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)

Review comment:
       Looks like that's https://github.com/apache/iceberg/pull/3071. Thanks for fixing it!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707597364



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702246516



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       Let me check during the weekend




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705788573



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,

Review comment:
       Could you update this to `validateNoAddedDeleteFiles` or something similar? It isn't clear from the method name that this is checking that there aren't any. We may want to rename `validateAddedDataFiles` as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716264198



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +140,20 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+    if (appendConflictDetectionFilter != null && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+    }
+
+    boolean validateNewDeletes = deleteConflictDetectionFilter != null && base.currentSnapshot() != null;

Review comment:
       I think that the behavior here should be slightly different. There are two concerns: 1) whether to check delete files for snapshot isolation and 2) what conflict detection filter to use. Basing `validateNewDeletes` on whether the conflict detection filter was set doesn't seem correct to me.
   
   I don't think there is a case where we don't want to validate delete files if we have called `validateFromSnapshot` to set the base snapshot. I think that we should add this as a boolean field that is set when `validateFromSnapshot` is called.
   
   Then, if we are validating delete files, we should have two separate checks. First, if there are any files in `deletedDataFiles`, then we perform the validation below. If the conflict detection filter wasn't set, then we should use `Expressions.alwaysTrue` to find candidate delete files. Second, if an overwrite filter was set, then we should run `validateNoNewDeletes` with either the delete filter or the delete conflict detection filter. The conflict detection filter should be an optimization, not a way to turn off delete validations.
   
   I think that makes the API more understandable and consistent.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702921054



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I don't think we will be able to use `DataFile` for `RowDelta` that easily as we are using a regular scan and delete writers just give us referenced data file locations. Then I'd say we probably still need the new method here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712770472



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");

Review comment:
       Nit: Should this be `Append conflict detection filter cannot be null` now that we have both appendConflictDetectionFilter and delteConflictDectionFilter?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
-    this.conflictDetectionFilter = newConflictDetectionFilter;
+    this.appendConflictDetectionFilter = newConflictDetectionFilter;
     failMissingDeletePaths();

Review comment:
       Question: Does this call to `failMissingDeletePaths()` still belong here or should it be moved to the `validateNoConflictingDeleteFiles` call?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
-    this.conflictDetectionFilter = newConflictDetectionFilter;
+    this.appendConflictDetectionFilter = newConflictDetectionFilter;
     failMissingDeletePaths();
     return this;
   }
 
+  @Override
+  public OverwriteFiles validateNoConflictingDeleteFiles(Expression newConflictDetectionFilter) {
+    Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");

Review comment:
       Nit: Same comment about saying "Delete conflict detection filter cannot be null` instead of leaving it unqualified and ambiguous.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706866411



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       In fact, if we create a `DataFile`, we can copy the column stats from the `DeleteFile` so that we can do stats overlap comparisons.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702593570



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       We may even keep the old name. We must always validate there are no new deletes for files we read. We can push down the conflict resolution filter if it present but knowing the list of read files should be sufficient.
   
   Thoughts, @rdblue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705769354



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       @szehon-ho is adding support for detecting conflicts in `ReplacePartitions`. The original behavior was basically snapshot isolation like Hive. But you could argue that it would be valuable to support serializable by ensuring the replaced partitions haven't changed since some starting snapshot. This is the PR: https://github.com/apache/iceberg/pull/2925




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712630098



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently
+    if (deletedDataFiles.size() > 0) {
+      validateNoNewDeletesForDataFiles(
+          base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);

Review comment:
       I have updated the PR to match what I described above.

##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -101,7 +101,7 @@ public RewriteFiles validateFromSnapshot(long snapshotId) {
   protected void validate(TableMetadata base) {
     if (replacedDataFiles.size() > 0) {
       // if there are replaced data files, there cannot be any new row-level deletes for those data files
-      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, null, replacedDataFiles, true);

Review comment:
       Added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706542882



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       We only know locations of referenced files. We need actual `DataFile` objects to leverage the delete file index. My question is why we went with locations only in `RowDelta` and whether we should switch to `DataFile` objects instead.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702592756



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I think you mean `validateNoNewDeletesForDataFiles`. We can make that work. Just need to extend it to also accept an optional filter to ignore delete files in other partitions (performance optimization only). 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706540247



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Consider we have a data file A and have a copy-on-write operation that overwrites it with a data file B. If a concurrent operation adds a delete file that references records from the data file A, committing the original overwrite would undelete the rows that were deleted concurrently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r715120545



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       Got it, yea I double-checked the [Postgres doc](https://www.postgresql.org/docs/9.5/transaction-iso.html) for REPEATABLE_READ (snapshot) and it makes sense, deletes and updates to same rows must not be overridden in this mode.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712666524



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       Why are we check for conflicting delete files in snapshot isolation?  (I thought we must not check)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707596856



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       No longer applies.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707607388



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);

Review comment:
       This check is quite trivial. For example, we won't be able to resolve conflicts within the same partition. I have outlined a way to optimize it [here](https://github.com/apache/iceberg/pull/3069#discussion_r707604903).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r715325953



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();

Review comment:
       nit: can combine L384 and L385, `conflictDetectionFilter` only used once

##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);

Review comment:
       A bit late to the whole discussion. Regarding the check, I read the outlined way to optimize it, just want to share some thoughts based on what I am doing for position deletes of my internal distribution as of today. 
   
   In my system, each position delete file written contains exactly 1 `file_path` value, which avoids the requirement from the spec to sort by file path and also greatly simplifies the validation during concurrent commits, because each check can easily find all position deletes of each data file and check against just the position min max to see if there is any potential overlapping of the position range. Of course this cannot be applied to a general use case, it was implemented just to see what can be achieved with a closed system where all delete writers only write that specific type of position delete file.
   
   When I started to compact position delete files to contain multiple `file_path` values, it becomes very easy to have false-negatives, especially in the object storage mode where the `file_path` min and max does not really mean anything anymore. So at least from the object storage use case, secondary index with much better file skipping ability is a must have to make the strategy described truly work efficiently.

##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -316,6 +329,57 @@ protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startin
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case-sensitive
+   */
+  protected void validateNoNewDeletes(TableMetadata base, Long startingSnapshotId,
+                                      Expression dataFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSequenceNumber(base, startingSnapshotId);
+    DeleteFileIndex deletes = buildDeleteFileIndex(deleteManifests, startingSequenceNumber, dataFilter, caseSensitive);
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s: %s",
+        dataFilter, Iterables.transform(deletes.referencedDeleteFiles(), ContentFile::path));
+  }
+
+  // use 0 as a starting seq number if the starting snapshot is not set or expired
+  private long startingSequenceNumber(TableMetadata metadata, Long staringSnapshotId) {
+    if (staringSnapshotId != null && metadata.snapshot(staringSnapshotId) != null) {
+      Snapshot startingSnapshot = metadata.snapshot(staringSnapshotId);
+      return startingSnapshot.sequenceNumber();
+    } else {
+      return 0;

Review comment:
       nit: can use `TableMetadata.INITIAL_SEQUENCE_NUMBER` and remove the comment




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r710392877



##########
File path: core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
##########
@@ -86,6 +86,20 @@ public boolean isEmpty() {
     return (globalDeletes == null || globalDeletes.length == 0) && sortedDeletesByPartition.isEmpty();
   }
 
+  public List<DeleteFile> referencedDeleteFiles() {
+    List<DeleteFile> deleteFiles = Lists.newArrayList();

Review comment:
       Will do!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707598205



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)
+        .caseSensitive(caseSensitive)
+        .specsById(ops.current().specsById());
+
+    // the starting snapshot could have been expired concurrently
+    if (startingSnapshotId != null && base.snapshot(startingSnapshotId) != null) {
+      Snapshot startingSnapshot = base.snapshot(startingSnapshotId);
+      long startingSequenceNumber = startingSnapshot.sequenceNumber();
+      deleteIndexBuilder.afterSequenceNumber(startingSequenceNumber);
+    }
+
+    DeleteFileIndex deletes = deleteIndexBuilder.build();
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s",

Review comment:
       Added a list. Could you check, @szehon-ho?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702923194



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       In the example above, txn1 will fail as FILE_A is not there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706865999



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       The reason why we only have file locations in `RowDelta` is that we get the set of referenced files from writing position deletes, which are `(location, position)`. We don't require carrying `DataFile` through delta operations.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716265375



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -365,6 +365,7 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
 
       Expression conflictDetectionFilter = conflictDetectionFilter();
       overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       The check looks good to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r710383644



##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -111,4 +111,19 @@
    * @return this for method chaining
    */
   RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method must be called when the table is queried to produce a row delta for UPDATE and
+   * MERGE operations independently of the isolation level. Calling this method isn't required
+   * for DELETE operations as it is OK when a particular record we are trying to delete
+   * was deleted concurrently.
+   * <p>
+   * Validation applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  RowDelta validateNoConflictingDeleteFiles(Expression conflictDetectionFilter);

Review comment:
       I did that in the first place but then I started to worry it may be confusing. For example, we refer here to concurrently added delete files vs concurrently happened delete operations that removed data files.
   
   I do prefer consistency too but I am not sure whether it is confusing. What do you think, @szehon-ho?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706540247



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Consider we have a data file A and have a copy-on-write operation that overwrites it with a data file B. If a concurrent operation adds a delete file that references records from the data file A, committing the original copy-on-write operation (i.e. overwrite) would undelete the rows that were deleted concurrently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712629711



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I think we have discussed it on another thread. Resolving.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712631014



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -365,6 +365,7 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
 
       Expression conflictDetectionFilter = conflictDetectionFilter();
       overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       @rdblue, could you double check this place?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712619700



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently

Review comment:
       I think we should automatically convert overwrites without added files to deletes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705788790



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)

Review comment:
       This is a good optimization. I think we should see if we can add it to the file-based check as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707605528



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,

Review comment:
       I went for `validateNoNewDeletes` to match `validateNoNewDeletesForDataFiles`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702477592



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)
+        .caseSensitive(caseSensitive)
+        .specsById(ops.current().specsById());
+
+    // the starting snapshot could have been expired concurrently
+    if (startingSnapshotId != null && base.snapshot(startingSnapshotId) != null) {
+      Snapshot startingSnapshot = base.snapshot(startingSnapshotId);
+      long startingSequenceNumber = startingSnapshot.sequenceNumber();
+      deleteIndexBuilder.afterSequenceNumber(startingSequenceNumber);
+    }
+
+    DeleteFileIndex deletes = deleteIndexBuilder.build();
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s",

Review comment:
       For debugging some complex concurrency use case, we find the list of data files in exception message very helpful to identify where it's coming from.  Is a similar thing possible with delete files (or at least print a sample of them)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702922722



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       This use case is already covered as we always guarantee all files we overwrite in `OverwriteFiles` still exist. It is literally new records (added, not rewritten) we trying to catch here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702100320



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -64,7 +64,7 @@
   private static final Set<String> VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
   // delete files can be added in "overwrite" or "delete" operations
-  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+  private static final Set<String> VALIDATE_ADDED_DELETE_FILES_OPERATIONS =

Review comment:
       I renamed it to match `VALIDATE_ADDED_FILES_OPERATIONS `.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707597962



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       This point is important. Let me know if I got you correctly, @rdblue. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702100080



##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -109,6 +109,45 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter) {

Review comment:
       Alternatives are welcome.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702201022



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {

Review comment:
       Agreed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702593942



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Basically, keep `validateNoConflictingAppends` as is and always check for new delete files that may match the overwritten data files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702571487



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       I am not sure we would do anything differently in `ReplacePartitions` if anyone commits delete files concurrently.
   Any particular thoughts you had in mind, @rdblue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702772776



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+  OverwriteFiles validateNoConflictingOperations(Expression conflictDetectionFilter);

Review comment:
       Looks like this method is designed to validate whether the copy-on-write batch update/delete operations will be conflicted with the RowDelta operations ?  I will need some time to read the whole copy-on-write code path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r709735067



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -316,6 +316,58 @@ protected void validateNoNewDeletesForDataFiles(TableMetadata base, Long startin
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param dataFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case-sensitive
+   */
+
+  protected void validateNoNewDeletes(TableMetadata base, Long startingSnapshotId,
+                                      Expression dataFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    long startingSequenceNumber = startingSequenceNumber(base, startingSnapshotId);
+    DeleteFileIndex deletes = buildDeleteFileIndex(deleteManifests, startingSequenceNumber, dataFilter, caseSensitive);
+
+    ValidationException.check(deletes.isEmpty(),

Review comment:
       Thanks for adding this!

##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -111,4 +111,19 @@
    * @return this for method chaining
    */
   RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method must be called when the table is queried to produce a row delta for UPDATE and
+   * MERGE operations independently of the isolation level. Calling this method isn't required
+   * for DELETE operations as it is OK when a particular record we are trying to delete
+   * was deleted concurrently.
+   * <p>
+   * Validation applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  RowDelta validateNoConflictingDeleteFiles(Expression conflictDetectionFilter);

Review comment:
       Nit: should we make it a bit consistent with above?  (ie, omit 'files' from the name)

##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);

Review comment:
       Do you mean it cannot resolve within same data file (I thought we are passing data filter)?  Or within the same partition?
   
   And also for my learning, you mean it will be over-aggressive and report false negatives even if rows do not actually conflict, until we make the optimization.

##########
File path: core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
##########
@@ -86,6 +86,20 @@ public boolean isEmpty() {
     return (globalDeletes == null || globalDeletes.length == 0) && sortedDeletesByPartition.isEmpty();
   }
 
+  public List<DeleteFile> referencedDeleteFiles() {
+    List<DeleteFile> deleteFiles = Lists.newArrayList();

Review comment:
       Optional comment: small optimization can be done by knowing the initial length, and checking isEmpty
   ```
   if (isEmpty()) {
      return Lists.empty()
   else
      List<DeleteFile> deleteFiles = Lists.newArrayList(globalDeletes.length + sortedDeletesByPartition.length);
      ...
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#issuecomment-923508735


   Ready for another round.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705784207



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Yeah, I think it would be good to keep the two separate. That would allow fine-grained control of validation and avoid the need to rename the validation config method -- just add a second one. Would we need to validate there are no conflicting deletes in all cases? Seems like we wouldn't necessarily need to if using snapshot isolation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r715198829



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)
+        .caseSensitive(caseSensitive)
+        .specsById(ops.current().specsById());
+
+    // the starting snapshot could have been expired concurrently
+    if (startingSnapshotId != null && base.snapshot(startingSnapshotId) != null) {
+      Snapshot startingSnapshot = base.snapshot(startingSnapshotId);
+      long startingSequenceNumber = startingSnapshot.sequenceNumber();
+      deleteIndexBuilder.afterSequenceNumber(startingSequenceNumber);
+    }
+
+    DeleteFileIndex deletes = deleteIndexBuilder.build();
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s",

Review comment:
       Resolving as this has been addressed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi closed pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi closed pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702099883



##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -109,6 +109,45 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
+
+  /**
+   * Enables validation that no concurrent operation conflicts with this commit.
+   * <p>
+   * This method should be called when the table is queried to determine which records to delete/append.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, this operation will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for update/delete operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  RowDelta validateNoConflictingOperations(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that no new delete files have been added by a concurrent operations with
+   * deletes that may match the filter in {@link #validateNoConflictingOperations(Expression)}.
+   * <p>
+   * This method should be called when the table is queried to produce a row delta during
+   * UPDATE and MERGE operations. If a concurrent operation commits a new delete file,
+   * records that were read are no longer valid.
+   * <p>
+   * Calling this method isn't required during DELETE operations as it is ok when a particular
+   * record we are trying to delete was deleted concurrently.
+   *
+   * @return this for method chaining
+   */
+  RowDelta validateNoConflictingDeleteFiles();

Review comment:
       The behavior in `RowDelta` is slightly different compared to `OverwriteFiles`. If we are doing `DELETE`, it should be fine if there is another delete file added concurrently. It is just `UPDATE` and `MERGE` we have to validate.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702246323



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Could you point to the method you refer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702201057



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       Do we also need to update `ReplacePartitions`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702826793



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Q:  Should the OverwriteFiles also conflict with the REPLACE operation when checking the added data files in `validateAddedDataFiles` ?  Take the following example: 
   
   1. The table has 2 data files:  FILE_A,  FILE_B; 
   2. Start the CopyOnWrite to delete all rows in FILE_A.  Let's say txn1,  it's just started but not committed; 
   3. Someone start the REPLACE txn to rewrite the FILE_A + FILE_B to FILE_C.  Let's say txn2 , started but not committed; 
   4.  Committed the txn2; 
   5.  Committed the txn1
   
   Finally,  we will see all the rows from FILE_A again because the REPLACE operation has added them to table again.
   
   I raise this question because I see the [validateAddedDataFiles](https://github.com/apache/iceberg/pull/3069/files#diff-f2d4265e007483f23a69703b88348743aecf99cc98395cd1730de3900a609eb6R261) only check the `APPEND` & `OVERWRITE` operations ( it's VALIDATE_ADDED_FILES_OPERATIONS ).  So `REPLACE` operations should be also considered ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r710380785



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);

Review comment:
       > you mean it will be over-aggressive and report false negatives even if rows do not actually conflict, until we make the optimization.
   
   Yeah, it may report false positives. The data filter is helpful but I think it won't help much within the same partition. Position deletes are scoped to a partition so the data filter should help us when there is a concurrent delete in another partition. Within the partition, though, most of position deletes will match that row filter as we don't persist the deleted row (by default).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707596601



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {

Review comment:
       I changed this part as there are cases when we want to validate delete files but not data files. For example, a merge-on-read UPDATE with snapshot isolation. Now we have two methods and no renaming.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716266320



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +140,20 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+    if (appendConflictDetectionFilter != null && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+    }
+
+    boolean validateNewDeletes = deleteConflictDetectionFilter != null && base.currentSnapshot() != null;

Review comment:
       Here's what I changed this to locally while thinking through it:
   
   ```java
       // validateDeletes is set to true in validateFromSnapshot. Maybe we should default it if that method isn't called?
       if (validateDeletes) {
         if (deletedDataFiles.size() > 0) {
           validateNoNewDeletesForDataFiles(
               base, startingSnapshotId, deleteConflictDetectionFilter,
               deletedDataFiles, caseSensitive);
         }
   
         if (rowFilter() != Expressions.alwaysFalse()) {
           if (deleteConflictDetectionFilter != null) {
             validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);
           } else {
             validateNoNewDeletes(base, startingSnapshotId, rowFilter(), caseSensitive);
           }
         }
       }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712666524



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       Maybe mistaken, but why are we check for conflicting delete files in snapshot isolation?  (I thought we must not check)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702595109



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I won't be able to use `validateNoConflictingAppends` here as `RowDelta` works with paths only.
   What about using `DataFile` in `validateDataFilesExist`, @rdblue?
   Any particular reasons for using paths only?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#issuecomment-912741943


   cc @openinx @rdblue @RussellSpitzer @jackye1995 @yyanyy @flyrain @kbendick @karuppayya @raptond


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702246323



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Could you point to the method you refer to?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#issuecomment-932473088


   Closing this as it was split into multiple PRs and merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702578508



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)
+        .caseSensitive(caseSensitive)
+        .specsById(ops.current().specsById());
+
+    // the starting snapshot could have been expired concurrently
+    if (startingSnapshotId != null && base.snapshot(startingSnapshotId) != null) {
+      Snapshot startingSnapshot = base.snapshot(startingSnapshotId);
+      long startingSequenceNumber = startingSnapshot.sequenceNumber();
+      deleteIndexBuilder.afterSequenceNumber(startingSequenceNumber);
+    }
+
+    DeleteFileIndex deletes = deleteIndexBuilder.build();
+
+    ValidationException.check(deletes.isEmpty(),
+        "Found new conflicting delete files that can apply to records matching %s",

Review comment:
       It is not something possible right away but I can try to expose a list of delete files from the delete file index.
   I agree that will be useful. I'll update.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702098451



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {

Review comment:
       I am not a fan of renaming but I think the old name would be really confusing now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702595109



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I won't be able to use `validateNoConflictingAppends` here as `RowDelta` works with positions only.
   What about using `DataFile` in `validateDataFilesExist`, @rdblue?
   Any particular reasons for using paths only?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706540300



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead

Review comment:
       I forgot about that PR. I'll take a look on Monday.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r705787905



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       If I understand correctly, the motivation for updating `RowDelta` is the case where we have two concurrent delta commits? So an UPDATE and a MERGE at the same time might both rewrite a row, which could cause a duplicate:
   
   ```sql
   INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c');
   
   -- running these concurrently causes a problem
   UPDATE t SET data = 'x' WHERE id = 1;
   UPDATE t SET data = 'y' WHERE id = 1;
   ```
   
   If I ran the updates concurrently, both would delete id=1 and both would add a new file with `(1, 'x')` and `(1, 'y')` right?
   
   The validation here is that the file created by the initial insert doesn't have any new delete files written against it. It seems like we want to just call `validateNoNewDeletesForDataFiles` and pass `referencedFiles` in, right? Maybe I'm missing something?
   
   We might want to make this a separate issue to keep changes smaller and reviews easier.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702144413



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       We may avoid removing this in this commit if we plan to cherry-pick this change into 0.12.1.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702594443



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -129,6 +119,7 @@ protected void validate(TableMetadata base) {
 
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       Validating concurrent appends is optional (serializable vs snapshot) while validating concurrent deletes is required to avoid corrupting the table.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707693964



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -283,6 +283,44 @@ protected void validateAddedDataFiles(TableMetadata base, Long startingSnapshotI
     }
   }
 
+  /**
+   * Validates that no delete files matching a filter have been added to the table since a starting snapshot.
+   *
+   * @param base table metadata to validate
+   * @param startingSnapshotId id of the snapshot current at the start of the operation
+   * @param conflictDetectionFilter an expression used to find new conflicting delete files
+   * @param caseSensitive whether expression evaluation should be case sensitive
+   */
+  protected void validateAddedDeleteFiles(TableMetadata base, Long startingSnapshotId,
+                                          Expression conflictDetectionFilter, boolean caseSensitive) {
+    // if there is no current table state, no files have been added
+    if (base.currentSnapshot() == null) {
+      return;
+    }
+
+    Pair<List<ManifestFile>, Set<Long>> history =
+        validationHistory(base, startingSnapshotId, VALIDATE_ADDED_DELETE_FILES_OPERATIONS, ManifestContent.DELETES);
+    List<ManifestFile> deleteManifests = history.first();
+
+    DeleteFileIndex.Builder deleteIndexBuilder = DeleteFileIndex.builderFor(ops.io(), deleteManifests)
+        .filterData(conflictDetectionFilter)

Review comment:
       Done and merged. Resolving this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r711803754



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently
+    if (deletedDataFiles.size() > 0) {
+      validateNoNewDeletesForDataFiles(
+          base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);

Review comment:
       I'm not sure that I agree that we only need to check the files that were directly deleted.
   
   For example, you can use Spark to replace an expression using `DataFrameWriterV2`:
   
   ```scala
   df.writeTo(t).overwrite($"date" === "2021-10-01")
   ```
   
   Whether the deleted files should be checked for row-level deletes depends on whether the written `df` is a result of reading the table. If you're performing your own merge, then it should be. I realize that there's not currently a way to enable the validation, but we could support write properties for this eventually:
   
   ```scala
   df.writeTo(t)
       .option("validate-snapshot-id", 1234L) // use serializable isolation
       .overwrite($"date" === "2021-10-01")
   ```
   
   Is this something that we don't need to support right now because no one is calling it? I think we should still consider adding the validation for later.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712522599



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently
+    if (deletedDataFiles.size() > 0) {
+      validateNoNewDeletesForDataFiles(
+          base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);

Review comment:
       Okay, if it is something we want to support, it means there are 3 cases:
   
   - Case 1: copy-on-write MERGE -> we must validate deletes.
   - Case 2: overwrite by filter with serializable isolation -> we must validate deletes.
   - Case 3: overwrite by filter with snapshot isolation -> we must NOT validate deletes.
   
   Since delete validation becomes optional, it probably means we need a separate method to trigger it.
   Thoughts, @rdblue @szehon-ho?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716265608



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       I agree. Deletes should always be validated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712524811



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently
+    if (deletedDataFiles.size() > 0) {
+      validateNoNewDeletesForDataFiles(
+          base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, caseSensitive);

Review comment:
       Basically, `OverwriteFiles` will now match `RowDelta` that has a separate method to trigger the validation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r706865999



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       The reason why we only have file locations in `RowDelta` is that we get the set of referenced files from writing position deletes, which are `(location, position)`. We don't require carrying `DataFile` through delta operations.
   
   We could probably wrap the location in a dummy `DataFile`, or add a way to query the `DeleteFileIndex` by location.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702772776



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+  OverwriteFiles validateNoConflictingOperations(Expression conflictDetectionFilter);

Review comment:
       Looks like this method is designed to validate whether the copy-on-write batch overwrite operations will be conflicted with the RowDelta operations ?  I will need some time to read the whole copy-on-write code path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702197120



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.

Review comment:
       Nit: This is missing a "that" before "happened". But then it sounds odd to me, so I'd recommend "that happened after" instead of "happened since".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702595109



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I won't be able to use `validateNoConflictingAppends` here as `RowDelta` works with positions only.
   What about using `DataFile` in `validateDataFilesExist`, @rdblue? Any particular reasons for using paths only?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r702792049



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -122,27 +122,30 @@
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated this will be removed in 0.14.0;
+   *             use {@link #validateNoConflictingOperations(Expression)} instead
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    return validateNoConflictingOperations(conflictDetectionFilter);
+  }
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that no concurrent operation conflicts with this commit.
    * <p>
    * This method should be called when the table is queried to determine which files to delete/append.
-   * If a concurrent operation commits a new file after the data was read and that file might
-   * contain rows matching the specified conflict detection filter, the overwrite operation
-   * will detect this during retries and fail.
+   * If a concurrent operation commits a new data or delete file after the table was queried and
+   * that file might contain either new records or relevant deletes matching the specified conflict
+   * detection filter, the overwrite operation will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
    * serializable isolation for eager update/delete operations. Otherwise, the isolation level
    * will be snapshot isolation.
+   * <p>
+   * Validation applies to operations happened since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
-   * @param readSnapshotId the snapshot id that was used to read the data or null if the table was empty
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
-   * @deprecated this will be removed in 0.11.0;
-   *             use {@link #validateNoConflictingAppends(Expression)} and {@link #validateFromSnapshot(long)} instead
    */
-  @Deprecated
-  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+  OverwriteFiles validateNoConflictingOperations(Expression conflictDetectionFilter);

Review comment:
       Okay,  any operations adding the data/delete files that match the scan files from copy-on-write should be conflicted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r714413876



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");

Review comment:
       I'll update if it fits on the same line.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r714413810



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -375,6 +376,14 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
     private void commitWithSnapshotIsolation(OverwriteFiles overwriteFiles,
                                              int numOverwrittenFiles,
                                              int numAddedFiles) {
+      Long scanSnapshotId = scan.snapshotId();
+      if (scanSnapshotId != null) {
+        overwriteFiles.validateFromSnapshot(scanSnapshotId);
+      }
+
+      Expression conflictDetectionFilter = conflictDetectionFilter();
+      overwriteFiles.validateNoConflictingDeleteFiles(conflictDetectionFilter);

Review comment:
       Consider we have a data file A and there is a copy-on-write operation that overwrites it with a data file B. If a concurrent operation adds a delete file that references records from the data file A, committing the original copy-on-write operation (i.e. overwrite) would undelete the rows that were deleted concurrently.
   
   It seems we always have to validate the delete files whenever we overwrite specific files during DELETE/MERGE.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r715340822



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -94,9 +102,12 @@ protected void validate(TableMetadata base) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
-      if (conflictDetectionFilter != null) {
-        validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+      if (appendConflictDetectionFilter != null) {
+        validateAddedDataFiles(base, startingSnapshotId, appendConflictDetectionFilter, caseSensitive);
+      }
+
+      if (deleteConflictDetectionFilter != null) {
+        validateNoNewDeletes(base, startingSnapshotId, deleteConflictDetectionFilter, caseSensitive);

Review comment:
       A bit late to the whole discussion. Regarding the check, I read the outlined way to optimize it, just want to share some thoughts based on what I am doing for position deletes of my internal distribution as of today. 
   
   In my system, each position delete file written contains exactly 1 `file_path` value, which avoids the requirement from the spec to sort by file path and also greatly simplifies the validation during concurrent commits, because each check can easily find all position deletes of each data file and check against just the position min max to see if there is any potential overlapping of the position range. Of course this cannot be applied to a general use case, it was implemented just to see what can be achieved with a closed system where all delete writers only write that specific type of position delete file.
   
   When I started to compact position delete files to contain multiple `file_path` values, it becomes very easy to have false-positives, especially in the object storage mode where the `file_path` min and max does not really mean anything anymore. So at least from the object storage use case, secondary index with much better file skipping ability is a must have to make the strategy described truly work efficiently.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712626652



##########
File path: core/src/main/java/org/apache/iceberg/DeleteFileIndex.java
##########
@@ -86,6 +86,20 @@ public boolean isEmpty() {
     return (globalDeletes == null || globalDeletes.length == 0) && sortedDeletesByPartition.isEmpty();
   }
 
+  public List<DeleteFile> referencedDeleteFiles() {
+    List<DeleteFile> deleteFiles = Lists.newArrayList();

Review comment:
       I was about to implement this but then I realized `sortedDeletesByPartition.length` is not accurate as each entry contains an array of delete files. In order to compute a right estimate, I'd need to iterate through the map.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r707604903



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -81,23 +82,32 @@ public RowDelta validateDataFilesExist(Iterable<? extends CharSequence> referenc
   }
 
   @Override
-  public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilter) {
+  public RowDelta validateNoConflictingOperations(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
     this.conflictDetectionFilter = newConflictDetectionFilter;
     return this;
   }
 
+  @Override
+  public RowDelta validateNoConflictingDeleteFiles() {
+    this.validateNoConflictingDeleteFiles = true;
+    return this;
+  }
+
   @Override
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
         validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
       }
 
-      // TODO: does this need to check new delete files?
       if (conflictDetectionFilter != null) {
         validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
       }
+
+      if (conflictDetectionFilter != null && validateNoConflictingDeleteFiles) {
+        validateAddedDeleteFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);

Review comment:
       I had a chance to think more about the required validation. While using `DataFile`s instead of locations would give us min/max filtering, we can probably do better than that if all delete files are position-based.
   
   We can do something like this if we are working with position-based delete files:
   - Read all position deletes we are trying to commit and build a map with file location -> a list of deleted positions.
   - Iterate through concurrently added delete files applying min/max filtering (secondary indexes in the future) and find which may have conflicts.
   - Read files that potentially conflict and verify that (file, pos) pairs don't overlap.
   
   We may need to add some limit on the overall size of deletes we can scan but we can figure that out.
   Overall, this will allow us to resolve conflicts within the same partition.
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r714414002



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -95,11 +101,18 @@ public OverwriteFiles caseSensitive(boolean isCaseSensitive) {
   @Override
   public OverwriteFiles validateNoConflictingAppends(Expression newConflictDetectionFilter) {
     Preconditions.checkArgument(newConflictDetectionFilter != null, "Conflict detection filter cannot be null");
-    this.conflictDetectionFilter = newConflictDetectionFilter;
+    this.appendConflictDetectionFilter = newConflictDetectionFilter;
     failMissingDeletePaths();

Review comment:
       I'll double check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r716261496



##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -111,4 +111,19 @@
    * @return this for method chaining
    */
   RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method must be called when the table is queried to produce a row delta for UPDATE and
+   * MERGE operations independently of the isolation level. Calling this method isn't required
+   * for DELETE operations as it is OK when a particular record we are trying to delete

Review comment:
       Nit: use of "we" in javadoc is unnecessary. It is simpler to say "it is OK to delete a record that is also deleted concurrently".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r710501248



##########
File path: api/src/main/java/org/apache/iceberg/RowDelta.java
##########
@@ -111,4 +111,19 @@
    * @return this for method chaining
    */
   RowDelta validateNoConflictingAppends(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method must be called when the table is queried to produce a row delta for UPDATE and
+   * MERGE operations independently of the isolation level. Calling this method isn't required
+   * for DELETE operations as it is OK when a particular record we are trying to delete
+   * was deleted concurrently.
+   * <p>
+   * Validation applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  RowDelta validateNoConflictingDeleteFiles(Expression conflictDetectionFilter);

Review comment:
       Yes fine with me then, thanks for clarifying




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r712629926



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
       }
     }
 
+    // validating concurrently added data files is optional and is only needed to achieve serializable isolation
     if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
       validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
     }
+
+    // validating concurrently added delete files is required whenever we overwrite specific data files
+    // if we find a new delete file matching one of the data files we are trying to overwrite,
+    // we must fail this operation as it would undelete rows that were removed concurrently

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3069: Core: Validate conflicting delete files in RowDelta and OverwriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r711803984



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -101,7 +101,7 @@ public RewriteFiles validateFromSnapshot(long snapshotId) {
   protected void validate(TableMetadata base) {
     if (replacedDataFiles.size() > 0) {
       // if there are replaced data files, there cannot be any new row-level deletes for those data files
-      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, null, replacedDataFiles, true);

Review comment:
       It would be nice to have an overload with the original args, since it isn't obvious what `true` is doing here. It's an internal API, but an overload still makes this more readable.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org