You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/28 20:29:12 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

aokolnychyi opened a new pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199


   This PR validates concurrently added delete files in `BaseOvewriteFiles`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720373404



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,31 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles) {
+      validateAddedDataFiles(base, startingSnapshotId, dataConflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, rowFilter(), caseSensitive);

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719927387



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I am not a big fan of negation inside conditions but it is purely a personal thing :) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718527924



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Shouldn't we be using this return value with validateNoConflictingDataFiles?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718924611



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if this is not set?  Should we mention it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720346080



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       I'm fine with it either way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719927572



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Changed the logic to behave a bit differently now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719932198



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       OK fair enough :)

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and

Review comment:
       Yea its a lot better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718527924



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Shouldn't we be using this return value with validateNoConflictingAppends?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#issuecomment-932472142


   Thanks for reviewing, @RussellSpitzer @rdblue @szehon-ho!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718929996



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       Can we use deletedDataFiles.isEmpty for consistency with below?

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if conflictDetectionFilter is not set? Should we mention it?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Takes a bit of time to understand this if - else, especially as conflictDetectionFilter() itself returns conditionally.  Wonder if it can be simplified any way. 
   
   Tracing through the many cases, some cases seem weird, eg:
   
   If (rowFilter != false && !deletedDataFiles.isEmpty && conflictDetectionFilter == true), seems we will call validateNoNewDeleteFiles with conflictDetectionFilter, is it intended?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718939543



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.

Review comment:
       Nit: no need for "during retries" because this is checked every try, not just during retries.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718937868



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.

Review comment:
       This isn't quite accurate because the row filter will be used.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.

Review comment:
       This isn't quite accurate because the row filter will be used if it was set by `overwriteByRowFilter`.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       Can you remove this as well? Looks like it was supposed to be removed already.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDeleteFiles();

Review comment:
       I'd prefer to call this `validateNoConflictingDeletes()` because it will call `failMissingDeletePaths()` to check that data files that are being replaced weren't removed and will also validate there are no new delete files. That's checking all deletes, not just delete files.
   
   I think that's the right behavior, too. We don't want to separate this into multiple validations because I don't think there is a case where you'd want to validate just deleted data files or just delete files.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();

Review comment:
       To be consistent with renaming `validateNoConflictingDeleteFiles` to `validateNoConflictingDeletes` how about renaming this to `validateNoConflictingData`?

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.

Review comment:
       Nit: no need for "during retries" because this is checked every try, not just during retries.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.

Review comment:
       Serializable and snapshot isolation? Maybe just say "required to maintain isolation" for non-idempotent commits.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,

Review comment:
       How about "if a concurrent operation deletes data in one of the files being overwritten" to cover both an added delete file and deleted data files?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {

Review comment:
       Isn't the `base.currentSnapshot()` check done in `validateAddedDataFiles`? If so, we can remove it here.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I think that this should not be an `else if`. Instead, it should be an independent `if`. A caller may delete specific files _and_ set a deletion filter and we need to check both cases.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       When would a user want to make that distinction? I think that it comes down to knowing how the underlying check is implemented, but we don't want users making different calls because they want the underlying system to optimize itself. That seems like it will get users into correctness trouble.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       These are independent ways to configure the overwrite. There is no guarantee that checking the delete filter will also check the specific data files that were deleted.
   
   I agree with you that when we call `validateNoNewDeletesForDataFiles` we should only pass the actual conflict detection filter and not the row filter. That row filter may not be relevant for the deleted files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718940384



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {

Review comment:
       Isn't the `base.currentSnapshot()` check done in `validateAddedDataFiles`? If so, we can remove it here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718940146



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,

Review comment:
       How about "if a concurrent operation deletes data in one of the files being overwritten" to cover both an added delete file and deleted data files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720372745



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Yeah, it looked weird. Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720351924



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -372,7 +372,9 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
       }
 
       Expression conflictDetectionFilter = conflictDetectionFilter();
-      overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
+      overwriteFiles.conflictDetectionFilter(conflictDetectionFilter);
+      overwriteFiles.validateNoConflictingData();
+      overwriteFiles.validateNoConflictingDeletes();

Review comment:
       @aokolnychyi, shouldn't this check whether the operation is a delete? If this is invoked by `DELETE FROM` then we don't need to validate conflicting deletes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719926328



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Not sure I got you. Could you clarify, @RussellSpitzer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718939214



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();

Review comment:
       To be consistent with renaming `validateNoConflictingDeleteFiles` to `validateNoConflictingDeletes` how about renaming this to `validateNoConflictingData`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718943483



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Takes a bit of time to understand this if - else, especially as conflictDetectionFilter() itself returns conditionally.  Wonder if it can be simplified any way. 
   
   Tracing through the many cases,I was wondering some cases, eg:
   
   If (rowFilter != false && !deletedDataFiles.isEmpty && conflictDetectionFilter == true), seems we will call validateNoNewDeleteFiles with conflictDetectionFilter, is it intended?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718939020



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDeleteFiles();

Review comment:
       I'd prefer to call this `validateNoConflictingDeletes()` because it will call `failMissingDeletePaths()` to check that data files that are being replaced weren't removed and will also validate there are no new delete files. That's checking all deletes, not just delete files.
   
   I think that's the right behavior, too. We don't want to separate this into multiple validations because I don't think there is a case where you'd want to validate just deleted data files or just delete files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719927135



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       Updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719926827



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.

Review comment:
       It described how it worked before. I've changed the logic and removed this sentence too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718924611



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if this is not set?  Should we mention it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718944053



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I interpret it like this: call `validateNoNewDeletesForDataFiles` iff we replace only individual files as it is more precise than checking if there are any deletes matching the filter. Call `validateNewDeleteFiles` in all other cases (like there are both individual files replaced and the overwrite filter is set). When calling `validateNewDeleteFiles`, we can use `rowFilter()` as the conflict detection iff we don't replace any individual files (i.e. `deletedDataFiles.isEmpty()`).
   
   In my view, `validateNoNewDeleteFiles` is a stronger guarantee than `validateNoNewDeletesForDataFiles`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719926418



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       I'll clean up all interfaces in the api module in a separate PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718944053



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I interpret it like this: call `validateNoNewDeletesForDataFiles` iff we replace only individual files as it more precise than checking if there are any deletes matching the filter. Call `validateNewDeleteFiles` in all other cases (like there are both individual files replaced and the overwrite filter is set). When calling `validateNewDeleteFiles`, we can use `rowFilter()` as the conflict detection iff we don't replace any individual files (i.e. `deletedDataFiles.isEmpty()`).
   
   In my view, `validateNoNewDeleteFiles` is a stronger guarantee than `validateNoNewDeletesForDataFiles`.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I interpret it like this: call `validateNoNewDeletesForDataFiles` iff we replace only individual files as it is more precise than checking if there are any deletes matching the filter. Call `validateNewDeleteFiles` in all other cases (like there are both individual files replaced and the overwrite filter is set). When calling `validateNewDeleteFiles`, we can use `rowFilter()` as the conflict detection iff we don't replace any individual files (i.e. `deletedDataFiles.isEmpty()`).
   
   In my view, `validateNoNewDeleteFiles` is a stronger guarantee than `validateNoNewDeletesForDataFiles`.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       If so, there is no need to call `validateNoNewDeletesForDataFiles` if the other one is called. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719932306



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Great, this looks a lot better and easier to understand now




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718945090



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       When would a user want to make that distinction? I think that it comes down to knowing how the underlying check is implemented, but we don't want users making different calls because they want the underlying system to optimize itself. That seems like it will get users into correctness trouble.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720372108



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
##########
@@ -372,7 +372,9 @@ private void commitWithSerializableIsolation(OverwriteFiles overwriteFiles,
       }
 
       Expression conflictDetectionFilter = conflictDetectionFilter();
-      overwriteFiles.validateNoConflictingAppends(conflictDetectionFilter);
+      overwriteFiles.conflictDetectionFilter(conflictDetectionFilter);
+      overwriteFiles.validateNoConflictingData();
+      overwriteFiles.validateNoConflictingDeletes();

Review comment:
       I think that can only be done for merge-on-read. If I delete file_A with copy-on-write and overwrite it with file_B, I should still check no deletes happened for file_A, otherwise I'll undelete records.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719926898



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDeleteFiles();

Review comment:
       Sounds good to me. Updated.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();

Review comment:
       Done.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.

Review comment:
       Fixed.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.

Review comment:
       Updated.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,

Review comment:
       Updated.

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {

Review comment:
       Yep, updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719926600



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and

Review comment:
       If it is not set, the validation is tricky. I added a few sentences. Let me know if that makes sense, @szehon-ho.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718944053



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I interpret it like this: call `validateNoNewDeletesForDataFiles` iff we replace only individual files as it more precise than checking if there are any deletes matching the filter. Call `validateNewDeleteFiles` in all other cases (like there are both individual files replaced and the overwrite filter is set). When calling `validateNewDeleteFiles`, we can use `rowFilter()` as the conflict detection iff we don't replace any individual files (i.e. `deletedDataFiles.isEmpty()`).
   
   In my view, `validateNoNewDeleteFiles` is a stronger guarantee than `validateNoNewDeletesForDataFiles`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718940797



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       I think that this should not be an `else if`. Instead, it should be an independent `if`. A caller may delete specific files _and_ set a deletion filter and we need to check both cases.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718937868



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.

Review comment:
       This isn't quite accurate because the row filter will be used if it was set by `overwriteByRowFilter`.

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);

Review comment:
       Can you remove this as well? Looks like it was supposed to be removed already.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718924611



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if this is not set?  Should we mention it?

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if this is not set?  Should we mention it?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       Can we use deletedDataFiles.isEmpty for consistency with below?

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and

Review comment:
       Am I right to understand that we use rowFilter to do the detection, if conflictDetectionFilter is not set? Should we mention it?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Takes a bit of time to understand this if - else, especially as conflictDetectionFilter() itself returns conditionally.  Wonder if it can be simplified any way. 
   
   Tracing through the many cases, some cases seem weird, eg:
   
   If (rowFilter != false && !deletedDataFiles.isEmpty && conflictDetectionFilter == true), seems we will call validateNoNewDeleteFiles with conflictDetectionFilter, is it intended?

##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);

Review comment:
       Takes a bit of time to understand this if - else, especially as conflictDetectionFilter() itself returns conditionally.  Wonder if it can be simplified any way. 
   
   Tracing through the many cases,I was wondering some cases, eg:
   
   If (rowFilter != false && !deletedDataFiles.isEmpty && conflictDetectionFilter == true), seems we will call validateNoNewDeleteFiles with conflictDetectionFilter, is it intended?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718527924



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Shouldn't we be using this return value with validateNoConflictingAppends?

##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       Shouldn't we be using this return value with validateNoConflictingDataFiles?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718946725



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       These are independent ways to configure the overwrite. There is no guarantee that checking the delete filter will also check the specific data files that were deleted.
   
   I agree with you that when we call `validateNoNewDeletesForDataFiles` we should only pass the actual conflict detection filter and not the row filter. That row filter may not be relevant for the deleted files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718939818



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.
+   *
+   * @param conflictDetectionFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles conflictDetectionFilter(Expression conflictDetectionFilter);
+
+  /**
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called while committing non-idempotent overwrite operations.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   * <p>
+   * Validation uses the conflict detection filter passed to {@link #conflictDetectionFilter(Expression)} and
+   * applies to operations that happened after the snapshot passed to {@link #validateFromSnapshot(long)}.
+   *
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingDataFiles();
+
+  /**
+   * Enables validation that delete files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * Validating concurrently added delete files is required during non-idempotent overwrite operations.
+   * If a concurrent operation adds a new delete file that applies to one of the data files being overwritten,
+   * the overwrite operation must be aborted as it may undelete rows that were removed concurrently.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for overwrite operations.

Review comment:
       Serializable and snapshot isolation? Maybe just say "required to maintain isolation" for non-idempotent commits.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720345120



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       Yeah, I think we all have some of these. Personally, I don't use `++` at all. When reviewing I ignore it unless the return value is used, which I think makes code harder to understand.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718937868



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -145,4 +151,50 @@
    */
   @Deprecated
   OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression conflictDetectionFilter);
+
+  /**
+   * Sets a conflict detection filter used to validate concurrently added data and delete files.
+   * <p>
+   * If not called, a true literal will be used as the conflict detection filter.

Review comment:
       This isn't quite accurate because the row filter will be used.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r720350420



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,31 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles) {
+      validateAddedDataFiles(base, startingSnapshotId, dataConflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, rowFilter(), caseSensitive);

Review comment:
       Why not use the conflict detection filter here if it is set?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r719936806



##########
File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
##########
@@ -107,23 +107,29 @@
   OverwriteFiles caseSensitive(boolean caseSensitive);
 
   /**
-   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * Enables validation that data files added concurrently do not conflict with this commit's operation.
    * <p>
-   * This method should be called when the table is queried to determine which files to delete/append.
+   * This method should be called while committing non-idempotent overwrite operations.
    * If a concurrent operation commits a new file after the data was read and that file might
    * contain rows matching the specified conflict detection filter, the overwrite operation
    * will detect this during retries and fail.
    * <p>
    * Calling this method with a correct conflict detection filter is required to maintain
-   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * serializable isolation for overwrite operations. Otherwise, the isolation level
    * will be snapshot isolation.
    * <p>
    * Validation applies to files added to the table since the snapshot passed to {@link #validateFromSnapshot(long)}.
    *
    * @param conflictDetectionFilter an expression on rows in the table
    * @return this for method chaining
+   * @deprecated since 0.13.0, will be removed in 0.14.0; use {@link #conflictDetectionFilter(Expression)} and
+   *             {@link #validateNoConflictingDataFiles()} instead.
    */
-  OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter);
+  @Deprecated
+  default OverwriteFiles validateNoConflictingAppends(Expression conflictDetectionFilter) {
+    conflictDetectionFilter(conflictDetectionFilter);

Review comment:
       ```return  conflictDetectionFilter(conflictDetectionFilter).validateNoConflictingDataFiles();```
   
   I just feel weird when we don't use the return value of functions that have them. I know it's a setter and we technically don't need to chain it since we are calling inside the object, it just makes me feel weird :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3199: Core: Validate concurrently added delete files in OvewriteFiles

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3199:
URL: https://github.com/apache/iceberg/pull/3199#discussion_r718944860



##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +147,29 @@ protected void validate(TableMetadata base) {
       }
     }
 
-    if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
-      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter, caseSensitive);
+
+    if (validateNewDataFiles && base.currentSnapshot() != null) {
+      validateAddedDataFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+    }
+
+    if (validateNewDeleteFiles && base.currentSnapshot() != null) {
+      if (rowFilter() != Expressions.alwaysFalse()) {
+        validateNoNewDeleteFiles(base, startingSnapshotId, conflictDetectionFilter(), caseSensitive);
+      } else if (deletedDataFiles.size() > 0) {

Review comment:
       If so, there is no need to call `validateNoNewDeletesForDataFiles` if the other one is called. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org