You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/03 23:12:52 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

aokolnychyi opened a new pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071


   This PR optimizes our check for referenced data files in `BaseRowDelta` by pushing down the conflict detection filter. Previously, we would open manifests even though they belonged to partitions out of our interest.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168


   Added this as it was included in 0.12.1 (made cherry-picking easier).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702195345



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
-        validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+        validateDataFilesExist(
+            base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);

Review comment:
       This does make the assumption that the conflict detection filter and referenced data files are related.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168


   Added this as it was included in 0.12.1 (made cherry-picking easier).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168


   Added this as it was included in 0.12.1 (made cherry-picking easier).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702246022



##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
         files(FILE_A_DELETES),
         statuses(Status.ADDED));
   }
+
+  @Test
+  public void testValidateDataFilesExistWithConflictDetectionFilter() {

Review comment:
       Sure, I'll add one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-912858763


   cc @openinx @rdblue @RussellSpitzer @jackye1995 @yyanyy @flyrain @kbendick @karuppayya @raptond


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-918590711


   Thanks for reviewing, @RussellSpitzer @rdblue!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702229657



##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
         files(FILE_A_DELETES),
         statuses(Status.ADDED));
   }
+
+  @Test
+  public void testValidateDataFilesExistWithConflictDetectionFilter() {

Review comment:
       Should we also add a test where the validation fails? It looks like this one just checks that you can do isolated operations but I think we should do a conflicting test as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r707618602



##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
         files(FILE_A_DELETES),
         statuses(Status.ADDED));
   }
+
+  @Test
+  public void testValidateDataFilesExistWithConflictDetectionFilter() {

Review comment:
       Added a negative test too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-918589922


   I'll merge this one to unblock subsequent PRs. I added the missing test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702197418



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
-        validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+        validateDataFilesExist(
+            base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);

Review comment:
       I think that is a correct assumption as the conflict detection filter is our scan condition and referenced data files are data files were read.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r705788980



##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
   protected void validate(TableMetadata base) {
     if (base.currentSnapshot() != null) {
       if (!referencedDataFiles.isEmpty()) {
-        validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+        validateDataFilesExist(
+            base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);

Review comment:
       I think that's a valid assumption.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org