You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/03 23:12:52 UTC
[GitHub] [iceberg] aokolnychyi opened a new pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
aokolnychyi opened a new pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071
This PR optimizes our check for referenced data files in `BaseRowDelta` by pushing down the conflict detection filter. Previously, we would open manifests even though they belonged to partitions out of our interest.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168
Added this as it was included in 0.12.1 (made cherry-picking easier).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702195345
##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
protected void validate(TableMetadata base) {
if (base.currentSnapshot() != null) {
if (!referencedDataFiles.isEmpty()) {
- validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+ validateDataFilesExist(
+ base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);
Review comment:
This does make the assumption that the conflict detection filter and referenced data files are related.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168
Added this as it was included in 0.12.1 (made cherry-picking easier).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-956541168
Added this as it was included in 0.12.1 (made cherry-picking easier).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702246022
##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
files(FILE_A_DELETES),
statuses(Status.ADDED));
}
+
+ @Test
+ public void testValidateDataFilesExistWithConflictDetectionFilter() {
Review comment:
Sure, I'll add one.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-912858763
cc @openinx @rdblue @RussellSpitzer @jackye1995 @yyanyy @flyrain @kbendick @karuppayya @raptond
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi merged pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-918590711
Thanks for reviewing, @RussellSpitzer @rdblue!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702229657
##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
files(FILE_A_DELETES),
statuses(Status.ADDED));
}
+
+ @Test
+ public void testValidateDataFilesExistWithConflictDetectionFilter() {
Review comment:
Should we also add a test where the validation fails? It looks like this one just checks that you can do isolated operations but I think we should do a conflicting test as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r707618602
##########
File path: core/src/test/java/org/apache/iceberg/TestRowDelta.java
##########
@@ -629,4 +630,74 @@ public void testFastAppendDoesNotRemoveStaleDeleteFiles() {
files(FILE_A_DELETES),
statuses(Status.ADDED));
}
+
+ @Test
+ public void testValidateDataFilesExistWithConflictDetectionFilter() {
Review comment:
Added a negative test too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#issuecomment-918589922
I'll merge this one to unblock subsequent PRs. I added the missing test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] aokolnychyi commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r702197418
##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
protected void validate(TableMetadata base) {
if (base.currentSnapshot() != null) {
if (!referencedDataFiles.isEmpty()) {
- validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+ validateDataFilesExist(
+ base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);
Review comment:
I think that is a correct assumption as the conflict detection filter is our scan condition and referenced data files are data files were read.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a change in pull request #3071: Core: Optimize check for referenced data files in BaseRowDelta
Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3071:
URL: https://github.com/apache/iceberg/pull/3071#discussion_r705788980
##########
File path: core/src/main/java/org/apache/iceberg/BaseRowDelta.java
##########
@@ -91,7 +91,8 @@ public RowDelta validateNoConflictingAppends(Expression newConflictDetectionFilt
protected void validate(TableMetadata base) {
if (base.currentSnapshot() != null) {
if (!referencedDataFiles.isEmpty()) {
- validateDataFilesExist(base, startingSnapshotId, referencedDataFiles, !validateDeletes);
+ validateDataFilesExist(
+ base, startingSnapshotId, referencedDataFiles, !validateDeletes, conflictDetectionFilter);
Review comment:
I think that's a valid assumption.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org