Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/29 16:37:34 UTC

[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #5666: Bug Fix for Expire Snapshots: Fix ancestor lookup during file cleanup

amogh-jahagirdar opened a new pull request, #5666:
URL: https://github.com/apache/iceberg/pull/5666

   Currently, ExpireSnapshots can only clean up files when the table has exactly one reference (either main or a single non-main branch). However, the ancestor lookup is always based on the main table state.
   
   When updating the expire snapshot tests that cover deletions on a branch in this PR https://github.com/apache/iceberg/pull/5618/files, I encountered test failures in the non-main branch case because data files were deleted that should not have been. The expired snapshots were the expected ones, but unexpected data files were deleted for the branch commit because some manifests were reverted that should not have been. The check [here](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L503) passed unexpectedly because isFromAncestor evaluated to false (the rest of the checks behaved as expected), so the procedure added manifests that should not be reverted to the reverted set.
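
The effect described above can be illustrated with a minimal sketch. It does not use Iceberg's classes: `AncestorLookupSketch`, `ancestorsOf`, and the parent-pointer map are hypothetical stand-ins, and the snapshot layout is invented for illustration. It only shows that an ancestor set computed from one head does not contain snapshots that live on a different lineage, so an "is this from an ancestor" test gives a different answer depending on which head the walk starts from.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for snapshot lineage; not Iceberg's actual classes.
class AncestorLookupSketch {
  // Walks parent pointers from startId back to the root and collects every id visited.
  static Set<Long> ancestorsOf(long startId, Map<Long, Long> parentOf) {
    Set<Long> ancestors = new LinkedHashSet<>();
    Long current = startId;
    // Stop at the root (null parent) or if an id repeats, which guards against cycles.
    while (current != null && ancestors.add(current)) {
      current = parentOf.get(current);
    }
    return ancestors;
  }

  public static void main(String[] args) {
    // Assumed layout: 1 <- 2 <- 3 is one lineage, and 2 <- 4 is the head of another.
    Map<Long, Long> parentOf = new HashMap<>();
    parentOf.put(1L, null);
    parentOf.put(2L, 1L);
    parentOf.put(3L, 2L);
    parentOf.put(4L, 2L);

    Set<Long> fromOtherHead = ancestorsOf(3L, parentOf);
    Set<Long> fromBranchHead = ancestorsOf(4L, parentOf);

    // Snapshot 4 is only an ancestor when the walk starts at the branch head.
    System.out.println("from head 3 contains 4: " + fromOtherHead.contains(4L));
    System.out.println("from head 4 contains 4: " + fromBranchHead.contains(4L));
  }
}
```

Starting the walk from the head of the ref being cleaned up, rather than from a fixed head, is the idea behind the fix in this PR; the actual change is in `RemoveSnapshots.java`.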


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #5666: Bug Fix for Expire Snapshots: Fix ancestor lookup during file cleanup

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #5666:
URL: https://github.com/apache/iceberg/pull/5666#discussion_r957643199


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -366,11 +367,19 @@ private void removeExpiredFiles(
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.
 
-    // this is the set of ancestors of the current table state. when removing snapshots, this must
-    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // ToDo: This will be removed when reachability analysis is done so files across multiple
+    // branches can be removed
+    SnapshotRef branchToCleanup = Iterables.getFirst(base.refs().values(), null);

Review Comment:
   My thinking is the following:
   
   1.) Logically, a tagged snapshot has to exist on either a.) a non-main branch or b.) the main branch.
   2.) If the tag exists on main, file cleanup couldn't (currently) be done in the first place (because main cannot age off, so we'd have multiple refs), so this point wouldn't have been reached.
   3.) If the tag exists on a non-main branch and that branch ages off before the tagged snapshot, which is retained, then the tag ends up being the de facto "tip" of a lineage, in which case the expiration logic would work as expected. If the non-main branch is still retained, then we wouldn't reach this point (same case as 2, just that the other ref is the non-main branch).
   
   Combining this with the fact that writes cannot be performed on tags leads me to believe that, for the purpose of expiration, specifically determining which files to delete, there's no need to differentiate tags and branches.
   
   I could call this refToCleanup if that makes more sense to folks? But the only case where this is a tag is the one I mentioned in 3.), in which case it's just a "dangling" snapshot that is referenced by a tag. @namrathamyske @rdblue @jackye1995
   
   Also, let me know if there's a flaw in my logic.
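
The reasoning above can be sketched roughly as follows. This is an illustrative model only: `SingleRefCleanupSketch`, `Ref`, `RefType`, and `cleanupHead` are hypothetical names, not Iceberg's API. The sketch assumes cleanup is attempted only when exactly one ref remains, and that in that case the ref's type does not affect which snapshot the ancestor lookup starts from.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of table refs; these types are illustrative, not Iceberg's API.
class SingleRefCleanupSketch {
  enum RefType { BRANCH, TAG }

  static final class Ref {
    final RefType type;
    final long snapshotId;

    Ref(RefType type, long snapshotId) {
      this.type = type;
      this.snapshotId = snapshotId;
    }
  }

  // Returns the snapshot id to start the ancestor lookup from, or empty when
  // zero or several refs remain and file cleanup is skipped.
  static Optional<Long> cleanupHead(Map<String, Ref> refs) {
    if (refs.size() != 1) {
      return Optional.empty();
    }
    Ref only = refs.values().iterator().next();
    // The ref type is deliberately ignored: a lone tag is handled like a lone branch.
    return Optional.of(only.snapshotId);
  }

  public static void main(String[] args) {
    // Case 3 above: the branch aged off and only a tag remains.
    Map<String, Ref> loneTag = new LinkedHashMap<>();
    loneTag.put("audit-tag", new Ref(RefType.TAG, 4L));
    System.out.println(cleanupHead(loneTag));

    // Case 2 above: main is still present next to the tag, so cleanup is skipped.
    Map<String, Ref> mainPlusTag = new LinkedHashMap<>();
    mainPlusTag.put("main", new Ref(RefType.BRANCH, 3L));
    mainPlusTag.put("audit-tag", new Ref(RefType.TAG, 2L));
    System.out.println(cleanupHead(mainPlusTag));
  }
}
```

Under these assumptions a name such as refToCleanup would describe the selected value more accurately than branchToCleanup, since the single remaining ref may be a tag.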





[GitHub] [iceberg] rdblue closed pull request #5666: Core: Fix ancestor lookup during expire file cleanup

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #5666: Core: Fix ancestor lookup during expire file cleanup
URL: https://github.com/apache/iceberg/pull/5666




[GitHub] [iceberg] rdblue commented on a diff in pull request #5666: Core: Fix ancestor lookup during expire file cleanup

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5666:
URL: https://github.com/apache/iceberg/pull/5666#discussion_r986053816


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -366,11 +367,19 @@ private void removeExpiredFiles(
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.
 
-    // this is the set of ancestors of the current table state. when removing snapshots, this must
-    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // ToDo: This will be removed when reachability analysis is done so files across multiple

Review Comment:
   Nit: It should be `TODO`.





[GitHub] [iceberg] namrathamyske commented on a diff in pull request #5666: Bug Fix for Expire Snapshots: Fix ancestor lookup during file cleanup

Posted by GitBox <gi...@apache.org>.
namrathamyske commented on code in PR #5666:
URL: https://github.com/apache/iceberg/pull/5666#discussion_r957587491


##########
core/src/main/java/org/apache/iceberg/RemoveSnapshots.java:
##########
@@ -366,11 +367,19 @@ private void removeExpiredFiles(
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.
 
-    // this is the set of ancestors of the current table state. when removing snapshots, this must
-    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // ToDo: This will be removed when reachability analysis is done so files across multiple
+    // branches can be removed
+    SnapshotRef branchToCleanup = Iterables.getFirst(base.refs().values(), null);

Review Comment:
   Here the expectation is that only one ref exists, either main or a branch ref. What if it's a tag ref?


