Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/28 16:18:56 UTC

[GitHub] [iceberg] amogh-jahagirdar opened a new issue, #5653: Performing file deletion in ExpireSnapshots procedure: In-memory Reference Set

amogh-jahagirdar opened a new issue, #5653:
URL: https://github.com/apache/iceberg/issues/5653

   ### Feature Request / Improvement
   
   After https://github.com/apache/iceberg/pull/4578 updated the expire snapshots procedure to respect retention policies for branching and tagging, one significant limitation remains: the procedure cannot perform incremental file deletion. Branch and tag references by themselves give no visibility into which files can be removed; a reference set of "reachable" files has to be built from the metadata tree.
   
   This issue has come up in previous community syncs, and I wanted to discuss the approach:
   
   1.) Update the remove snapshots API implementation to build an in-memory reference set of the files reachable from the retained branch snapshots and tags. This poses a problem for large tables, where the list of files could be too large to hold in memory on a single node, which brings us to point 2.
   
   2.) For users with really large tables, as discussed in a previous community sync, it is reasonable to assume that they have Spark infrastructure for running an effective distributed procedure. Currently the Spark procedure performs the metadata removal when expiring snapshots https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/ExpireSnapshotsSparkAction.java#L185, and the Spark action itself is responsible for the anti-join of the reachable files before and after the expiration, and for the subsequent deletion.
   
   The Spark procedure could also be updated to work better as a distributed procedure in the context of branching and tagging. Conceptually, we could refer to what Nessie does in its garbage collection implementation: https://github.com/projectnessie/nessie/blob/main/gc/gc-base/src/main/java/org/projectnessie/gc/base/GCImpl.java#L58
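
   To make points 1 and 2 concrete, here is a minimal sketch of the reference-set idea in Python. It uses a hypothetical toy model of the metadata tree (plain dictionaries from snapshot ID to manifests and from manifest to data files), not the Iceberg API; the names `reachable_files` and `files_to_delete` are illustrative assumptions. The set difference at the end plays the role of the anti-join of reachable files before and after expiration.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy model, NOT the Iceberg API:
#   snapshot_manifests: snapshot ID -> manifests referenced by that snapshot
#   manifest_files:     manifest    -> data files listed in that manifest


def reachable_files(
    snapshot_ids: Iterable[int],
    snapshot_manifests: Dict[int, Set[str]],
    manifest_files: Dict[str, Set[str]],
) -> Set[str]:
    """Build the in-memory reference set for the given snapshots."""
    reachable: Set[str] = set()
    for snapshot_id in snapshot_ids:
        for manifest in snapshot_manifests.get(snapshot_id, set()):
            reachable.add(manifest)
            reachable.update(manifest_files.get(manifest, set()))
    return reachable


def files_to_delete(
    all_snapshot_ids: Iterable[int],
    retained_snapshot_ids: Iterable[int],
    snapshot_manifests: Dict[int, Set[str]],
    manifest_files: Dict[str, Set[str]],
) -> Set[str]:
    """Anti-join: files reachable before expiration but not after it."""
    before = reachable_files(all_snapshot_ids, snapshot_manifests, manifest_files)
    after = reachable_files(retained_snapshot_ids, snapshot_manifests, manifest_files)
    return before - after


# Illustrative data: snapshot 2 is expired, while snapshots 1 and 3 are
# retained (for example, a tag on 1 and a branch head at 3).
snapshot_manifests = {1: {"m1"}, 2: {"m1", "m2"}, 3: {"m3"}}
manifest_files = {
    "m1": {"a.parquet"},
    "m2": {"b.parquet"},
    "m3": {"b.parquet", "c.parquet"},
}

deletable = files_to_delete({1, 2, 3}, {1, 3}, snapshot_manifests, manifest_files)
```

   In this example only manifest `m2` becomes deletable: `b.parquet` is still reachable through `m3`, which is why the reference set has to cover every retained branch snapshot and tag. The whole set lives in memory on one node here, which is exactly the limitation described in point 1.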
   
   If the community reaches consensus on this plan, I'll start the implementation.
   
   CC: @rdblue @jackye1995 @namrathamyske @aokolnychyi @RussellSpitzer 
   
   ### Query engine
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar closed issue #5653: Performing file deletion in ExpireSnapshots with Branching and Tagging

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar closed issue #5653: Performing file deletion in ExpireSnapshots with Branching and Tagging
URL: https://github.com/apache/iceberg/issues/5653


[GitHub] [iceberg] amogh-jahagirdar commented on issue #5653: Performing file deletion in ExpireSnapshots with Branching and Tagging

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on issue #5653:
URL: https://github.com/apache/iceberg/issues/5653#issuecomment-1296299673

   We can resolve this now that the reachability analysis PR is merged.


[GitHub] [iceberg] amogh-jahagirdar commented on issue #5653: Performing file deletion in ExpireSnapshots with Branching and Tagging

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on issue #5653:
URL: https://github.com/apache/iceberg/issues/5653#issuecomment-1245380432

   Last week I raised this [PR](https://github.com/apache/iceberg/pull/5669) to address this.

