Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/18 16:28:41 UTC

[GitHub] [iceberg] rdblue commented on pull request #4736: WIP: Improve performance of expire snapshot by not double-scanning non-expired manifests

rdblue commented on PR #4736:
URL: https://github.com/apache/iceberg/pull/4736#issuecomment-1130232555

   I think this is a really good idea, and I also like Anton's idea to filter manifests that are still live from the removed snapshots as well. Basically, we should be filtering the tree at every level (snapshot/manifest-list, manifest, files) before moving on to the next one.
   
   1. Open the old metadata file and find snapshots that are no longer in the current metadata
   3. Create a DF of the `manifests` table from all the expired snapshots
   4. Create a DF of the `manifests` table from all the current snapshots
   5. Remove any manifests from the expired set that are currently in the table (all files are still referenced)
   6. Remove any manifests from the current set that have no EXISTING files (they cannot contain old files). This step is optional because files that were removed and later re-added would not be caught.
   7. Transform the expired manifests DF to expired data files by reading each manifest
   8. Transform the current manifests DF to current data files by reading each manifest
   9. Remove any data files from the expired data files set that are in the current data files set
   
   Does that sound right? We may not want to do #6, but there is an opportunity to cut down on the number of manifests we read there by looking at the manifest file metadata.
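
   To make the level-by-level filtering concrete, here is a rough sketch of the steps above using plain Python sets in place of Spark DataFrames and anti-joins. Everything in it is an assumption for illustration: the `Manifest` class, its `existing_files_count` field, the `Metadata` mapping of snapshot id to manifest list, and the `files_to_delete` helper are hypothetical names, not Iceberg APIs. The step numbers in the comments follow the list above.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for one row of the `manifests` table."""
    path: str
    existing_files_count: int   # number of entries with status EXISTING
    data_files: FrozenSet[str]  # what reading the manifest would yield


# Hypothetical model of a metadata file: snapshot id -> its manifest list.
Metadata = Dict[int, List[Manifest]]


def _manifests(metadata: Metadata, snapshot_ids: Set[int]) -> Dict[str, Manifest]:
    """Collect the distinct manifests (keyed by path) of the given snapshots."""
    return {m.path: m for sid in snapshot_ids for m in metadata[sid]}


def _data_files(manifests: Dict[str, Manifest]) -> Set[str]:
    """Simulate reading each manifest and collecting its data file paths."""
    files: Set[str] = set()
    for m in manifests.values():
        files |= m.data_files
    return files


def files_to_delete(old: Metadata, current: Metadata,
                    skip_no_existing: bool = False) -> Set[str]:
    # Step 1: snapshots in the old metadata that are gone from the current one.
    expired_ids = old.keys() - current.keys()

    # Steps 3-4: manifests of the expired and of the current snapshots.
    expired = _manifests(old, expired_ids)
    live = _manifests(current, set(current))

    # Step 5: a manifest that is still live keeps all of its files referenced.
    expired = {p: m for p, m in expired.items() if p not in live}

    # Step 6 (optional): skip current manifests with no EXISTING entries.
    # As noted above, this would miss files that were removed and re-added.
    if skip_no_existing:
        live = {p: m for p, m in live.items() if m.existing_files_count > 0}

    # Steps 7-9: read the remaining manifests and subtract live data files.
    return _data_files(expired) - _data_files(live)
```

   The point of the ordering is that the expensive part, reading manifests in steps 7 and 8, only happens for manifests that survived the cheaper manifest-level filters in steps 5 and 6.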


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

