Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/20 13:46:51 UTC

[GitHub] [iceberg] RussellSpitzer commented on pull request #3772: [Api][Spark]Add cleanExpiredFiles to Actions/ExpireSnapshots

RussellSpitzer commented on pull request #3772:
URL: https://github.com/apache/iceberg/pull/3772#issuecomment-997938486


   > @ajantha-bhat @rdblue Thanks for your review! Maybe I didn't make it clear: the _expire_snapshots_ failure I hit is caused not by deleting files but by building the expired-files dataset, because some files needed to build expiredFiles were deleted by _remove_orphan_files_, which happened to start right after the _expire_snapshots_ commit. So plugging in a dummy consumer method to avoid deleting the files has no effect in this situation. This situation is quite rare, but it truly does happen. I think a simple way to avoid it is to execute _remove_orphan_files_ after the _expire_snapshots_ job is done, but our _remove_orphan_files_ procedure is scheduled automatically by a program, so it's a bit difficult to control the timing.
   
   I agree with Ryan: the thing to do here is to make sure "expire snapshots" won't fail in this sort of situation. Basically, just add a bunch of try blocks so that if we can't determine the correct set of files to remove, we log warnings for those files and continue as normal. This would mean not cleaning up the total set of unused files, since we wouldn't be able to determine that full set, but since RemoveOrphans is running concurrently in this example, that really doesn't matter.
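
As a rough illustration of the "log a warning and continue" idea, here is a minimal Python sketch. It is not Iceberg code: `collect_expired_files`, `read_manifest`, and the file names are hypothetical stand-ins, and it assumes the only manifests being read belong to snapshots that are already expired.

```python
import logging
from typing import Callable, Iterable, List, Set, Tuple

logger = logging.getLogger("expire_snapshots_sketch")


def collect_expired_files(
    manifests: Iterable[str],
    read_manifest: Callable[[str], Iterable[str]],
) -> Tuple[Set[str], List[str]]:
    """Best-effort scan: return (files found, manifests that were skipped).

    `read_manifest` is a hypothetical callable that lists the files
    referenced by one manifest and raises OSError if it cannot be read.
    """
    found: Set[str] = set()
    skipped: List[str] = []
    for manifest in manifests:
        try:
            # Materialize first so a failure part-way through adds nothing.
            files = list(read_manifest(manifest))
        except OSError as err:  # FileNotFoundError is a subclass of OSError
            logger.warning("Skipping unreadable manifest %s: %s", manifest, err)
            skipped.append(manifest)
            continue
        found.update(files)
    return found, skipped


def fake_reader(path: str) -> List[str]:
    """Toy reader: m2.avro has already been removed by a concurrent job."""
    store = {
        "m1.avro": ["a.parquet", "b.parquet"],
        "m3.avro": ["c.parquet"],
    }
    if path not in store:
        raise FileNotFoundError(path)
    return store[path]


if __name__ == "__main__":
    files, skipped = collect_expired_files(
        ["m1.avro", "m2.avro", "m3.avro"], fake_reader
    )
    print(sorted(files), skipped)
```

The missing manifest only shrinks the set of files that get cleaned up; the job itself still completes, and whatever was missed is left for a later orphan-file cleanup.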
   
   That said, the ExpireSnapshots method is far more efficient than RemoveOrphans, and we should always encourage users to use Expire rather than Remove to delete files. For this particular use case I would probably just have the same program that runs expire snapshots also run remove orphans, so that they never run concurrently.
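
A minimal Python sketch of that scheduling idea, under stated assumptions: `run_maintenance` and the two callables are hypothetical placeholders for whatever actually launches the expire-snapshots and remove-orphans jobs, and the lock only guards against overlap inside a single process.

```python
import threading
from typing import Callable

# Process-local guard so two scheduler ticks cannot overlap.
_maintenance_lock = threading.Lock()


def run_maintenance(
    expire_snapshots: Callable[[], None],
    remove_orphan_files: Callable[[], None],
) -> bool:
    """Run both jobs back to back; return False if a run is already active."""
    if not _maintenance_lock.acquire(blocking=False):
        return False
    try:
        # Orphan removal starts only after expiration has fully returned.
        expire_snapshots()
        remove_orphan_files()
        return True
    finally:
        _maintenance_lock.release()


if __name__ == "__main__":
    order = []
    run_maintenance(
        lambda: order.append("expire_snapshots"),
        lambda: order.append("remove_orphan_files"),
    )
    print(order)
```

If the expire step raises, the exception propagates and the orphan cleanup is skipped for that run; whether that is the right policy depends on the deployment.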


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


