Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/21 03:15:42 UTC

[GitHub] [iceberg] jackye1995 commented on pull request #2591: Spark: RewriteDatafilesAction V2

jackye1995 commented on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845621625


   > We cannot and are not deleting delete files in this action because it's actually much more difficult to find out which delete files are no longer in use than just checking which ones are referred to by the FileScanTasks for the files we are looking at.
   
   Yes, I am aware of this condition, but the delete files are applied for each file scan task anyway; it's just that we cannot remove them because of the condition you described, so we have to call another action and do double the work to fully remove the files. Conversely, say we have another action that only removes delete files: we are reading those delete files anyway, and it also feels wasteful to me that we would have to do another bin pack after deleting them to keep the data files optimized, potentially causing more commit conflicts.
   
   I understand there is a good separation of concerns if we do them as two different actions. But when I try to imagine what the compaction API looks like, it seems that I would just need a different `selectFilesToRewrite` implementation of the rewrite strategy, and most other things could be reused with only a little branching logic (roughly the shape sketched below).
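   
   Just to make that concrete, here is a minimal sketch of the hook I have in mind; the interface and parameter names are illustrative rather than exactly what this PR defines:
   
   ```java
   import org.apache.iceberg.FileScanTask;
   
   // Illustrative sketch only: the strategy's job is deciding which data files
   // (as planned scan tasks) get rewritten; everything downstream is shared.
   interface RewriteStrategy {
     Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles);
   }
   ```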
   
   So instead of having another, totally different action that removes delete files, the major compaction can potentially be done as just an extension of the existing strategy, or by swapping in a different strategy. For example, we could extend the current bin pack strategy with the rule: if there are delete files in a file scan task, then the data file must be included for rewriting. We could also plug in a strategy that tries to select all data files past a certain delete file threshold, etc.; see the sketch below.
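   
   Something like the following, purely as a sketch of the two selection rules (the class and method names are made up for illustration; the only API I rely on is `FileScanTask.deletes()`):
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;
   import java.util.stream.StreamSupport;
   import org.apache.iceberg.FileScanTask;
   
   // Hypothetical selection helpers for a delete-aware strategy.
   class DeleteAwareSelection {
   
     // Bin pack extension: any data file with delete files attached must be rewritten,
     // on top of whatever the size-based bin packing already selected.
     static List<FileScanTask> filesWithDeletes(Iterable<FileScanTask> tasks) {
       return StreamSupport.stream(tasks.spliterator(), false)
           .filter(task -> !task.deletes().isEmpty())
           .collect(Collectors.toList());
     }
   
     // Threshold strategy: rewrite a data file once the number of delete files
     // referencing it reaches a configurable threshold.
     static List<FileScanTask> filesOverDeleteThreshold(Iterable<FileScanTask> tasks, int minDeleteFiles) {
       return StreamSupport.stream(tasks.spliterator(), false)
           .filter(task -> task.deletes().size() >= minDeleteFiles)
           .collect(Collectors.toList());
     }
   }
   ```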
   
   We can get more clever about that as we evolve, but my general thought is that having data file rewriting and delete file compaction as one base action, with different strategies to satisfy different use cases, seems to be a more efficient and flexible way to go.
   

