Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/21 04:25:11 UTC

[GitHub] [iceberg] jackye1995 edited a comment on pull request #2591: Spark: RewriteDatafilesAction V2

jackye1995 edited a comment on pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#issuecomment-845641397


   > If we now want to check if we can remove Delete File A we only have to read files C and D so we actually made progress.
   
   I think this is the place where I am a bit confused. A' and B' definitely don't need delete file A, because the sequence numbers of A' and B' are higher. But we don't read C and D in order to add delete file A to the FileScanTask for C and D. That is done by reading the statistics of delete file A and is determined by the partition filter: as long as there are data files with a lower sequence number in that partition, the delete file will be included in each such file scan task, provided the data filter also passes.
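   As a hedged sketch of the pruning rule described above (illustrative names only, not the Iceberg API; this models the strictly-greater comparison used for equality deletes as an assumption):

   ```java
   // Illustrative sketch, not Iceberg's actual planner code: a delete file
   // is attached to a data file's scan task only when the data file has a
   // lower sequence number (it was written before the delete) and the delete
   // file's partition and data filters both match the task.
   public class DeletePruning {
       public static boolean applies(long dataFileSeq, long deleteFileSeq,
                                     boolean partitionMatches, boolean dataFilterMatches) {
           return deleteFileSeq > dataFileSeq && partitionMatches && dataFilterMatches;
       }
   }
   ```

   Under this rule, rewritten files like A' and B' (higher sequence numbers) never pick up delete file A, while older files C and D in the same partition still do.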
   
   This means that if we keep a reference counter for each delete file and expose a method `cleanUnreferencedDeleteFiles()` that is called after `planFileGroups()`, we can naturally get all the files compacted just by running bin packing continuously.
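   A minimal sketch of that counter idea, assuming a hypothetical helper class (the names `register`, `reference`, and `unreferencedDeleteFiles` are illustrative, not part of Iceberg):

   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.atomic.AtomicLong;

   // Hypothetical sketch of the proposal: count how many file scan tasks
   // reference each delete file during planning, then report delete files
   // with a zero count so they can be dropped after planFileGroups().
   public class DeleteFileRefCounter {
       private final Map<String, AtomicLong> refCounts = new HashMap<>();

       // Called once for each delete file discovered in the snapshot.
       public void register(String deleteFilePath) {
           refCounts.putIfAbsent(deleteFilePath, new AtomicLong(0));
       }

       // Called each time planning attaches the delete file to a scan task.
       public void reference(String deleteFilePath) {
           refCounts.computeIfAbsent(deleteFilePath, k -> new AtomicLong(0))
                    .incrementAndGet();
       }

       // Analogous to the proposed cleanUnreferencedDeleteFiles(): delete
       // files no scan task refers to, which are safe to remove.
       public List<String> unreferencedDeleteFiles() {
           List<String> unreferenced = new ArrayList<>();
           for (Map.Entry<String, AtomicLong> e : refCounts.entrySet()) {
               if (e.getValue().get() == 0) {
                   unreferenced.add(e.getKey());
               }
           }
           return unreferenced;
       }
   }
   ```

   The appeal of this design is that no extra reads are needed: the counter is populated as a side effect of the planning pass that already matches delete files to data files.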




