Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/18 19:17:44 UTC

[GitHub] [iceberg] szehon-ho commented on pull request #6581: Spark 3.3: Add RemoveDanglingDeletes action

szehon-ho commented on PR #6581:
URL: https://github.com/apache/iceberg/pull/6581#issuecomment-1387630786

   After chatting with @aokolnychyi and @RussellSpitzer, here is a guide to when this can be used.
   
   
   There will be two types of operations that can remove delete files:
   
   | Operation | Cost | File Type | Description |
   | --- | --- | --- | --- |
   | RemoveDanglingDeletes | Metadata-only; cost is similar to querying the files/partitions metadata tables | Both | Removes delete files whose sequence number is less than the minimum sequence number of all data files in the same partition |
   | RewritePositionDeletes | Data operation; needs to read and write all affected delete files | Position only (equality deletes must first be converted to position deletes) | Reads all position delete files matching the given filter and writes them back out, dropping position delete entries that refer to data files that no longer exist |
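   
   To make the RemoveDanglingDeletes rule concrete, here is a minimal sketch in plain Python. It is only a model of the sequence number comparison described in the table, not the action's implementation: the `(partition, sequence_number)` tuples and the function name are invented for illustration, and skipping partitions that have no data files is an assumption.

```python
def find_dangling_deletes(data_files, delete_files):
    """Return delete files older than every data file in their partition.

    Both arguments are lists of (partition, sequence_number) tuples. This
    shape is invented for the sketch and is not Iceberg's metadata model.
    """
    # Minimum data file sequence number per partition.
    min_data_seq = {}
    for partition, seq in data_files:
        min_data_seq[partition] = min(seq, min_data_seq.get(partition, seq))

    dangling = []
    for partition, seq in delete_files:
        # Assumption: partitions with no data files are left alone here.
        if partition in min_data_seq and seq < min_data_seq[partition]:
            dangling.append((partition, seq))
    return dangling


data_files = [("p=1", 5), ("p=1", 7), ("p=2", 2)]
delete_files = [("p=1", 3), ("p=1", 6), ("p=2", 4)]
print(find_dangling_deletes(data_files, delete_files))  # [('p=1', 3)]
```

   Only the delete file at sequence number 3 in `p=1` is older than every data file in its partition, so it is the only one reported.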
   
   Use case: RemoveDanglingDeletes is cheaper, and it is the only one that works across both types of delete files. However, for it to remove every dangling delete file, RewriteDataFiles must have been run with the following conditions:
   * Filter that includes entire partition(s)
   * All data files in the partition that have delete files get rewritten, i.e. any of these:
     * rewrite-all=true
     * delete-file-threshold=1
     * All data files happen to meet the criteria of rewrite without these flags.
   * 'use-starting-sequence-number' must be false, so that old delete files can be identified as invalid by the sequence number rule. This is only needed for position deletes, because equality deletes are not applied to data files with the same sequence number.
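   
   The conditions above can be sketched as a small check over the options passed to RewriteDataFiles. This is a hypothetical helper, not part of Iceberg: the function name and the two boolean arguments are invented, and the values assumed when an option is absent ('rewrite-all' off, 'use-starting-sequence-number' on) are assumptions of the sketch.

```python
def rewrite_allows_full_cleanup(options, filter_covers_whole_partitions,
                                all_files_qualify_anyway=False):
    """Check the listed conditions for a RewriteDataFiles run.

    `options` maps option names to string values. The fallbacks used when
    an option is absent are assumptions of this sketch.
    """
    # Any one of the three sub-conditions is enough.
    rewrites_every_file = (
        options.get("rewrite-all", "false").lower() == "true"
        or options.get("delete-file-threshold") == "1"
        or all_files_qualify_anyway
    )
    # Must be explicitly false for the sequence number rule to apply.
    sequence_rule_usable = (
        options.get("use-starting-sequence-number", "true").lower() == "false"
    )
    return (filter_covers_whole_partitions
            and rewrites_every_file
            and sequence_rule_usable)


opts = {"rewrite-all": "true", "use-starting-sequence-number": "false"}
print(rewrite_allows_full_cleanup(opts, filter_covers_whole_partitions=True))  # True
```

   If any condition fails, the check returns False, which only means the cleanup may be partial, not that it removes nothing.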
   
   Note that RemoveDanglingDeletes can still remove some delete files when these conditions are not met, but it may not remove all of them, because an old data file (one with a low sequence number) that was not rewritten prevents delete files from being removed.
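   
   A small worked example of that blocking effect, for a single partition. The sequence numbers are made up and the helper is hypothetical; it only applies the "less than the minimum data sequence number" comparison.

```python
def split_deletes(data_seqs, delete_seqs):
    """Split one partition's delete sequence numbers into (removable, kept)."""
    floor = min(data_seqs)  # lowest data file sequence number in the partition
    removable = [s for s in delete_seqs if s < floor]
    kept = [s for s in delete_seqs if s >= floor]
    return removable, kept


delete_seqs = [3, 5, 8]
# One old data file (sequence number 2) was not rewritten.
print(split_deletes([2, 10, 10], delete_seqs))   # ([], [3, 5, 8])
# Every data file was rewritten at sequence number 10.
print(split_deletes([10, 10, 10], delete_seqs))  # ([3, 5, 8], [])
```

   A single data file left at sequence number 2 keeps all three delete files alive; once it is rewritten, all three become removable.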
   
   So I'm open to discussion on whether there is a good use case for this. One idea is to bundle it with RewriteDataFiles, and either trigger it optimistically when these conditions are met, or trigger it in every case in the hope that it removes some delete files, as it's relatively cheap.
   
   Otherwise, the complete solution (all still to be developed) would be:
   * For position deletes, run RewritePositionDeletes across all partitions.
   * For equality deletes, run ConvertToPosDeletes, then RewritePositionDeletes across all partitions.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

