Posted to issues@iceberg.apache.org by "aokolnychyi (via GitHub)" <gi...@apache.org> on 2023/04/28 23:03:18 UTC

[GitHub] [iceberg] aokolnychyi commented on pull request #5760: Core: Add minimum data sequence number to ManifestEntry

aokolnychyi commented on PR #5760:
URL: https://github.com/apache/iceberg/pull/5760#issuecomment-1528178324

   Let me go through an example to make sure I understand the purpose of this PR.
   
   ```
   seq 0: Add DataFile1
   seq 1: Add DataFile2
   seq 2: Add DataFile3
   seq 3: Add DataFile4
   seq 4: Add PositionDeletes3 (references only DataFile3) (min referenced sequence number is seq 2).
   ```
   
   Without this PR, PositionDeletes3 will most likely be assigned to DataFile1, DataFile2, DataFile3, and DataFile4, even though the position deletes apply only to DataFile3. After this PR, PositionDeletes3 will be assigned only to DataFile3 and DataFile4.
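
To make the example concrete, here is a minimal Python sketch of the pruning as I understand it. It is not Iceberg code: `DataFile`, `PositionDeleteFile`, `min_referenced_seq`, and `candidate_data_files` are hypothetical names, and partition matching is ignored entirely. The only rule assumed is that a position delete can affect data files whose sequence number is not greater than its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    name: str
    seq: int  # data sequence number


@dataclass(frozen=True)
class PositionDeleteFile:
    name: str
    seq: int
    # Hypothetical field standing in for the value this PR proposes.
    min_referenced_seq: Optional[int] = None


def candidate_data_files(delete: PositionDeleteFile, data_files: List[DataFile]) -> List[str]:
    """Return the names of data files the delete file would be assigned to."""
    names = []
    for data_file in data_files:
        # A position delete cannot affect data files newer than itself.
        if data_file.seq > delete.seq:
            continue
        # Assumed extra pruning: skip data files older than anything referenced.
        if delete.min_referenced_seq is not None and data_file.seq < delete.min_referenced_seq:
            continue
        names.append(data_file.name)
    return names


# DataFile1..DataFile4 with sequence numbers 0..3, as in the example.
data_files = [DataFile(f"DataFile{i + 1}", i) for i in range(4)]
without_pr = candidate_data_files(PositionDeleteFile("PositionDeletes3", 4), data_files)
with_pr = candidate_data_files(PositionDeleteFile("PositionDeletes3", 4, 2), data_files)
```

Under these assumptions `without_pr` contains all four data files, while `with_pr` contains only DataFile3 and DataFile4.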
   
   Did I get that right? And does it apply only to position deletes?
   
   My primary worry is that this would require a spec change and quite a bit of code to populate the new value. For instance, we currently track only file names when writing position deletes. With this change, we would also have to project and keep track of the sequence number of each referenced data file. Even after all of that, we could still get false positives.
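
A rough sketch of the extra bookkeeping I have in mind, in Python for brevity. `PositionDeleteTracker` and its members are hypothetical and do not mirror the actual writer classes; the sketch only shows that every delete record would need to carry the data file's sequence number, and that a minimum alone still cannot rule out newer files that were never referenced.

```python
from typing import Optional, Set


class PositionDeleteTracker:
    """Hypothetical bookkeeping a position delete writer would need."""

    def __init__(self) -> None:
        # What is tracked today, per the comment above: referenced file names.
        self.referenced_paths: Set[str] = set()
        # What would additionally be tracked: the smallest referenced sequence number.
        self.min_referenced_seq: Optional[int] = None

    def delete(self, path: str, pos: int, data_seq: int) -> None:
        # `pos` would be written to the delete file; it is unused in this sketch.
        self.referenced_paths.add(path)
        if self.min_referenced_seq is None or data_seq < self.min_referenced_seq:
            self.min_referenced_seq = data_seq


tracker = PositionDeleteTracker()
tracker.delete("DataFile3", 0, data_seq=2)
tracker.delete("DataFile3", 5, data_seq=2)

# DataFile4 (seq 3) passes the minimum check even though it is never referenced.
data_file4_passes_min_check = 3 >= tracker.min_referenced_seq
data_file4_is_referenced = "DataFile4" in tracker.referenced_paths
```

In the example from earlier, DataFile4 is exactly this kind of false positive: it passes the sequence number check but has no deletes.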
   
   I am currently working on an alternative planning approach for position deletes in Spark, where I want to open delete files in a distributed manner and squash them into one bitmap per data file. This would give us a reliable way to check whether delete files apply and would also avoid opening the same delete file multiple times for different data files.
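
The bitmap idea, reduced to a single-process Python sketch. This is not the Spark implementation: `build_bitmaps` and `is_deleted` are hypothetical helpers, each delete file is modeled as a list of `(data_file_path, position)` records, and a plain Python integer stands in for whatever compressed bitmap a real implementation would use.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

DeleteRecord = Tuple[str, int]  # (data file path, row position)


def build_bitmaps(delete_files: Iterable[Iterable[DeleteRecord]]) -> Dict[str, int]:
    """Read every delete file once and merge its records into one bitmap per data file."""
    bitmaps: Dict[str, int] = defaultdict(int)
    for records in delete_files:
        for path, pos in records:
            bitmaps[path] |= 1 << pos  # set the bit for the deleted position
    return dict(bitmaps)


def is_deleted(bitmaps: Dict[str, int], path: str, pos: int) -> bool:
    """Check whether a row position of a data file is marked as deleted."""
    return (bitmaps.get(path, 0) >> pos) & 1 == 1


# Two delete files with overlapping records, both read exactly once.
delete_file_a = [("DataFile3", 0), ("DataFile3", 7)]
delete_file_b = [("DataFile3", 7), ("DataFile2", 1)]
bitmaps = build_bitmaps([delete_file_a, delete_file_b])
```

A data file with no entry in `bitmaps` has no applicable deletes, so there are no false positives to filter out later.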


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

