Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/20 22:11:13 UTC

[GitHub] [iceberg] jackye1995 commented on pull request #2372: Spark: add position delete row reader

jackye1995 commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-845514501


   Finally got some time to catch up with all the delete work. In general I agree that the delete marker sounds like the right way forward. Junjie described 4 situations for his use cases:
   
   1. Convert all equality deletes to position deletes.
   2. Cluster all position deletes into one.
   3. Convert all equality deletes and position deletes to a single position delete file.
   4. Remove all deletes.
   
   However, these are based on two assumptions:
   
   1. we should always move deletes forward: from equality deletes, to position deletes, to rewritten data files
   2. we should have as few delete files as possible
   
   Neither is 100% true in all situations. For example, against 1: if a table is well partitioned and sorted, and deletes are issued on those partition and sort columns, then equality deletes can actually consume far less memory and perform better. Against 2: having a single delete file means it must be included in every FileScanTask, which may be executed by different workers that cannot share any cache; if we instead split those delete files, each task reads far fewer delete rows. Splitting also removes the bottleneck of reading one file with high parallelism, which causes throttling in cloud storage.
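   
   To make the difference concrete, here is a minimal snippet (not from this PR) sketching how the two delete file types are declared, assuming the `FileMetadata` builder from the v2 row-level delete work; the paths, sizes, and field id are made up for illustration:
   
   ```java
   import org.apache.iceberg.DeleteFile;
   import org.apache.iceberg.FileFormat;
   import org.apache.iceberg.FileMetadata;
   import org.apache.iceberg.PartitionSpec;
   
   PartitionSpec spec = PartitionSpec.unpartitioned();
   
   // Equality delete: matches rows by the value of equality field id 1
   // (say, an "id" column), so every data row in scope is checked against it.
   DeleteFile eqDeletes = FileMetadata.deleteFileBuilder(spec)
       .ofEqualityDeletes(1)
       .withPath("/warehouse/tbl/eq-deletes-0.parquet")
       .withFormat(FileFormat.PARQUET)
       .withFileSizeInBytes(2048L)
       .withRecordCount(100L)
       .build();
   
   // Position delete: matches rows by (file_path, pos), so a reader only
   // needs the positions that point into the data file it is scanning.
   DeleteFile posDeletes = FileMetadata.deleteFileBuilder(spec)
       .ofPositionDeletes()
       .withPath("/warehouse/tbl/pos-deletes-0.parquet")
       .withFormat(FileFormat.PARQUET)
       .withFileSizeInBytes(2048L)
       .withRecordCount(100L)
       .build();
   ```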
   
   For major compaction I think there is no doubt: it is the removal of all delete files, and the RewriteDataFiles work that Russell is doing should cover the major compaction use case.
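   
   For reference, kicking that off through the Spark `Actions` entry point looks roughly like the snippet below; this assumes the action API as of roughly Iceberg 0.11, and the filter and target size are illustrative values only:
   
   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.actions.Actions;
   import org.apache.iceberg.expressions.Expressions;
   
   // given: Table table = catalog.loadTable(...); a v2 table with deletes
   
   // Compact data files for the matching partition; with the ongoing
   // RewriteDataFiles work, this is where applied deletes would be dropped.
   Actions.forTable(table)
       .rewriteDataFiles()
       .filter(Expressions.equal("date", "2021-05-20"))
       .targetSizeInBytes(512 * 1024 * 1024)  // 512 MB output files
       .execute();
   ```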
   
   But I feel everyone has a somewhat similar but different definition of minor compaction. I totally agree with Junjie that we should allow fine-grained control so people can run a flexible set of actions based on their use case. Here are the definitions in my mind:
   
   Major compaction: an action that takes all files in a snapshot and produces only data files
   Minor compaction: an action that takes all files in a snapshot and produces only delete files that are applied on top of the existing data files
   
   It seems to me that we should add an action similar to `RewriteDataFiles` under another action framework, and implement different strategies for that action to fulfill the different use cases described. What do you think?
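   
   To make that concrete, here is a purely hypothetical sketch of the shape such an action could take; `RewriteDeletes` and everything in it are invented names, not an existing Iceberg API, and the strategies simply map onto Junjie's 4 situations:
   
   ```java
   import org.apache.iceberg.expressions.Expression;
   
   // Hypothetical interface mirroring the fluent style of RewriteDataFiles.
   public interface RewriteDeletes {
   
     enum Strategy {
       CONVERT_EQUALITY_TO_POSITION,  // situation 1
       CLUSTER_POSITION_DELETES,      // situation 2
       CONVERT_ALL_TO_POSITION,       // situation 3
       REMOVE_ALL_DELETES             // situation 4, i.e. major compaction
     }
   
     // choose how delete files are rewritten in this run
     RewriteDeletes strategy(Strategy strategy);
   
     // restrict the rewrite to matching partitions/files
     RewriteDeletes filter(Expression expression);
   
     Result execute();
   
     interface Result {
       int rewrittenDeleteFilesCount();
   
       int addedDeleteFilesCount();
     }
   }
   ```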




