Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/30 07:16:49 UTC

[GitHub] [iceberg] openinx commented on pull request #2372: Spark: add position delete row reader

openinx commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-809976337

Originally, I had two kinds of compaction in mind:

a. Convert all equality deletes into position deletes. As for whether we should also eliminate duplicate position deletes at the same time: if duplicate pos-deletes are removed during the rewrite, reads will be more efficient; if not, reads will be less efficient. Generally speaking, I think this is a performance trade-off, and both options seem acceptable to me.

b. Eliminate all deletes (both pos-deletes and equality-deletes). This is well suited to tables where deletes make up a high proportion of the data. On the one hand, we save a lot of unnecessary storage; on the other hand, we avoid a lot of inefficient joins when reading data. [This](https://github.com/apache/iceberg/pull/2303/files#diff-605d0d98a73f67629cddbceb9a566e8655844a3cdf46b4dbcebd0e19102e82b4R128) is simpler to implement than case.a.

After reading @rdblue's [comment](https://github.com/apache/iceberg/pull/2372#issuecomment-809823407), what I find most valuable is that we can use the meta-column abstraction to unify the code for case.a, case.b, and the normal read path. Say we have an `iterable=Iterable<Row>` with an `_is_deleted` flag inside each row:

For case.a, we could just use `Iterables.transform(Iterables.filter(iterable, row -> row.isDeleted()), row -> (row.file(), row.pos()))` to generate all the pos-deletes.

For case.b, we could just use `Iterables.filter(iterable, row -> !row.isDeleted())` to get all remaining rows.

The normal read path is the same as case.b.
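
To make the three paths concrete, here is a minimal, dependency-free sketch. `Row`, `PosDelete`, `toPosDeletes`, and `liveRows` are hypothetical names for illustration only, not Iceberg API, and the sketch uses `java.util.stream` in place of Guava's `Iterables` so that it runs on its own.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class DeleteFlagSketch {

  // Hypothetical row type: source data file, position within that file,
  // and the value of the _is_deleted meta-column.
  static final class Row {
    private final String file;
    private final long pos;
    private final boolean deleted;

    Row(String file, long pos, boolean deleted) {
      this.file = file;
      this.pos = pos;
      this.deleted = deleted;
    }

    String file() { return file; }
    long pos() { return pos; }
    boolean isDeleted() { return deleted; }
  }

  // Hypothetical (file, pos) pair standing in for a position-delete record.
  static final class PosDelete {
    final String file;
    final long pos;

    PosDelete(String file, long pos) {
      this.file = file;
      this.pos = pos;
    }
  }

  // case.a: keep only the deleted rows and project them to (file, pos).
  static List<PosDelete> toPosDeletes(Iterable<Row> rows) {
    return StreamSupport.stream(rows.spliterator(), false)
        .filter(Row::isDeleted)
        .map(row -> new PosDelete(row.file(), row.pos()))
        .collect(Collectors.toList());
  }

  // case.b and the normal read path: keep only the rows that are not deleted.
  static List<Row> liveRows(Iterable<Row> rows) {
    return StreamSupport.stream(rows.spliterator(), false)
        .filter(row -> !row.isDeleted())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("data-1.parquet", 0L, false),
        new Row("data-1.parquet", 1L, true),
        new Row("data-2.parquet", 0L, true));

    System.out.println("pos-deletes: " + toPosDeletes(rows).size());
    System.out.println("live rows: " + liveRows(rows).size());
  }
}
```

Both methods consume the same flagged `Iterable<Row>`; the only difference between the paths is which side of the `_is_deleted` flag they keep.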

This approach greatly reduces the complexity of the various paths, so I think it is worth trying.
