Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/30 00:55:09 UTC

[GitHub] [iceberg] rdblue commented on pull request #2372: Spark: add position delete row reader

rdblue commented on pull request #2372:
URL: https://github.com/apache/iceberg/pull/2372#issuecomment-809823407

@chenjunjiedada, I started reviewing this again, but I think we should reconsider the direction that this is taking.

My initial review comments were based on this change in isolation, which left out position deletes. Adding position deletes is harder: you can't simply union the rows deleted by position with the rows deleted by equality, because a row may have been deleted by both. That happens when a position delete is encoded and is followed by an equality delete that applies to the same data file. You could update this to avoid the duplicates, but I think that would require substantial changes and wouldn't actually get us closer to what you're trying to do.
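
To make the duplicate problem concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg or Spark code: the `(file, pos, id)` row layout, the file name, and the delete sets are all hypothetical.

```python
# Toy model, not Iceberg code: each row is (file, pos, id).
rows = [
    ("data-1.parquet", 0, 10),
    ("data-1.parquet", 1, 11),
    ("data-1.parquet", 2, 12),
]

# Hypothetical deletes: a position delete on the row at pos 1, followed by
# equality deletes on id = 11 (the same row) and id = 12.
position_deletes = {("data-1.parquet", 1)}
equality_deleted_ids = {11, 12}

deleted_by_position = [r for r in rows if (r[0], r[1]) in position_deletes]
deleted_by_equality = [r for r in rows if r[2] in equality_deleted_ids]

# A plain union (concatenation) returns the doubly deleted row twice.
unioned = deleted_by_position + deleted_by_equality
duplicates = [r for r in set(unioned) if unioned.count(r) > 1]
```

Only two rows are deleted here, but `unioned` holds three entries because the row at position 1 matches both kinds of delete.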

If I understand correctly, what you're trying to do is to create a Spark `DataFrame` of deleted rows. That way, you could use Spark to project `_file` and `_pos`, sort it by those fields, and then write the position delete files from the resulting `DataFrame`. That's probably why you didn't consider position-based deletes in the initial PR. Is this correct?
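
As a rough illustration of that projection and sort, again in plain Python rather than Spark, and assuming deleted rows are available as dicts carrying `_file` and `_pos` (the sample values and the `to_position_deletes` helper are hypothetical):

```python
# Hypothetical deleted rows; "_file" and "_pos" stand in for the metadata columns.
deleted_rows = [
    {"_file": "data-2.parquet", "_pos": 7, "id": 42},
    {"_file": "data-1.parquet", "_pos": 5, "id": 17},
    {"_file": "data-1.parquet", "_pos": 1, "id": 11},
]


def to_position_deletes(rows):
    """Project (_file, _pos) from each row and sort by file, then position."""
    return sorted((r["_file"], r["_pos"]) for r in rows)


position_delete_entries = to_position_deletes(deleted_rows)
```

The sorted pairs are what would be handed to a position delete writer; the writing step itself is left out of this sketch.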

If so, I think that the approach should be slightly different. Updating the filter supports the original goal of rewriting equality deletes, but it is strangely specific and doesn't easily support other uses. Instead, I think the way to do this is to select _all_ rows and set a metadata column to indicate whether or not each row is deleted. That's an easy way to guarantee that deleted rows are returned just once, because every row is returned once. The filtering may set the same `_is_deleted` field on a record more than once, but that's okay. Then we can use the resulting `DataFrame` for more operations, like inspecting row-level deletes or producing records for streaming (both inserted and deleted).
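
A minimal sketch of the idea in plain Python, not an Iceberg reader: the `mark_deleted` helper, the row layout, and the sample deletes are hypothetical, and only the `_is_deleted` name comes from the proposal above.

```python
def mark_deleted(rows, position_deletes, equality_deleted_ids):
    """Return every row exactly once, with an added _is_deleted flag."""
    marked = []
    for file, pos, row_id in rows:
        is_deleted = False
        if (file, pos) in position_deletes:
            is_deleted = True
        if row_id in equality_deleted_ids:
            is_deleted = True  # setting the flag a second time is harmless
        marked.append(
            {"_file": file, "_pos": pos, "id": row_id, "_is_deleted": is_deleted}
        )
    return marked


rows = [
    ("data-1.parquet", 0, 10),
    ("data-1.parquet", 1, 11),  # matched by a position and an equality delete
    ("data-1.parquet", 2, 12),
]
marked = mark_deleted(rows, {("data-1.parquet", 1)}, {11, 12})

# Downstream consumers filter on the flag instead of relying on a special reader.
deleted = [r for r in marked if r["_is_deleted"]]
live = [r for r in marked if not r["_is_deleted"]]
```

Because the output has exactly one record per input row, the row matched by both deletes appears once in `deleted`, and the same marked output also serves the "inserted and deleted" streaming case by keeping both halves.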

What do you think?
