Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/02 11:13:49 UTC

[GitHub] [iceberg] aokolnychyi commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

aokolnychyi commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-753461057


   I’ve been thinking about grouping of data on write during copy-on-write operations (merge-on-read is a different story).
   
   Right now, we only have a sort order in the table metadata. However, we will probably add a way to represent distribution since Spark will have such a concept. I think global and local sorts don’t address all use cases. We will want to request hash distribution on write in some cases (it is cheaper than the global sort and works well if the data size per partition is small and does not have to be split into multiple tasks). This applies to inserts as well as to other operations like updates.
   
   Since there will be a concept of distribution controlled by the user, the idea of leveraging both the distribution and sort order during row-level operations seems promising to me.
   
   **DELETE**
   
   Delete is an operation that does not change the order of data, so we should be fine with just the `file` and `pos` metadata columns.
   
   In master, we do a global sort by `file` and `pos`, which is the most expensive option. I think we can switch to hash-partitioning by `file` and a local sort by `file` and `pos`. Yes, a global sort would co-locate files from the same partition next to each other, but I don’t think that is worth the price of the range-based shuffle. I’d be in favor of faster deletes and doing a compaction later instead of doing a global sort during deletes. The global sort won’t eliminate the need for compaction and will make deletes more expensive, which would increase the chances of concurrent conflicts.
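   To make the proposed delete-side clustering concrete, here is a minimal pure-Python sketch of what a hash shuffle by `file` followed by a local sort by (`file`, `pos`) would do (the function name and partition count are illustrative, not actual Iceberg or Spark APIs):

   ```python
   def cluster_for_delete(rows, num_partitions=4):
       """Simulate hash-partitioning surviving rows by `file`, then locally
       sorting each partition by (file, pos), as a cheaper alternative to a
       full range-based (global) sort."""
       partitions = [[] for _ in range(num_partitions)]
       for row in rows:
           # hash-based shuffle: all rows of a given file land in one partition
           partitions[hash(row["file"]) % num_partitions].append(row)
       for part in partitions:
           # local sort keeps each file's rows contiguous and in position order
           part.sort(key=lambda r: (r["file"], r["pos"]))
       return partitions
   ```

   Rows from the same file always end up in a single, position-ordered run, which is what the copy-on-write writer needs in order to rewrite each affected file exactly once.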
   
   In addition, I’d offer a table property specific to copy-on-write deletes to disable the shuffle step. If people want even faster deletes by skipping the shuffle entirely, we should let them do that; they will just have to compact more aggressively.
   
   **UPDATE**
   
   Update is the first operation that potentially changes the order of data. That’s why we should take the distribution and order into account. Our intention here is to group/sort rows that did not change by `file` and `pos` to preserve their original ordering, and to apply the configured distribution and order to updated records. If the user asks for hash-based distribution during inserts, they most likely want to apply it during updates too.
   
   I’d consider the following options:
   - If the user asks for a global sort during inserts, do a range-based shuffle by `file`, `pos`, `if (file is null) sort_col_1 else null`, `if (file is null) sort_col_2 else null` and a local sort by the same attributes.
   - If the user asks for hash partitioning and local sort during inserts, do a hash-based shuffle by `file`, `if (file is null) dist_col_1 else null`, `if (file is null) dist_col_2 else null`, etc., and a local sort by `file`, `pos`, `if (file is null) sort_col_1 else null`, `if (file is null) sort_col_2 else null`, where the `file` and `pos` columns would be `null` for updated records.
   - If the user asks for a local sort during inserts, do a local sort.
   - Add a table property specific to copy-on-write updates to ignore the configured distribution.
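   The conditional expressions above can be sketched as plain functions that compute a row’s shuffle and sort keys (function names are hypothetical; in Spark these would be column expressions along the lines of `when(col("file").isNull(), ...)`):

   ```python
   def update_shuffle_key(row, dist_cols):
       """Hash-shuffle key for a copy-on-write UPDATE with hash distribution:
       unchanged rows cluster by their original file; updated rows (file is
       null) cluster by the configured distribution columns."""
       file = row.get("file")
       return (file,) + tuple(None if file is not None else row[c] for c in dist_cols)

   def update_sort_key(row, sort_cols):
       """Local sort key: unchanged rows keep their (file, pos) order;
       updated rows are ordered by the table's sort columns instead."""
       file = row.get("file")
       if file is not None:
           # unchanged row: preserve original (file, pos) ordering
           return (0, file, row["pos"], ("",) * len(sort_cols))
       # updated row: sorts after unchanged rows, ordered by sort columns
       return (1, "", -1, tuple(row[c] for c in sort_cols))
   ```

   Within a task, unchanged rows come out grouped by file in position order, while updated rows come out in the table’s sort order, matching the second option above.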
   
   **MERGE**
   
   Merge is similar to update, except that it also produces brand-new records. Since new records have no original `file` and `pos` either, we should consider new and updated records together when applying the distribution and sort order.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org