Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/03 11:40:51 UTC

[GitHub] [iceberg] JingsongLi edited a comment on issue #360: Spec: Add column equality delete files

JingsongLi edited a comment on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-653501110


   Thanks for your discussion and summary. Sorry for joining this discussion late. Here is my thinking; correct me if I am wrong:
   IIUC, there are two user-oriented modes for equality deletes:
   
   ## Mode1
   Iceberg as a database: users query the Iceberg table with SQL such as "insert into", "update ... where x=a", "delete from ... where x=a". For equality deletes to be fast, Iceberg should write only a constant number of physical records per statement, regardless of how many stored rows match.
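   As a sketch of the constant-write behavior (the function names here are hypothetical, not Iceberg's actual API): the writer emits a single key-only delete record, and readers apply the predicate when they merge deletes with data files:

```python
# Hypothetical sketch: DELETE FROM t WHERE x = a writes ONE equality-delete
# record no matter how many stored rows match; matching rows are dropped at
# read/merge time, not rewritten at write time.

def plan_equality_delete(x_value):
    """What the writer emits for DELETE FROM t WHERE x = <x_value>."""
    return [{"x": x_value}]  # a single delete record holding only the key field

def apply_equality_deletes(rows, delete_records):
    """What a reader does at scan time: drop rows whose key matches a delete."""
    deleted_keys = {d["x"] for d in delete_records}
    return [r for r in rows if r["x"] not in deleted_keys]

table = [{"x": 1, "v": i} for i in range(1000)] + [{"x": 2, "v": 0}]
deletes = plan_equality_delete(1)                  # constant-size write
remaining = apply_equality_deletes(table, deletes) # 1000 rows removed at read
```

   The write cost stays O(1) per statement; the matching cost moves to the readers.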
   
   What is the solution for mode1?
   - For an Iceberg transaction, write only a change log containing inserts and deletes. This is simple, but merge efficiency looks poor.
   - For an Iceberg transaction, write an inserts file and an equality-deletes file. This does not work: consider the sequence insert(1, 2), delete(1, 2), insert(1, 2). With only an inserts file and a deletes file, it is hard to know that an insert(1, 2) must still be emitted after merging. In other words, we cannot distinguish the order of inserts and deletes within the transaction.
   - For an Iceberg transaction, write an inserts file, an equality-deletes file, and a position-deletes file. For a delete(1, 2), we should write the record to both delete files, because a delete must remove matching records from older files and also remove records written earlier in the same transaction, in order. When merging, older transactions then only need to join against the equality-deletes file.
   The equality-deletes file needs only the key fields. The position-deletes file needs only file_id and position.
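   The third option can be sketched as follows (a toy model, not Iceberg's actual file formats; rows are reduced to their key fields for brevity):

```python
# Hypothetical sketch of merging one transaction that wrote an inserts file,
# an equality-deletes file, and a position-deletes file. Equality deletes
# join against OLDER files only; position deletes remove rows written EARLIER
# in the same transaction, which preserves in-transaction order.

def merge_transaction(old_rows, inserts, eq_delete_keys, pos_deletes):
    # 1. Equality deletes drop matching rows from older data files.
    survivors = [r for r in old_rows if r not in eq_delete_keys]
    # 2. Position deletes drop only the inserts that came before the delete
    #    within this transaction; later inserts survive.
    survivors += [r for pos, r in enumerate(inserts) if pos not in pos_deletes]
    return survivors

# The tricky sequence from the text: insert(1, 2), delete(1, 2), insert(1, 2).
old_rows = [(1, 2)]              # an older file also contains (1, 2)
inserts = [(1, 2), (1, 2)]       # the two inserts, at positions 0 and 1
eq_delete_keys = {(1, 2)}        # delete(1, 2) removes matches in older files...
pos_deletes = {0}                # ...and the insert at position 0 in this txn

result = merge_transaction(old_rows, inserts, eq_delete_keys, pos_deletes)
# Exactly one (1, 2) survives: the insert issued AFTER the delete.
```

   With only an equality-deletes file, both inserts (and the old row) would be dropped; the position-deletes file is what lets the second insert(1, 2) survive.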
   
   ## Mode2
   Iceberg as a CDC receiver: theoretically, each record in a CDC stream should affect exactly one record in the database.
   - With a primary key (unique ID), Iceberg can instead let a CDC record affect all matching records in the database; the semantics are the same.
   - Without a primary key (unique ID), this is hard to implement: storing all fields in memory looks expensive, and it may require a special merging algorithm. I think it may be too demanding.
   We should support both batch reading and streaming reading of the CDC input stream, and also produce a CDC stream for downstream streaming readers.
   
   What is the solution for mode2?
   - For batch reading: the solution can be similar to mode1. As noted above, with a primary key (unique ID), a CDC record can affect all matching records in the database, so the input can be treated as a mode1 stream.
   - For streaming reading: if we want to output a CDC stream, the delete records in that stream must carry all columns, so the equality-deletes files should contain all columns, not just the key fields.
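   A minimal sketch of this requirement (the `+I`/`-D` event labels are borrowed from Flink-style changelogs and are assumptions here, not an Iceberg API):

```python
# Hypothetical sketch: a streaming reader turns one commit's delete/insert
# files into changelog events. A delete event must carry the COMPLETE deleted
# row; a key-only equality-delete file could not reconstruct it.

def to_cdc_events(deleted_rows, inserted_rows):
    """Turn one commit's files into (op, row) changelog events."""
    events = [("-D", row) for row in deleted_rows]   # needs ALL columns
    events += [("+I", row) for row in inserted_rows]
    return events

# If the equality-delete file stored only the key (1,), the full event
# ("-D", (1, "old")) below could not be produced for streaming readers.
events = to_cdc_events(deleted_rows=[(1, "old")], inserted_rows=[(1, "new")])
```

   Batch readers only need the key fields to filter rows out, which is why mode1 can store key-only delete files while streaming CDC output cannot.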


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
