Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/03 12:48:27 UTC

[GitHub] [iceberg] openinx commented on issue #360: Spec: Add column equality delete files

openinx commented on issue #360:
URL: https://github.com/apache/iceberg/issues/360#issuecomment-653532308


   @rdblue, our team also had another discussion about the use cases and solutions. For the use cases, we have three cases: 
   
   * Case.1: the table has a primary key (the unique ID as you said), and all CDC events are INSERT/UPDATE/DELETE operations (the UPDATE event has both old values and new values), such as the classic MySQL binlog events (assuming that t has (id, data) columns and id is the primary key): 
   
   ```sql
   INSERT INTO t SET @1=1 @2='apple';
   DELETE FROM t WHERE @1=1 @2='apple';
   UPDATE t WHERE @1=1 @2='apple' SET @1=1 @2='pear';
   ```
   
   * Case.2: the table has a primary key, and all CDC events are UPSERT/DELETE; the UPSERT doesn't have to provide old values, such as: 
   
   ```sql
   UPSERT(1, 'apple')
   DELETE(1, 'apple')
   DELETE(1) // DELETE doesn't have to provide all the column values; the primary key is enough.
   ```
   
   * Case.3: the table doesn't have a primary key, and all CDC events are INSERT/UPDATE/DELETE (UPDATE should provide both old values and new values). Supporting UPSERT without old values is impossible here.
   
   The three cases should cover the CDC requirements from users, and we have three goals to meet those requirements: 1) fast ingestion; 2) acceptable batch read performance; 3) an equivalent stream, so that we can keep eventual consistency between the source table and the sink table. 
   
   Now, we have three solutions: 
   
   1. The mixed equality-deletes and pos-deletes solution as you proposed. 
   2. The pure positional-deletes solution I described in the document.
   3. The incremental log files solution. 
   
   In summary, solution#2 and solution#3 resolve the before-after ordering for downstream consumers, but both have a few issues with batch read performance. Solution#2 needs to check whether there exists a delete row whose `<file_id, offset>` isn't less than the candidate row's in the INSERT data files; it would use a data structure like `HashMap<primaryKey, TreeSet<T>>` (maintaining those delete markers) to implement the `JOIN` algorithm, as in the sketch below. It doesn't need any `map<primary_key, pos>` when writing, while solution#1 does, but solution#1 is simpler when considering the merge read.
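
   To make that check concrete, here is a minimal Java sketch of the solution#2 `JOIN` structure. The class and method names (`DeleteIndex`, `Position`, `isDeleted`) are illustrative assumptions for this comment, not Iceberg APIs:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.TreeSet;
   
   // Delete markers are indexed by primary key; a candidate INSERT row is dropped when
   // some delete marker for the same key sits at or after the row's <file_id, offset>.
   class DeleteIndex {
     // <file_id, offset>, ordered by file id first and then by offset.
     record Position(long fileId, long offset) implements Comparable<Position> {
       @Override
       public int compareTo(Position o) {
         int cmp = Long.compare(fileId, o.fileId);
         return cmp != 0 ? cmp : Long.compare(offset, o.offset);
       }
     }
   
     // The HashMap<primaryKey, TreeSet<T>> holding the delete markers.
     private final Map<Object, TreeSet<Position>> deleteMarkers = new HashMap<>();
   
     void addDeleteMarker(Object primaryKey, Position pos) {
       deleteMarkers.computeIfAbsent(primaryKey, k -> new TreeSet<>()).add(pos);
     }
   
     // True if there exists a delete marker for this key whose position
     // isn't less than the candidate row's position.
     boolean isDeleted(Object primaryKey, Position rowPos) {
       TreeSet<Position> markers = deleteMarkers.get(primaryKey);
       return markers != null && markers.ceiling(rowPos) != null;
     }
   }
   ```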
   
   Solution#3 needs to do a `JOIN` between the newly written INSERT files, while solution#1 only needs to `JOIN` between the INSERT and DELETE files. 
   
   For solution#1, the problem is: how do we reconstruct the stream of events so that it can be replayed correctly to downstream consumers? Take an example with the following CDC events: 
   
   ```
   t0, INSERT(1,2)
   t1, DELETE(2,2)
   t2, UPSERT(1,3)
   (the first field is the primary key)
   ```
   
   It will generate three files: 
   
   * the INSERT file will have: 
   
   ```
   INSERT(1,2)
   INSERT(1,3)
   ```
   
   * the equality-delete file will have: 
   
   ```
   DELETE(2,2)
   ```
   
   * the positional-delete file will have: 
   
   ```
   <data-file=insert-file-id, position=0, primary_key_column=1>
   ```
   
   NOTICE: we add `primary_key_column=1` to the positional-delete file because, if the event is an `UPSERT`, the DELETE(1,2) should also mask data files that have a smaller sequence number. Keeping the primary key value here avoids searching the original INSERT data file to compare primary keys when doing the `JOIN` between a data file and a delete file.
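
   As an illustration, here is a minimal sketch of a solution#1 writer that reproduces the three files above from the example events, using the `map<primary_key, pos>` mentioned earlier. The class names, and the fallback to an equality delete for keys that were not written in the current transaction, are my assumptions, not Iceberg APIs:
   
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   // Writer sketch for a single transaction: a delete of a key written earlier in the
   // same transaction becomes a positional delete (keeping the primary key), other
   // deletes fall back to equality deletes, and every INSERT/UPSERT appends a row.
   class CdcWriterSketch {
     record Row(int key, int value) {}
   
     final List<Row> insertFile = new ArrayList<>();
     final List<String> equalityDeleteFile = new ArrayList<>();
     final List<String> positionalDeleteFile = new ArrayList<>();
     // The map<primary_key, pos>: position of the latest row written for each key.
     private final Map<Integer, Integer> writtenPos = new HashMap<>();
   
     void insert(Row row) {
       writtenPos.put(row.key(), insertFile.size());
       insertFile.add(row);
     }
   
     void delete(int key, String equalityDeleteRecord) {
       Integer pos = writtenPos.remove(key);
       if (pos != null) {
         // Written in this transaction: mark it by position, and keep the primary
         // key so the delete can also mask older data files by key.
         positionalDeleteFile.add("<data-file=insert-file-id, position=" + pos
             + ", primary_key_column=" + key + ">");
       } else {
         equalityDeleteFile.add(equalityDeleteRecord);
       }
     }
   
     void upsert(Row row) {
       delete(row.key(), "DELETE(" + row.key() + ")");  // UPSERT needs no old values
       insert(row);
     }
   
     public static void main(String[] args) {
       CdcWriterSketch w = new CdcWriterSketch();
       w.insert(new Row(1, 2));          // t0, INSERT(1,2)
       w.delete(2, "DELETE(2,2)");       // t1, DELETE(2,2)
       w.upsert(new Row(1, 3));          // t2, UPSERT(1,3)
       System.out.println(w.insertFile);           // [Row[key=1, value=2], Row[key=1, value=3]]
       System.out.println(w.equalityDeleteFile);   // [DELETE(2,2)]
       System.out.println(w.positionalDeleteFile); // [<data-file=insert-file-id, position=0, primary_key_column=1>]
     }
   }
   ```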
   
   Now let's consider the replay issue again. We can regard the equality-delete file as deleting rows that existed before this transaction, so it should be replayed first. Then we can restore the correct event order by joining the INSERT file with the positional-delete file (see the sketch after the guarantees below). In theory, we can maintain eventual consistency between the source table and the sink table because we guarantee: 
   
   1. Events of the same row will be consumed in exactly the order they were written in the source table.
   2. We don’t guarantee the consumption order of different rows.
   3. If an UPDATE event changes one row into another row, then we don't guarantee the before-after order between the two rows. 
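
   A minimal sketch of that replay on the example above, again with illustrative names and under the reading that a positionally deleted row was inserted and then removed or replaced within the same transaction:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Set;
   
   // Replay the example transaction for a downstream consumer: equality deletes first,
   // then the INSERT file joined with the positional-delete file in position order.
   class ReplaySketch {
     record Row(int key, int value) {}
   
     public static void main(String[] args) {
       List<String> equalityDeletes = List.of("DELETE(2,2)");
       List<Row> insertFile = List.of(new Row(1, 2), new Row(1, 3));
       Set<Integer> deletedPositions = Set.of(0);   // from the positional-delete file
   
       List<String> stream = new ArrayList<>();
       // 1. Equality deletes remove rows that existed before this transaction.
       stream.addAll(equalityDeletes);
       // 2. Walk the INSERT file in write order; a deleted position means the row
       //    was inserted and later removed or replaced within this transaction.
       for (int pos = 0; pos < insertFile.size(); pos++) {
         Row row = insertFile.get(pos);
         stream.add("INSERT(" + row.key() + "," + row.value() + ")");
         if (deletedPositions.contains(pos)) {
           stream.add("DELETE(" + row.key() + "," + row.value() + ")");
         }
       }
       // Replayed stream: DELETE(2,2), INSERT(1,2), DELETE(1,2), INSERT(1,3)
       stream.forEach(System.out::println);
     }
   }
   ```
   
   In this sketch the per-key order for key 1 is INSERT(1,2), DELETE(1,2), INSERT(1,3), which matches the source order and is what guarantee 1 above requires.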
   
   So it seems solution#1 could be a candidate solution for the primary-key case#1 and case#2.

