Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/20 00:50:41 UTC

[GitHub] [iceberg] openinx edited a comment on issue #2610: Flink CDC iceberg table have duplicate rows

openinx edited a comment on issue #2610:
URL: https://github.com/apache/iceberg/issues/2610#issuecomment-844595929


   Are you writing the same keys to an unpartitioned table with parallelism greater than 1 in the Flink job?   Assume there are three operations:
   
   ```
   1.   INSERT key1 value1; 
   2.   DELETE key1 value1 ; 
   3.   INSERT key1 value2 ; 
   ```
   
   Since we have 2 parallel writers for those rows (without shuffling by primary key), it's possible that the first writer receives event 2 (`DELETE key1 value1`) and writes it to the iceberg table, while the second writer receives event 1 and event 3 and writes them.  In that case the `DELETE key1 value1` won't mask event 1 or event 3, because a delete only masks rows with the same key that were written before it, and here the two INSERTs may be written after (or concurrently with) the delete.  So we end up with two duplicate rows for `key1`, with the different values `value1` and `value2`.   The suggested fix is to shuffle by the primary key before writing the rows into the apache iceberg table, because that guarantees rows with the same key are written to the iceberg table in the same order they were produced.
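   To make the ordering argument concrete, here is a minimal sketch in plain Python (not the Flink or Iceberg API) where two "writers" stand in for two parallel sink subtasks, and the routing function decides which writer receives each CDC event:

   ```python
   # The three CDC events from the example above, in production order.
   events = [
       ("INSERT", "key1", "value1"),
       ("DELETE", "key1", "value1"),
       ("INSERT", "key1", "value2"),
   ]

   def distribute(events, route):
       """Send each event to one of two writers according to `route`."""
       writers = {0: [], 1: []}
       for seq, event in enumerate(events):
           writers[route(seq, event)].append(event)
       return writers

   # Without shuffling by key (e.g. round-robin by arrival order), events
   # for the same key are split across writers: the DELETE lands on a
   # different writer than the two INSERTs, so it cannot mask them.
   round_robin = distribute(events, lambda seq, event: seq % 2)

   # Shuffling by primary key (here a deterministic hash of the key
   # column) routes every event for key1 to the same writer, preserving
   # the original INSERT -> DELETE -> INSERT order.
   by_key = distribute(events, lambda seq, event: sum(map(ord, event[1])) % 2)
   ```

   With the round-robin routing, one writer ends up holding both INSERTs and the other holds only the DELETE, which is exactly the interleaving that produces the duplicates; with key-based routing, all three events for `key1` stay on one writer in their original order.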


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org