You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/09 09:12:28 UTC

[GitHub] [iceberg] luoyuxia commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

luoyuxia commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1308443668

   May the reason is that the key in source is `id`, but the primary key in sink `dt,id`. The primary keys for source and sink aren't equal.
   Is possible for the following case?
   ```
   +I['bob', 1, 20221109]
   -U['bob', 1,  20221109]
   +U['bob', 1,  20221110]
   ```
   And then in the upsert iceberg, `+I['bob', 1, 20221109]`, +U['bob', 1,  20221110]  will  be considered as different records since they has different keys.
   
   
   You can try to update you sink to 
   ```
   CREATE TABLE  IF NOT EXISTS iceberg.test_db.test_cdc(
   `username` STRING, 
     `id` int, 
      start_time String,
     `dt` string,
      PRIMARY KEY(`id`)  NOT ENFORCED
   )partitioned by (dt) 
   WITH (
      'write.format.default'='parquet',
      'format-version' = '2' ,
      'write.upsert.enabled'='true',
      'location' = 's3a://xxxxxx/xxx/xxx'
   );
   ```
   and see whether the duplicated records are still in here or not.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org