You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/09 05:13:06 UTC

[GitHub] [iceberg] zhangpengbigdata opened a new issue, #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

zhangpengbigdata opened a new issue, #6153:
URL: https://github.com/apache/iceberg/issues/6153

   ### Query engine
   
   Iceberg 1.0.0 Flink1.13 
   
   ### Question
   
   Hi all,
   I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table.  Could you please help me ?
   
   Iceberg 1.0.0, Flink1.13
   SQL like this:
   ```
   CREATE CATALOG iceberg WITH (
     'type'='iceberg',
     'catalog-type'='hive',
     'uri'='thrift://xxxx:9083',
     'clients'='5',
     'property-version'='1',
     'warehouse'='s3a://xxxx/xxxx/'
   );
   
     CREATE TABLE test_cdc ( 
     `username` varchar(100), 
     `id` int, 
      start_time String,
     PRIMARY KEY (`id`)  NOT ENFORCED
     ) WITH (
       'connector' = 'mysql-cdc',
       'hostname' = 'xxxxxx',
       'port' = '3306',
       'username' = 'xxx',
       'password' = 'xxx',
       'database-name' = 'xxx',
       'table-name' = 'xxx'
     );
   CREATE TABLE  IF NOT EXISTS iceberg.test_db.test_cdc(
   `username` STRING, 
     `id` int, 
      start_time String,
     `dt` string,
      PRIMARY KEY(`dt`, `id`)  NOT ENFORCED
   )partitioned by (dt) 
   WITH (
      'write.format.default'='parquet',
      'format-version' = '2' ,
      'write.upsert.enabled'='true',
      'location' = 's3a://xxxxxx/xxx/xxx'
   );
   INSERT INTO iceberg.test_db.test_cdc
   SELECT 
     `username`,
     `id`, 
   start_time,
     FROM_UNIXTIME(cast(start_time as bigint)  / 1000, 'yyyyMMdd') as dt 
   FROM test_cdc ;
   ```
   Executed flink sql, the result is:
   select count(distinct id), count(id),dt from iceberg.test_db.test_cdc group by dt order by dt
   <img width="579" alt="clipboard" src="https://user-images.githubusercontent.com/116717900/200744200-61199ef0-81c7-4e56-bca9-3e0b1e5aeebe.png">
   Rerun the flink sql,  the result is:
   select count(distinct id), count(id),dt from iceberg.test_db.test_cdc group by dt order by dt
   <img width="590" alt="clipboard2" src="https://user-images.githubusercontent.com/116717900/200744267-7973fcd1-73e9-4796-85fa-2d80092e3214.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] luoyuxia commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by GitBox <gi...@apache.org>.

luoyuxia commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1308443668

   May the reason is that the key in source is `id`, but the primary key in sink `dt,id`. The primary keys for source and sink aren't equal.
   Is possible for the following case?
   ```
   +I['bob', 1, 20221109]
   -U['bob', 1,  20221109]
   +U['bob', 1,  20221110]
   ```
   And then in the upsert iceberg, `+I['bob', 1, 20221109]`, +U['bob', 1,  20221110]  will  be considered as different records since they has different keys.
   
   
   You can try to update you sink to 
   ```
   CREATE TABLE  IF NOT EXISTS iceberg.test_db.test_cdc(
   `username` STRING, 
     `id` int, 
      start_time String,
     `dt` string,
      PRIMARY KEY(`id`)  NOT ENFORCED
   )partitioned by (dt) 
   WITH (
      'write.format.default'='parquet',
      'format-version' = '2' ,
      'write.upsert.enabled'='true',
      'location' = 's3a://xxxxxx/xxx/xxx'
   );
   ```
   and see whether the duplicated records are still in here or not.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] closed issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table 
URL: https://github.com/apache/iceberg/issues/6153


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1558241166

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] github-actions[bot] commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1539217622

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] zhangpengbigdata commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by GitBox <gi...@apache.org>.

zhangpengbigdata commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1308479071

   Think you @luoyuxia for your replay.
   I just confirmed that the problem is caused by Trino which i used for query, because the result is correct when i use Spark to query.
   So this should be a Trino issue? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] luoyuxia commented on issue #6153: I found duplicate records when i was repeatedly exporting records from CDC Stream into iceberg partitioned table

Posted by GitBox <gi...@apache.org>.

luoyuxia commented on issue #6153:
URL: https://github.com/apache/iceberg/issues/6153#issuecomment-1308504155

   I guess it's a problem of trino-iceberg-connector, you can seek for help from the trino community if the connector is maintained by it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org