You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/25 12:25:08 UTC

[GitHub] [iceberg] coolderli opened a new issue #2632: Flink CDC has duplicated data

coolderli opened a new issue #2632:
URL: https://github.com/apache/iceberg/issues/2632


   I was trying to write binlog to an iceberg table with Flink SQL.
   
   This is my iceberg table with a primary key `id`.
   spark-sql> desc extended goods_info_backend_v2;
   **id	bigint	主键**
   gid	bigint	商品ID
   pid	bigint	商品型号ID
   cid1	int	旧一级分类
   cid2	int	旧二级分类
   cid3	int	旧三级分类
   cid4	int	旧四级分类
   ...
   Table Properties	[current-snapshot-id=4044922955807923122,**equality.field.columns=id,format=iceberg/parquet,format.version=2**,read.parquet.vectorization.enabled=true,read.split.target-size=1073741824,write.distribution-mode=hash,write.spark.fanout.enabled=true]
   
   When I select `count(id)`, I got duplicated data.
   ![image](https://user-images.githubusercontent.com/38486782/119494923-6b6c4800-bd94-11eb-92c6-d4d5e129d6d9.png)
   
   So I check the snapshots and chose an `id=1349343` to found which snapshots that data was appended.
   ![image](https://user-images.githubusercontent.com/38486782/119495864-74a9e480-bd95-11eb-8553-9fbeb0026380.png)
   
   ![image](https://user-images.githubusercontent.com/38486782/119495972-91deb300-bd95-11eb-87d8-2ec7a6a6e84b.png)
   
   ![image](https://user-images.githubusercontent.com/38486782/119496056-aae76400-bd95-11eb-8800-26d19ac7d36b.png)
   
   And I found the duplicated data `id=1349343`  was appended in snapshotId= 4839740852915438766.
   In this snapshot, we got 48 added data files and 93 added-delete files.
   
   ```
   spark-sql> select * from iceberg_zjyprc_hadoop.xxx.xxx.snapshots where snapshot_id=4839740852915438766;
   
   2021-05-24 16:50:59.394 4839740852915438766 5342648351401052384 overwrite hdfs://zjyprc-hadoop/user/h_data_platform/datalake/youpin.db/goods_info_backend_v2/metadata/snap-4839740852915438766-1-1f06a123-f35f-4ed4-9cfe-65324b98549d.avro {"added-data-files":"48","added-delete-files":"93","added-equality-deletes":"20506","added-files-size":"4654283","added-position-deletes":"2060","added-records":"10253","changed-partition-count":"16","flink.job-id":"c872284a1caca024b7cb26870e5c8e51","flink.max-committed-checkpoint-id":"234","total-data-files":"80","total-delete-files":"115","total-equality-deletes":"21108","total-files-size":"27224132","total-position-deletes":"2066","total-records":"252220"}
   Time taken: 0.27 seconds, Fetched 1 row(s)
   ```
   
   So I download these data files and delete files, and grep `1349343`, and got the answer:
   Two filed were found in two data files and four fields were found in two delete files.
   The snapshot and manifest files could be found in accessories
   (base) ➜ data grep -r 1349343 ./*
   ./00885:value 217: R:0 D:0 V:1349343
   ./00931:value 190: R:0 D:0 V:1349343
   
   
   (base) ➜ delete grep -r 1349343 ./*
   ./00886:value 433: R:0 D:0 V:1349343
   ./00886:value 434: R:0 D:0 V:1349343
   ./00932:value 379: R:0 D:0 V:1349343
   ./00932:value 380: R:0 D:0 V:1349343
   
   And I think 00885 and 00931 were added in the same commit, but why there are no position-delete files. At the same time, I found 1349343 was in bucket-5, so I check the snapshot files to try to found the position files in bucket-5, but failed to found the related rows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847853988


   @caseylucas I have finished the report, and reopen it. Can you take a look at it for me? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-853489322


   If you have partitioning by 8 buckets, then every snapshot will produce 8 data files (at most... assuming there's data for each output partition / bucket).
   
   So if you took a snapshot, the snapshot would commit the data. And so therefore, your files will be less than 512 mb if the job needs to commit and you don't have 512 * 8 mb (as Flink jobs always commit data on snapshot iirc for correctness guarantees).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-866486058


   `set execution.checkpointing.tolerable-failed-checkpoints=0;` works. 
   Maybe this should be fixed later.  @openinx 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847851791


   I have another question, I found `00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00885.parquet` was only  62kb, and the [write.target-file-size-bytes](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java#L145) is 512MB by default, why it rolls to a new files.
   ![image](https://user-images.githubusercontent.com/38486782/119502519-978bc700-bd9c-11eb-9d42-d8b9529844fe.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847852599


   @openinx  Could you help me take a look at this problem, please? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847843526


   [files.zip](https://github.com/apache/iceberg/files/6539351/files.zip)
   
   There are 11 files in this zip. 
   
   **1 snapshot files:**
   snap-4839740852915438766-1-1f06a123-f35f-4ed4-9cfe-65324b98549d.avro 
   2 manifests
   1f06a123-f35f-4ed4-9cfe-65324b98549d-m0.avro   -- data
   1f06a123-f35f-4ed4-9cfe-65324b98549d-m1.avro   --delete
   
   **2  data files:**
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00885.parquet
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00931.parquet
   
   **2 equality delete files:**
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00886.parquet
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00932.parquet
   
   **2 position delete files:**
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00874.parquet
   data/id_bucket=5/00000-3-5b132013-2f36-4a79-9993-3fed0077e5d2-00967.parquet
   
   As I mentioned, In position delete files, I can not found a record that deletes `id=1349343` and `inventory=139`.
   Is there any problem in my investigation?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847848818


   This is my flink DAG , the parallelism is 1. 
   ![image](https://user-images.githubusercontent.com/38486782/119502068-1a605200-bd9c-11eb-9e39-315f22851461.png)
   
   And my table is a partitioned by bucket(`id`, 8), and there are no errors in snapshots 4839740852915438766.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-853489322


   If you have partitioning by 8 buckets, then every snapshot will produce 8 data files (at most... assuming there's data for each output partition / bucket).
   
   So if you took a snapshot, the snapshot would commit the data. And so therefore, your files will be less than 512 mb if the job needs to commit and you don't have 512 * 8 mb (as Flink jobs always commit data on snapshot iirc for correctness guarantees).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] caseylucas commented on issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
caseylucas commented on issue #2632:
URL: https://github.com/apache/iceberg/issues/2632#issuecomment-847834975


   @coolderli I noticed you closed this issue. Was there no problem? Do you mind sharing what you found?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli closed issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli closed issue #2632:
URL: https://github.com/apache/iceberg/issues/2632


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] coolderli closed issue #2632: Flink CDC has duplicated data

Posted by GitBox <gi...@apache.org>.
coolderli closed issue #2632:
URL: https://github.com/apache/iceberg/issues/2632


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org