Posted to commits@hudi.apache.org by "konwu (Jira)" <ji...@apache.org> on 2022/06/17 09:23:00 UTC

[jira] [Closed] (HUDI-3286) duplicate records when flink task restart with index.bootstrap=true

     [ https://issues.apache.org/jira/browse/HUDI-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

konwu closed HUDI-3286.
-----------------------
    Resolution: Fixed

> duplicate records when flink task restart with index.bootstrap=true
> -------------------------------------------------------------------
>
>                 Key: HUDI-3286
>                 URL: https://issues.apache.org/jira/browse/HUDI-3286
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: flink
>            Reporter: konwu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
>     At our company we use the COW table type and always run Flink with index.bootstrap=true.
> I found some duplicate records after a Flink task restart. Some abnormal log lines:
>  
> ./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:19,016 INFO  org.apache.hudi.sink.partitioner.BucketAssigner              [] - For partitionPath :  Small Files => [SmallFile {location=HoodieRecordLocation {instantTime=20220110110939, fileId=2d1b050f-5610-4c0a-b15c-3c2d5a9affe3}, sizeBytes=41992073}, SmallFile {location=HoodieRecordLocation {instantTime=20220110110939, fileId=3c349304-e012-4915-b59d-a3bfca18c218}, sizeBytes=3658074}]
> ./hadoop-052-096.th.bigdata.ly_28867:2022-01-10 11:30:15,955 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 5.
> ./hadoop-052-096.th.bigdata.ly_28867:2022-01-10 11:30:19,794 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 3.
> ./hadoop-014-044.th.bigdata.ly_42121:2022-01-10 11:30:31,459 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 4.
> ./hadoop-014-044.th.bigdata.ly_42121:2022-01-10 11:30:38,706 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 0.
> ./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:41,592 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 2.
> ./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:47,130 INFO  org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish sending index records, taskId = 1.
>  
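> BucketAssigner starts processing data before the index bootstrap has finished on all tasks: its log line at 11:30:19,016 appears before five of the six "Finish sending index records" messages.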
>  
> This happens because the restarted job reuses the last GlobalAggregate. Adding a suffix to it could avoid this.
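
The toy model below is a sketch of the failure mode described above, not Hudi or Flink code. It assumes that the "all bootstrap tasks finished" state is kept in an aggregate looked up by name, and that this state survives a task restart. The class `BootstrapAggregateSketch`, the methods `reportFinished` and `nameForAttempt`, and the attempt-number suffix are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only. Nothing here is taken from the Hudi or Flink code base.
class BootstrapAggregateSketch {
    // Stands in for named aggregate state that outlives a task restart (assumption).
    private final Map<String, Set<Integer>> aggregates = new HashMap<>();

    // Records that taskId finished sending index records under the given aggregate name.
    // Returns true once every one of the `parallelism` tasks has reported under that name.
    boolean reportFinished(String aggregateName, int taskId, int parallelism) {
        Set<Integer> finished = aggregates.computeIfAbsent(aggregateName, k -> new HashSet<>());
        finished.add(taskId);
        return finished.size() == parallelism;
    }

    // Hypothetical naming scheme: base name plus the restart attempt number as a suffix.
    static String nameForAttempt(String baseName, int attempt) {
        return baseName + "-" + attempt;
    }

    public static void main(String[] args) {
        int parallelism = 6; // matches taskId 0..5 in the log excerpt
        BootstrapAggregateSketch state = new BootstrapAggregateSketch();

        // First run: all six tasks report under the same fixed name.
        for (int task = 0; task < parallelism; task++) {
            state.reportFinished("bootstrap", task, parallelism);
        }

        // After a restart, a single report under the reused name already looks complete.
        boolean staleDone = state.reportFinished("bootstrap", 0, parallelism);

        // With a per-attempt suffix, the restarted run begins from empty state.
        boolean freshDone = state.reportFinished(nameForAttempt("bootstrap", 1), 0, parallelism);

        System.out.println("reused name reports done: " + staleDone);
        System.out.println("suffixed name reports done: " + freshDone);
    }
}
```

In this model the reused name reports completion after the first restarted task checks in, which is the condition under which a downstream operator could start assigning buckets too early. The suffixed name does not report completion until all six tasks of the new attempt have reported.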



--
This message was sent by Atlassian Jira
(v8.20.7#820007)