Posted to commits@hudi.apache.org by "Danny Chen (Jira)" <ji...@apache.org> on 2022/05/10 02:26:00 UTC
[jira] [Resolved] (HUDI-4044) When reading data from flink-hudi to external storage, the result is incorrect
[ https://issues.apache.org/jira/browse/HUDI-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Danny Chen resolved HUDI-4044.
------------------------------
> When reading data from flink-hudi to external storage, the result is incorrect
> ------------------------------------------------------------------------------
>
> Key: HUDI-4044
> URL: https://issues.apache.org/jira/browse/HUDI-4044
> Project: Apache Hudi
> Issue Type: Bug
> Components: flink
> Affects Versions: 0.11.0
> Reporter: yanxiang
> Priority: Major
> Labels: pull-request-available
>
> When reading data from flink-hudi to external storage, the result is incorrect because of concurrency issues:
>
> Here's the case:
>
> There is a split_monitor task that listens for changes on the timeline every N seconds, and there are four split_reader tasks that process the changed data and sink it to external storage:
>
> (1) First, split_monitor observes the Instant1 change; the corresponding fileId is log1. split_monitor distributes the fileId information to split_reader task 1 in Rebalance mode for processing.
>
> (2) Then, split_monitor observes the Instant2 change; the corresponding fileId is again log1 (assuming the changed data has the same primary key). split_monitor distributes the fileId information to split_reader task 2 in Rebalance mode for processing.
>
> (3) split_reader task 1 and split_reader task 2 now process data for the same primary key, and their processing speeds differ. As a result, the order in which records reach external storage is no longer the timeline order: data modified earlier can overwrite data modified later, producing incorrect results.
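The race described in steps (1)-(3) can be seen in a minimal sketch of round-robin ("rebalance") split assignment. This is illustrative stdlib-only Java, not Hudi's actual code; the method and class names are hypothetical.

```java
// Hypothetical sketch: rebalance assigns splits round-robin by arrival
// order, so two consecutive splits for the SAME fileId (log1) land on
// DIFFERENT reader subtasks, which may then race when sinking.
public class RebalanceAssignment {

    // Round-robin assignment: the i-th split goes to subtask i % parallelism,
    // regardless of which fileId the split belongs to.
    public static int assignRoundRobin(int splitIndex, int parallelism) {
        return splitIndex % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Instant1 and Instant2 both produce a split for fileId "log1",
        // but they arrive as split 0 and split 1:
        int readerForInstant1 = assignRoundRobin(0, parallelism); // subtask 0
        int readerForInstant2 = assignRoundRobin(1, parallelism); // subtask 1
        System.out.println("Instant1 -> task " + readerForInstant1
                + ", Instant2 -> task " + readerForInstant2);
    }
}
```

Because the two subtasks run independently, nothing guarantees that subtask 0 finishes writing the Instant1 data before subtask 1 writes the Instant2 data.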
>
>
> Solution:
> After the split_monitor task detects data changes, it distributes them to the split_reader tasks by hashing the fileId, ensuring that all splits for the same fileId are processed by the same split_reader task, which resolves this problem.
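The proposed fix can be sketched as follows. This is a minimal stdlib-only illustration of fileId-hash assignment, not the actual Hudi/Flink implementation; the names are hypothetical.

```java
// Hypothetical sketch: assign each split to a reader subtask by hashing
// its fileId. All splits of one fileId then go to the same subtask and
// are processed sequentially, preserving timeline order per file.
public class FileIdHashAssignment {

    // floorMod keeps the result in [0, parallelism) even when hashCode()
    // is negative.
    public static int assignByFileId(String fileId, int parallelism) {
        return Math.floorMod(fileId.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Splits from Instant1 and Instant2 both carry fileId "log1",
        // so both are routed to the same subtask:
        int t1 = assignByFileId("log1", parallelism);
        int t2 = assignByFileId("log1", parallelism);
        System.out.println("same subtask: " + (t1 == t2)); // always true
    }
}
```

Within a single subtask the splits are consumed in the order they were distributed, so later instants can no longer be overwritten by earlier ones.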
--
This message was sent by Atlassian Jira
(v8.20.7#820007)