Posted to commits@hudi.apache.org by "zhangminglei (Jira)" <ji...@apache.org> on 2021/05/20 15:08:00 UTC
[jira] [Created] (HUDI-1918) Incorrect keyby field would cause serious data skew
zhangminglei created HUDI-1918:
----------------------------------
Summary: Incorrect keyby field would cause serious data skew
Key: HUDI-1918
URL: https://issues.apache.org/jira/browse/HUDI-1918
Project: Apache Hudi
Issue Type: Bug
Components: Flink Integration
Reporter: zhangminglei
Assignee: zhangminglei
In HoodieFlinkStreamer (https://github.com/apache/hudi/blob/master/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java#L92), the stream is keyed with keyBy(HoodieRecord::getPartitionPath). In a real data warehouse the partition path is mostly based on log_date or log_hour, so records arriving around the same time share one partition path and are routed to the same subtask, which causes serious data skew.
We can shuffle the data by record key here instead, just like the pipeline in HoodieTableSink.
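To illustrate the skew, here is a minimal sketch in plain Java with no Flink dependency. It assumes a simplified routing rule (key hash modulo parallelism), which is not Flink's actual key-group assignment. The class name KeySkewDemo, the "id-" record keys, and the single "2021-05-20" partition path are hypothetical and chosen only for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeySkewDemo {

    // Simplified stand-in for keyed routing: hash the key and take it
    // modulo the parallelism. Flink's real assignment uses key groups.
    public static int[] countPerSubtask(List<String> keys, int parallelism) {
        int[] counts = new int[parallelism];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), parallelism)]++;
        }
        return counts;
    }

    // Largest number of records received by any single subtask.
    public static int max(int[] counts) {
        int best = 0;
        for (int c : counts) {
            best = Math.max(best, c);
        }
        return best;
    }

    public static void main(String[] args) {
        int records = 10_000;
        int parallelism = 4;

        List<String> partitionPaths = new ArrayList<>();
        List<String> recordKeys = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            // Hypothetical data: every record of the day lands in one log_date partition.
            partitionPaths.add("2021-05-20");
            // Hypothetical record keys, one distinct key per record.
            recordKeys.add("id-" + i);
        }

        System.out.println("keyed by partition path: "
                + Arrays.toString(countPerSubtask(partitionPaths, parallelism)));
        System.out.println("keyed by record key:     "
                + Arrays.toString(countPerSubtask(recordKeys, parallelism)));
    }
}
```

When keyed by partition path, every record hashes to the same value, so one subtask receives all of them while the others stay idle. When keyed by record key, the records spread across the subtasks.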
--
This message was sent by Atlassian Jira
(v8.3.4#803005)