Posted to commits@hudi.apache.org by "Bhavani Sudha (Jira)" <ji...@apache.org> on 2020/05/17 21:50:00 UTC
[jira] [Updated] (HUDI-724) Parallelize GetSmallFiles For Partitions
[ https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bhavani Sudha updated HUDI-724:
-------------------------------
Fix Version/s: 0.5.3
> Parallelize GetSmallFiles For Partitions
> ----------------------------------------
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Performance, Writer Core
> Reporter: Feichi Feng
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
> Attachments: gap.png, nogapAfterImprovement.png
>
> Original Estimate: 48h
> Time Spent: 40m
> Remaining Estimate: 47h 20m
>
> When writing data, a gap was observed between Spark stages. Tracking down where the time was spent on the Spark driver showed that it goes to the get-small-files operation for partitions.
> When the UpsertPartitioner is created and tries to assign insert records, it uses a plain for-loop to get the list of small files for every partition that the load is going to write data to, and this process is very slow when there are many partitions to go through. While the operation runs on the Spark driver process, all the worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations can be parallelized. The change I made is to pass the JavaSparkContext to the UpsertPartitioner, create an RDD for the partitions, and thereby distribute the get-small-files operations across multiple tasks.
>
> Screenshots are attached for:
> - the gap without the improvement
> - the Spark stage with the improvement (no gap)
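
The idea in the description above (replace a sequential per-partition loop with independent parallel lookups) can be sketched roughly as follows. This is not the Hudi patch: it is a minimal, standard-library-only illustration that uses Java parallel streams as a stand-in for distributing the work through an RDD, and `SmallFilesSketch`, `getSmallFiles`, `sequential`, and `parallel` are hypothetical names invented for the example. In a Spark job the analogous shape would presumably be `JavaSparkContext.parallelize(...)` followed by `mapToPair(...)` and `collectAsMap()`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SmallFilesSketch {

    // Hypothetical stand-in for the per-partition lookup; a real
    // implementation would list files and filter them by size.
    static List<String> getSmallFiles(String partitionPath) {
        return Arrays.asList(partitionPath + "/small-file-1",
                             partitionPath + "/small-file-2");
    }

    // "Before": one lookup at a time in a plain for-loop on a single thread.
    static Map<String, List<String>> sequential(List<String> partitionPaths) {
        Map<String, List<String>> result = new HashMap<>();
        for (String partitionPath : partitionPaths) {
            result.put(partitionPath, getSmallFiles(partitionPath));
        }
        return result;
    }

    // "After": the lookups are independent, so they can run concurrently.
    // distinct() guards against duplicate keys, which toConcurrentMap rejects.
    static Map<String, List<String>> parallel(List<String> partitionPaths) {
        return partitionPaths.parallelStream()
                .distinct()
                .collect(Collectors.toConcurrentMap(
                        Function.identity(),
                        SmallFilesSketch::getSmallFiles));
    }

    public static void main(String[] args) {
        List<String> partitions =
                Arrays.asList("2020/05/15", "2020/05/16", "2020/05/17");
        // Both strategies should produce the same partition -> files mapping.
        System.out.println(sequential(partitions).equals(parallel(partitions)));
    }
}
```

The sketch only works because each lookup depends on nothing but its own partition path, which is the property the description relies on when it says the partitions don't affect each other.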
--
This message was sent by Atlassian Jira
(v8.3.4#803005)