Posted to commits@hudi.apache.org by "Ethan Guo (Jira)" <ji...@apache.org> on 2022/04/01 21:15:00 UTC
[jira] [Commented] (HUDI-3773) Revisit performance of bloom filter writing flow in MDT for large batch ingestion
[ https://issues.apache.org/jira/browse/HUDI-3773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516131#comment-17516131 ]
Ethan Guo commented on HUDI-3773:
---------------------------------
One finding: the parallelism setting "hoodie.bloom.index.parallelism" is ill-configured and misused in this flow. The effective parallelism is always 1, i.e., no parallelization, because {{recordsGenerationParams.getBloomIndexParallelism()}} returns 0, so the Math.min term evaluates to 0 and Math.max clamps the result to 1.
In HoodieTableMetadataUtil.convertMetadataToBloomFilterRecords():
final int parallelism = Math.max(Math.min(allWriteStats.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
HoodieData<HoodieWriteStat> allWriteStatsRDD = context.parallelize(allWriteStats, parallelism);
return allWriteStatsRDD.flatMap(hoodieWriteStat -> { <bloom filter records> });
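To make the arithmetic concrete, here is a minimal standalone sketch that reproduces only the Math.max/Math.min expression from the snippet above. The class name, the method name, and the sample input values are hypothetical and not part of Hudi; the sketch assumes nothing about the rest of the writing flow.

```java
public class BloomParallelismDemo {

    // Same expression as in the snippet above, with both inputs passed in explicitly.
    // writeStatCount stands in for allWriteStats.size();
    // configuredParallelism stands in for recordsGenerationParams.getBloomIndexParallelism().
    static int effectiveParallelism(int writeStatCount, int configuredParallelism) {
        return Math.max(Math.min(writeStatCount, configuredParallelism), 1);
    }

    public static void main(String[] args) {
        // Configured value is 0: min(500, 0) = 0, then max(0, 1) = 1.
        System.out.println(effectiveParallelism(500, 0));

        // Configured value below the number of write stats: the configured value wins.
        System.out.println(effectiveParallelism(500, 200));

        // Configured value above the number of write stats: the write stat count wins.
        System.out.println(effectiveParallelism(50, 200));
    }
}
```

With a configured value of 0, the result is 1 regardless of how many write stats are in the batch, which matches the observation that the bloom filter records are generated without parallelization.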
> Revisit performance of bloom filter writing flow in MDT for large batch ingestion
> ---------------------------------------------------------------------------------
>
> Key: HUDI-3773
> URL: https://issues.apache.org/jira/browse/HUDI-3773
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.11.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)