Posted to commits@hudi.apache.org by "Raymond Xu (Jira)" <ji...@apache.org> on 2022/03/11 12:18:00 UTC

[jira] [Updated] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

     [ https://issues.apache.org/jira/browse/HUDI-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2777:
-----------------------------
    Labels: performance pull-request-available  (was: pull-request-available)

> Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2777
>                 URL: https://issues.apache.org/jira/browse/HUDI-2777
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark
>    Affects Versions: 0.9.0
>         Environment: Hudi 0.9.0, Spark 3.1.1, Hive 3.1.1, Hadoop 3.1.1
>            Reporter: liuhe0702
>            Assignee: liuhe0702
>            Priority: Critical
>              Labels: performance, pull-request-available
>             Fix For: 0.11.0
>
>
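> If an RDD has many partitions and RDD.isEmpty ultimately returns true, Spark does not check all partitions in a single job. It starts a series of jobs, and the cumulative number of partitions scanned grows roughly 5-fold with each job until every partition has been checked. These extra jobs degrade computing performance.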
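
The scan pattern described above can be illustrated without a cluster. The sketch below is a plain-Python model, not Spark or Hudi code. It assumes that RDD.isEmpty is backed by take(1), that the first job scans a single partition, and that each later job tries (partitions scanned so far) x (scale-up factor) further partitions, with an assumed factor of 4 (believed to be the default of spark.rdd.limit.scaleUpFactor). The function names take_scan_schedule and cumulative_scanned are hypothetical.

```python
from itertools import accumulate


def take_scan_schedule(total_partitions, scale_up_factor=4):
    """Model the jobs launched when every partition turns out to be empty.

    Returns a list with the number of partitions scanned by each job.
    Assumption: the first job scans 1 partition; each later job tries
    (partitions scanned so far) * scale_up_factor more, capped at the
    number of partitions that remain.
    """
    jobs = []
    scanned = 0
    while scanned < total_partitions:
        to_try = 1 if scanned == 0 else scanned * scale_up_factor
        batch = min(to_try, total_partitions - scanned)
        jobs.append(batch)
        scanned += batch
    return jobs


def cumulative_scanned(jobs):
    """Running total of partitions scanned after each job."""
    return list(accumulate(jobs))
```

Under these assumptions, a 200-partition RDD whose partitions are all empty needs five jobs scanning 1, 4, 20, 100, and 75 partitions. The running total is 1, 5, 25, 125, 200, which is consistent with the 5-fold growth reported in the description.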



--
This message was sent by Atlassian Jira
(v8.20.1#820001)