Posted to commits@hudi.apache.org by "liuhe0702 (Jira)" <ji...@apache.org> on 2021/11/17 02:15:00 UTC

[jira] [Created] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.

liuhe0702 created HUDI-2777:
-------------------------------

             Summary: Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
                 Key: HUDI-2777
                 URL: https://issues.apache.org/jira/browse/HUDI-2777
             Project: Apache Hudi
          Issue Type: Bug
          Components: Spark Integration
    Affects Versions: 0.9.0
         Environment: Hudi 0.9.0
                      Spark 3.1.1
                      Hive 3.1.1
                      Hadoop 3.1.1
            Reporter: liuhe0702
         Attachments: image-2021-11-17-10-14-29-308.png

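When an RDD has many partitions and RDD.isEmpty ultimately returns true, Spark does not check all partitions in a single job. It scans a small number of partitions first and, because no rows are found, launches further jobs in which the cumulative number of partitions scanned grows about 5-fold each time, until every partition has been checked. As a result, several Spark jobs are started for one emptiness check during the write, and data import performance deteriorates.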
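
The job count can be illustrated with a small stand-alone simulation. This is a sketch, not Spark or Hudi code. It assumes that RDD.isEmpty is evaluated through take(1), that take scans one partition in its first job, and that after an empty result the next job tries (partitions scanned so far) x (scale-up factor) additional partitions. The factor of 4 is assumed to be the default of spark.rdd.limit.scaleUpFactor, and the function name scan_schedule is hypothetical.

```python
def scan_schedule(total_partitions, scale_up_factor=4):
    """Simulate how many partitions each successive job scans
    when every partition of the RDD turns out to be empty.

    Assumption: mirrors the incremental scan of take(1); the
    default factor of 4 is assumed, not taken from the report.
    """
    factor = max(scale_up_factor, 2)  # assumed lower bound of 2 on the factor
    jobs = []      # partitions scanned by each job, in launch order
    scanned = 0    # cumulative partitions scanned so far
    while scanned < total_partitions:
        # First job tries one partition; later jobs scale with what was scanned.
        to_try = 1 if scanned == 0 else scanned * factor
        batch = min(to_try, total_partitions - scanned)  # cap at remaining partitions
        jobs.append(batch)
        scanned += batch
    return jobs


if __name__ == "__main__":
    schedule = scan_schedule(1000)
    print(len(schedule), schedule)
```

Under these assumptions, 1000 empty partitions lead to six jobs, and the cumulative number of partitions scanned is 1, 5, 25, 125, 625, 1000, which matches the 5-fold growth described above.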
