Posted to dev@spark.apache.org by 大啊 <be...@163.com> on 2020/03/07 01:29:17 UTC

Re: Spark-3.0 - performance degradation

Can you provide your configuration information?
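For example, the effective configuration can be dumped from the driver. A minimal spark-shell sketch (`sc` is the pre-created SparkContext):

    // Print every Spark property explicitly set on the running context.
    // Paste into spark-shell, where `sc` is the pre-created SparkContext.
    sc.getConf.getAll.sorted.foreach { case (key, value) =>
      println(s"$key=$value")
    }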



At 2020-02-27 03:49:53, "Peter Rudenko" <pe...@gmail.com> wrote:

We're facing a performance degradation for RDD shuffle jobs in Spark-3.0.
Environment:
Spark-3.0: built from commit ba4212660305c6555ae16b10c6bbaf6114c4d830

Spark-2.4.2: release build (chosen for Scala 2.12; results are the same with Spark-2.4.5)
Spark-terasort: https://github.com/ehiggs/spark-terasort/tree/a240386988a71eeaff1fe25cfd73e527c69fb7b2
Dataset of size 1800 GB, 20 executors, 25 cores per executor.
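For anyone who wants to try reproducing this without the terasort jar, a shuffle-heavy RDD job along these lines should exercise the same scheduler and shuffle machinery (a minimal sketch; the record and partition counts below are placeholders, not our actual dataset):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    // Minimal RDD shuffle job: generate random key/value pairs, then sort by key.
    // sortByKey forces a full shuffle stage, the code path where we see the slowdown.
    object ShuffleRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-repro"))

        val numRecords    = 100000000L // placeholder; scale toward the real dataset size
        val numPartitions = 500        // 20 executors x 25 cores = 500 concurrent tasks

        val sorted = sc
          .parallelize(0L until numRecords, numPartitions)
          .map(i => (Random.nextLong(), i))
          .sortByKey()

        println(s"records after shuffle: ${sorted.count()}") // count forces evaluation
        sc.stop()
      }
    }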


3.0 results: [image attachment not preserved in the plain-text archive]

2.4.2 results: [image attachment not preserved in the plain-text archive]

Event timeline for 3.0 looks very weird: [image attachment not preserved in the plain-text archive]

Compared to 2.4: [image attachment not preserved in the plain-text archive]

Everything ran with default settings. I ran several different workloads of different sizes and with different numbers of executors, but the result is the same. It seems like a scheduling issue in 3.0.


Is anyone else facing the same issue?


Thanks,
Peter Rudenko