Posted to issues@spark.apache.org by "gaoyajun02 (Jira)" <ji...@apache.org> on 2022/01/25 02:53:00 UTC

[jira] [Created] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

gaoyajun02 created SPARK-38010:
----------------------------------

             Summary: Push-based shuffle disabled due to insufficient mergeLocations
                 Key: SPARK-38010
                 URL: https://issues.apache.org/jira/browse/SPARK-38010
             Project: Spark
          Issue Type: Improvement
          Components: Shuffle, Spark Core
    Affects Versions: 3.1.0
            Reporter: gaoyajun02


Currently, shuffle merger locations are chosen from the hosts of active or dead executors.
With dynamic resource allocation enabled, few executors have registered by the time an application submits its first few stages, so there are often not enough locations to satisfy the push-merge requirement, and most shuffles do not benefit from push-based shuffle.
The first few shuffle write stages of a Spark application are generally the stages that read tables or other data sources, and they account for a large amount, and a large proportion, of the shuffled data. Because these stages cannot use push, the end-to-end improvement for Spark applications is very limited.
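
To illustrate the timing problem, here is a minimal Python sketch. It is a simplified model, not Spark code: the registration timeline, the host names, the function names, and the threshold of 5 merger locations are all assumptions chosen for the example.

```python
from typing import List, Set, Tuple

# Hypothetical timeline: (seconds since application start, executor host).
# With dynamic allocation, executors register gradually.
REGISTRATIONS: List[Tuple[float, str]] = [
    (0.0, "host-a"),
    (0.5, "host-b"),
    (3.0, "host-c"),
    (4.0, "host-d"),
    (6.0, "host-e"),
    (6.5, "host-a"),  # a second executor on an already known host
]

# Assumed minimum number of merger locations (illustrative value only).
MIN_MERGERS = 5


def hosts_known_at(registrations: List[Tuple[float, str]], time_s: float) -> Set[str]:
    """Return the distinct hosts whose executors registered by time_s."""
    return {host for t, host in registrations if t <= time_s}


def push_enabled(known_hosts: Set[str], min_mergers: int) -> bool:
    """Push merge is only usable when enough distinct hosts are known."""
    return len(known_hosts) >= min_mergers
```

In this model, a stage submitted at 1 second sees only two hosts and runs without push, while the same check at 10 seconds would see five hosts and pass.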

I have thought of one possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes its local shuffle data, it obtains the merge locations in the push thread.
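
The idea above could be sketched roughly as follows. This is only an illustration in Python under stated assumptions, not a proposed patch: `LazyMergerLocations`, `fetch_candidates`, and `min_needed` are hypothetical names, and the real change would have to live in Spark's scheduler and shuffle writer code.

```python
from typing import Callable, List, Optional


class LazyMergerLocations:
    """Hypothetical holder that resolves merger locations on first use.

    Instead of fixing the locations when the stage is submitted, the push
    thread calls get() after the mapper has written its local shuffle data.
    """

    def __init__(self, fetch_candidates: Callable[[], List[str]], min_needed: int):
        self._fetch_candidates = fetch_candidates  # assumed lookup of current hosts
        self._min_needed = min_needed              # assumed minimum location count
        self._locations: Optional[List[str]] = None

    def get(self) -> Optional[List[str]]:
        """Return the merger locations, or None if too few hosts are known yet."""
        if self._locations is None:
            candidates = self._fetch_candidates()
            if len(candidates) >= self._min_needed:
                # Cache a copy so every later caller sees the same locations.
                self._locations = list(candidates)
        return self._locations
```

A caller that gets `None` would fall back to the regular shuffle path for that map task. Once enough hosts are known, the result is cached, so all later pushes of the same shuffle use the same set of locations.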

I am looking for advice and solutions on this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org