Posted to issues@spark.apache.org by "wuyi (JIRA)" <ji...@apache.org> on 2018/04/18 10:31:00 UTC

[jira] [Created] (SPARK-24011) Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()

wuyi created SPARK-24011:
----------------------------

             Summary: Cache rdd's immediate parent ShuffleDependency to accelerate getShuffleDependencies()
                 Key: SPARK-24011
                 URL: https://issues.apache.org/jira/browse/SPARK-24011
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: wuyi


When creating stages for a job, we need to find each rdd's (except the final rdd's) immediate parent ShuffleDependencies by calling getShuffleDependencies() at least twice: first in getMissingAncestorShuffleDependencies(), and again in getOrCreateParentStages().

So we can cache the result the first time we call getShuffleDependencies() and reuse it on later calls.

This helps reduce the time spent when there are many NarrowDependencies between the rdd and its immediate parent ShuffleDependencies, or when the rdd has many immediate parent ShuffleDependencies.
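
To make the expected saving concrete, here is a minimal sketch in Python of a toy model. It is not Spark's Scala code. The names ToyRDD, Narrow, Shuffle, cached_shuffle_deps and the visits counter are hypothetical and exist only for this illustration. The traversal walks narrow dependencies until it reaches the first shuffle boundary and stores the result on the rdd.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: a name plus its dependencies."""
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)
        self.cached_shuffle_deps = None  # filled in on the first lookup


class Narrow:
    """Hypothetical narrow dependency on a parent rdd."""
    def __init__(self, rdd):
        self.rdd = rdd


class Shuffle:
    """Hypothetical shuffle dependency on a parent rdd."""
    def __init__(self, rdd):
        self.rdd = rdd


visits = 0  # number of rdds visited by traversals, to make the saving visible


def get_shuffle_dependencies(rdd):
    """Return the immediate parent Shuffle deps of rdd, caching the result."""
    global visits
    if rdd.cached_shuffle_deps is not None:
        return rdd.cached_shuffle_deps  # cache hit: no traversal at all
    parents, visited, stack = [], set(), [rdd]
    while stack:
        current = stack.pop()
        if id(current) in visited:
            continue
        visited.add(id(current))
        visits += 1
        for dep in current.deps:
            if isinstance(dep, Shuffle):
                # Stop at the shuffle boundary; do not walk past it.
                if dep not in parents:
                    parents.append(dep)
            else:
                stack.append(dep.rdd)
    rdd.cached_shuffle_deps = parents
    return parents


# Build a chain: source --shuffle--> n0 --narrow--> n1 ... --narrow--> n50
source = ToyRDD("source")
shuffle = Shuffle(source)
rdd = ToyRDD("n0", [shuffle])
for i in range(1, 51):
    rdd = ToyRDD(f"n{i}", [Narrow(rdd)])

first = get_shuffle_dependencies(rdd)   # walks the whole narrow chain
visits_after_first = visits
second = get_shuffle_dependencies(rdd)  # served from the cache
visits_after_second = visits
```

In this toy chain the first call visits all 51 rdds (n50 down to n0), while the second call visits none. Where the cache would actually live in Spark (on the rdd or inside the scheduler) is left open here.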

 

There is an exception for checkpointed rdds. If an rdd is checkpointed, its immediate parent ShuffleDependencies should be set to empty.
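
One possible way to handle the checkpoint case is sketched below in Python, again as a toy model rather than Spark code. The class ToyRDD, its fields, and mark_checkpointed() are hypothetical. The sketch assumes that checkpointing truncates the lineage, so a previously cached result is stale and must be dropped.

```python
class ToyRDD:
    """Hypothetical rdd that caches its immediate parent shuffle deps."""
    def __init__(self, name, shuffle_parents=()):
        self.name = name
        # Stand-in for the result a real traversal would compute.
        self.shuffle_parents = list(shuffle_parents)
        self.is_checkpointed = False
        self._cached = None

    def mark_checkpointed(self):
        # Assumption: after checkpointing, the rdd no longer depends on
        # any upstream shuffle, so the cached value is discarded.
        self.is_checkpointed = True
        self._cached = None

    def immediate_shuffle_parents(self):
        if self.is_checkpointed:
            return []  # checkpointed rdds report no shuffle parents
        if self._cached is None:
            self._cached = list(self.shuffle_parents)
        return self._cached


rdd = ToyRDD("reduced", shuffle_parents=["shuffle-0", "shuffle-1"])
before = rdd.immediate_shuffle_parents()  # cached on first call
rdd.mark_checkpointed()
after = rdd.immediate_shuffle_parents()   # empty once checkpointed
```

Checking the checkpoint flag before consulting the cache keeps a stale entry from being returned even if the reset were missed.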



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
