Posted to dev@spark.apache.org by Madhusudanan Kandasamy <ma...@in.ibm.com> on 2015/09/08 17:00:37 UTC

Question on DAGScheduler.getMissingParentStages()


Hi,

I'm new to Spark and am trying to understand the DAGScheduler code flow. As
I understand it, getMissingParentStages() does redundant work by
re-calculating stage dependencies. When the first stage is created, all of
its dependent/parent stages are recursively calculated and stored in the
stage.parents member. Whenever a stage needs to be submitted,
getMissingParentStages() is called to get the list of all un-computed
parent stages.

I expected getMissingParentStages() to go through stage.parents and check
whether each parent has already been computed. Instead, the function does
another graph traversal starting from stage.rdd, which seems unnecessary.
Is there a specific reason for this design? If not, I would like to
redesign getMissingParentStages() to avoid the graph traversal.
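
To make the comparison concrete, here is a minimal sketch of the two
approaches in Python. It is a toy model, not Spark's actual Scala code: the
Rdd and Stage classes, their fields, and both function names are
hypothetical, and the model only tracks narrow and shuffle dependencies plus
an "available" flag. It may omit details that matter in the real scheduler.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Rdd:
    """Hypothetical stand-in for an RDD node in the lineage graph."""
    name: str
    narrow_deps: list = field(default_factory=list)   # parent Rdds in the same stage
    shuffle_deps: list = field(default_factory=list)  # parent Stages behind a shuffle


@dataclass(eq=False)
class Stage:
    """Hypothetical stand-in for a scheduler stage."""
    name: str
    rdd: Rdd
    parents: list = field(default_factory=list)  # parent stages stored at creation
    available: bool = False                      # True once the stage output exists


def missing_via_parents(stage):
    """Proposed approach: check the already-stored parent list."""
    return [p for p in stage.parents if not p.available]


def missing_via_traversal(stage):
    """Traversal approach: walk the RDD graph again from stage.rdd."""
    missing, visited, stack = [], set(), [stage.rdd]
    while stack:
        rdd = stack.pop()
        if rdd in visited:
            continue
        visited.add(rdd)
        # A shuffle dependency marks a stage boundary: record it if un-computed.
        for parent_stage in rdd.shuffle_deps:
            if not parent_stage.available and parent_stage not in missing:
                missing.append(parent_stage)
        # A narrow dependency stays inside this stage: keep walking.
        stack.extend(rdd.narrow_deps)
    return missing


# Toy lineage: two map stages feed a joined RDD, followed by a narrow step.
map_a = Stage("map_a", Rdd("a"), available=True)
map_b = Stage("map_b", Rdd("b"))
joined = Rdd("joined", shuffle_deps=[map_a, map_b])
final_rdd = Rdd("final", narrow_deps=[joined])
result = Stage("result", final_rdd, parents=[map_a, map_b])

print([s.name for s in missing_via_parents(result)])
print([s.name for s in missing_via_traversal(result)])
```

In this toy model both functions report the same missing stage, which is the
redundancy I am asking about. Whether they always agree in Spark depends on
details that this sketch leaves out.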

Thanks,
Madhu.