Posted to issues@spark.apache.org by "Penglei Shi (Jira)" <ji...@apache.org> on 2022/08/15 09:25:00 UTC

[jira] [Created] (SPARK-40082) DAGScheduler may not schedule new stage when push-based shuffle is enabled

Penglei Shi created SPARK-40082:
-----------------------------------

             Summary: DAGScheduler may not schedule new stage when push-based shuffle is enabled
                 Key: SPARK-40082
                 URL: https://issues.apache.org/jira/browse/SPARK-40082
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 3.1.1
            Reporter: Penglei Shi


When push-based shuffle is enabled and speculative tasks exist, a shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are resubmitted first, and recomputing them takes some time. Before the shuffleMapStage itself is resubmitted, all of its speculative tasks succeed and register their map output, but these task-success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages.

!image-2022-08-15-17-17-08-666.png!

The stage is then resubmitted, but because the speculative tasks have already registered their map output, there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. As a result, this stage's _shuffleMergedFinalized stays false.

!image-2022-08-15-17-17-49-488.png!
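
The sequence described so far can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual DAGScheduler code: the class names `ToyStage` and `ToyScheduler` and all of their methods are hypothetical, and finalization is reduced to flipping a flag. It only shows how a stage can end up with every map output registered while its merge-finalized flag stays false.

```python
class ToyStage:
    """Hypothetical stand-in for a shuffle map stage."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.num_partitions = num_partitions
        self.outputs = set()          # partitions with registered map output
        self.merge_finalized = False  # stands in for _shuffleMergedFinalized

    def is_available(self):
        return len(self.outputs) == self.num_partitions


class ToyScheduler:
    """Hypothetical, heavily simplified scheduler."""

    def __init__(self):
        self.running = set()  # stands in for runningStages

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage)

    def on_task_success(self, stage, partition):
        # The map output is always registered ...
        stage.outputs.add(partition)
        # ... but finalization is only considered for running stages.
        if stage in self.running and stage.is_available():
            stage.merge_finalized = True

    def resubmit(self, stage):
        # Only partitions without registered output need to be recomputed.
        missing = set(range(stage.num_partitions)) - stage.outputs
        if missing:
            self.running.add(stage)
        # With nothing to compute, no later task success can finalize the merge.
        return missing


stage = ToyStage("shuffle-map", num_partitions=2)
sched = ToyScheduler()
sched.running.add(stage)

sched.on_fetch_failed(stage)      # stage is removed from the running set
sched.on_task_success(stage, 0)   # speculative tasks still finish ...
sched.on_task_success(stage, 1)   # ... and register their map output
missing = sched.resubmit(stage)   # resubmission finds nothing to compute
```

In this toy run the stage is fully available and `missing` is empty, yet `stage.merge_finalized` is still false, which mirrors the state described above.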

AQE then submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages, this stage is marked as missing and is resubmitted. However, the next stages are added only after this stage has already finished, so they are never submitted even though the resubmission of this stage has completed.

!image-2022-08-15-17-15-39-992.png!
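
The ordering problem in the last step can also be sketched as a toy model. Again, this is a hedged illustration and not Spark's code: `ToyStage`, `ToyScheduler`, the `waiting` list and every method name are hypothetical, and the assumption that a parent counts as missing while its merge is not finalized is taken from the description above. The point is only that the parent is resubmitted, and finishes immediately, before the child is added to the waiting list.

```python
class ToyStage:
    """Hypothetical stage with a merge-finalized flag."""

    def __init__(self, name, parents=(), missing_partitions=0, merge_finalized=True):
        self.name = name
        self.parents = list(parents)
        self.missing_partitions = missing_partitions
        self.merge_finalized = merge_finalized

    def is_missing(self):
        # Assumption: an unfinalized merge makes the stage count as missing.
        return self.missing_partitions > 0 or not self.merge_finalized


class ToyScheduler:
    """Hypothetical, heavily simplified scheduler."""

    def __init__(self):
        self.waiting = []   # child stages waiting for their parents
        self.launched = []  # names of stages whose tasks were launched

    def submit_stage(self, stage):
        missing = [p for p in stage.parents if p.is_missing()]
        if not missing:
            self.submit_missing_tasks(stage)
        else:
            for parent in missing:
                self.submit_stage(parent)
            # The child is added only after the parents were submitted.
            self.waiting.append(stage)

    def submit_missing_tasks(self, stage):
        if stage.missing_partitions > 0:
            self.launched.append(stage.name)
        else:
            # Nothing to compute: the stage finishes at once and wakes up
            # whichever children are waiting at this moment.
            self.submit_waiting_children(stage)

    def submit_waiting_children(self, parent):
        ready = [s for s in self.waiting if parent in s.parents]
        for child in ready:
            self.waiting.remove(child)
            self.submit_stage(child)


# Parent has all map outputs but its merge was never finalized.
map_stage = ToyStage("shuffle-map", missing_partitions=0, merge_finalized=False)
child = ToyStage("next-stage", parents=[map_stage], missing_partitions=4)

sched = ToyScheduler()
sched.submit_stage(child)
```

After this run nothing has been launched and `next-stage` sits in `sched.waiting` with no remaining event that would wake it up, which corresponds to the hang reported here.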

I have only hit this a few times in my production environment, and it is difficult to reproduce.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
