Posted to issues@spark.apache.org by "Mridul Muralidharan (Jira)" <ji...@apache.org> on 2023/03/22 06:12:00 UTC

[jira] [Resolved] (SPARK-40082) DAGScheduler may not schedule new stage in condition of push-based shuffle enabled

     [ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-40082.
-----------------------------------------
    Fix Version/s: 3.5.0
         Assignee: Fencheng Mei
       Resolution: Fixed

> DAGScheduler may not schedule new stage in condition of push-based shuffle enabled
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40082
>                 URL: https://issues.apache.org/jira/browse/SPARK-40082
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.1.1
>            Reporter: Penglei Shi
>            Assignee: Fencheng Mei
>            Priority: Major
>             Fix For: 3.5.0
>
>         Attachments: missParentStages.png, shuffleMergeFinalized.png, submitMissingTasks.png
>
>
> When push-based shuffle is enabled and speculative tasks exist, a shuffleMapStage is marked for resubmission as soon as a fetchFailed occurs. Its parent stages are resubmitted first, and recomputing them takes some time. Before the shuffleMapStage itself is resubmitted, all of its speculative tasks succeed and register their map output, but these task-success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages.
> The stage is then resubmitted, but since the speculative tasks have already registered the map output, there are no missing tasks to compute, so the resubmission does not trigger shuffleMergeFinalized either. As a result, the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages, this stage is marked as missing and is resubmitted. However, the next stages are added to waitingStages only after this stage has already finished, so they are never submitted even though the resubmitted stage has completed.
> I have hit this only a few times in my production environment, and it is difficult to reproduce.
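
The sequence described above can be sketched with a small toy model. This is not Spark's DAGScheduler code. It is a simplified Python simulation whose class and method names (Stage, ToyScheduler, submit_stage, and so on) are hypothetical and only loosely follow the identifiers mentioned in the report (runningStages, waitingStages, getMissingParentStages, _shuffleMergedFinalized). It assumes that a stage with no missing tasks finishes immediately on submission and that a parent only counts as available once its merge is finalized.

```python
class Stage:
    """Toy stand-in for a shuffle map stage (hypothetical, not Spark's class)."""

    def __init__(self, name, num_partitions, parents=()):
        self.name = name
        self.num_partitions = num_partitions
        self.parents = list(parents)
        self.map_outputs = set()      # partitions with registered map output
        self.merge_finalized = False  # stands in for _shuffleMergedFinalized

    def missing_partitions(self):
        return [p for p in range(self.num_partitions) if p not in self.map_outputs]

    def is_available(self):
        # Assumption: with push-based shuffle, complete map output is not
        # enough; the shuffle merge must also be finalized.
        return not self.missing_partitions() and self.merge_finalized


class ToyScheduler:
    def __init__(self):
        self.running = set()  # stands in for runningStages
        self.waiting = set()  # stands in for waitingStages

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage)

    def on_task_success(self, stage, partition):
        stage.map_outputs.add(partition)
        # Merge finalization is only triggered for stages that are running.
        if stage in self.running and not stage.missing_partitions():
            stage.merge_finalized = True
            self.running.discard(stage)
            self.submit_waiting_children(stage)

    def submit_stage(self, stage):
        if stage in self.waiting or stage in self.running:
            return
        missing = [p for p in stage.parents if not p.is_available()]
        if not missing:
            self.submit_missing_tasks(stage)
        else:
            for parent in missing:
                self.submit_stage(parent)  # may finish before the next line runs
            self.waiting.add(stage)

    def submit_missing_tasks(self, stage):
        self.running.add(stage)
        if not stage.missing_partitions():
            # Nothing to compute: the stage finishes at once and merge
            # finalization is never triggered on this path.
            self.running.discard(stage)
            self.submit_waiting_children(stage)

    def submit_waiting_children(self, parent):
        for child in [s for s in self.waiting if parent in s.parents]:
            self.waiting.discard(child)
            self.submit_stage(child)


def simulate():
    map_stage = Stage("shuffleMapStage", num_partitions=2)
    next_stage = Stage("nextStage", num_partitions=1, parents=[map_stage])
    sched = ToyScheduler()

    sched.submit_stage(map_stage)        # stage runs, tasks are in flight
    sched.on_fetch_failed(map_stage)     # stage removed from the running set
    sched.on_task_success(map_stage, 0)  # speculative tasks still succeed...
    sched.on_task_success(map_stage, 1)  # ...but cannot finalize the merge
    sched.submit_stage(map_stage)        # resubmission: no missing tasks
    sched.submit_stage(next_stage)       # AQE submits the dependent stage
    return sched, map_stage, next_stage


if __name__ == "__main__":
    sched, map_stage, next_stage = simulate()
    print("merge finalized:", map_stage.merge_finalized)
    print("running:", sorted(s.name for s in sched.running))
    print("waiting:", sorted(s.name for s in sched.waiting))
```

In this model the dependent stage ends up in the waiting set while nothing is running, so no later event can release it, which corresponds to the hang described in the report.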


