You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:15:08 UTC

[jira] [Resolved] (SPARK-14649) DagScheduler re-starts all running tasks on fetch failure

     [ https://issues.apache.org/jira/browse/SPARK-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-14649.
----------------------------------
    Resolution: Incomplete

> DagScheduler re-starts all running tasks on fetch failure
> ---------------------------------------------------------
>
>                 Key: SPARK-14649
>                 URL: https://issues.apache.org/jira/browse/SPARK-14649
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Sital Kedia
>            Priority: Major
>              Labels: bulk-closed
>
> When a fetch failure occurs, the DAGScheduler re-launches the previous stage (to re-generate output that was missing), and then re-launches all tasks in the stage with the fetch failure that hadn't *completed* when the fetch failure occurred (the DAGScheduler re-lanches all of the tasks whose output data is not available -- which is equivalent to the set of tasks that hadn't yet completed).
> The assumption when this code was originally written was that when a fetch failure occurred, the output from at least one of the tasks in the previous stage was no longer available, so all of the tasks in the current stage would eventually fail due to not being able to access that output.  This assumption does not hold for some large-scale, long-running workloads.  E.g., there's one use case where a job has ~100k tasks that each run for about 1 hour, and only the first 5-10 minutes are spent fetching data.  Because of the large number of tasks, it's very common to see a few tasks fail in the fetch phase, and it's wasteful to re-run other tasks that had finished fetching data so aren't affected by the fetch failure (and may be most of the way through their hour-long execution).  The DAGScheduler should not re-start these tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org