Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2015/09/01 17:25:46 UTC

[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

    [ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725552#comment-14725552 ] 

Imran Rashid commented on SPARK-2666:
-------------------------------------

I'm copying [~kayousterhout]'s comment from the PR here for discussion:

bq. My understanding is that it can help to let the remaining tasks run -- because they may hit Fetch failures from different map outputs than the original fetch failure, which will lead the DAGScheduler to more quickly reschedule all of the failed tasks. For example, if an executor failed and had multiple map outputs on it, the first Fetch failure will only tell us about one of the map outputs being missing, and it's helpful to learn about all of them before we resubmit the earlier stage. Did you already think about this / am I misunderstanding the issue?

Things may have changed in the meantime, but I'm pretty sure that now, when there is a fetch failure, Spark assumes it has lost *all* of the map output for that host.  It's a bit confusing -- it seems we first only remove [the one map output with the failure|https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134] but then we remove all map outputs in [{{handleExecutorLost}} | https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184].  I suppose it could still be useful to run the remaining tasks, as they may discover *another* executor that has died, but I don't think it's worth it just for that, right?
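
To make the two steps concrete, here's a rough sketch of that cleanup sequence -- illustrative names only ({{MapOutputsSketch}}, {{unregisterOne}}, {{removeAllOnExecutor}}), not the actual DAGScheduler / MapOutputTracker code:

{code:scala}
// Simplified model of the two-step cleanup described above; the names are
// hypothetical stand-ins for the real tracker/stage bookkeeping.
import scala.collection.mutable

case class Loc(execId: String, host: String)

class MapOutputsSketch {
  // (shuffleId, mapId) -> location serving that map output
  private val outputs = mutable.Map[(Int, Int), Loc]()

  def register(shuffleId: Int, mapId: Int, loc: Loc): Unit =
    outputs((shuffleId, mapId)) = loc

  // Step 1: a FetchFailed only names one (shuffleId, mapId), so only that
  // single entry is dropped here.
  def unregisterOne(shuffleId: Int, mapId: Int): Unit =
    outputs.remove((shuffleId, mapId))

  // Step 2: treating the executor as lost then drops *every* output it hosted,
  // which is why letting the remaining tasks run adds little new information.
  def removeAllOnExecutor(execId: String): Unit = {
    val lost = outputs.collect { case (key, loc) if loc.execId == execId => key }
    lost.foreach(outputs.remove)
  }
}
{code}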

Elsewhere we've also discussed always killing all tasks as soon as the {{TaskSetManager}} is marked as a zombie; see https://github.com/squito/spark/pull/4.

I'm particularly interested because this is relevant to SPARK-10370.  In that case, there wouldn't be any benefit to leaving tasks running after marking the stage as zombie.  If we do want to cancel all tasks as soon as we mark a stage as zombie, then I'd prefer we go the route of making {{isZombie}} private and making task cancellation part of {{markAsZombie}}, so the code is easier to follow and we always cancel tasks.
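
Roughly what I have in mind -- again just a sketch, not the real {{TaskSetManager}}; {{SchedulerBackendLike}}, {{recordRunning}} and the method names are illustrative stand-ins:

{code:scala}
// isZombie becomes private and every path that zombies the task set goes
// through markAsZombie, which also kills anything still running.
import scala.collection.mutable

trait SchedulerBackendLike {
  def killTask(taskId: Long, interruptThread: Boolean): Unit
}

class TaskSetManagerSketch(backend: SchedulerBackendLike) {
  private var isZombie = false
  private val runningTaskIds = mutable.HashSet[Long]()

  def recordRunning(taskId: Long): Unit = runningTaskIds += taskId

  // Folding cancellation into the single entry point guarantees no code path
  // can mark the task set as a zombie while leaving its tasks running.
  def markAsZombie(): Unit = {
    if (!isZombie) {
      isZombie = true
      runningTaskIds.foreach(tid => backend.killTask(tid, interruptThread = true))
      runningTaskIds.clear()
    }
  }
}
{code}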

Is my understanding correct?  Other opinions on the right approach here?

> when task is FetchFailed cancel running tasks of failedStage
> ------------------------------------------------------------
>
>                 Key: SPARK-2666
>                 URL: https://issues.apache.org/jira/browse/SPARK-2666
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Lianhui Wang
>
> In DAGScheduler's handleTaskCompletion, when the reason for a failed task is FetchFailed, cancel the running tasks of the failed stage before adding failedStage to the failedStages queue.
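
A minimal sketch of the ordering the description asks for -- illustrative names ({{StageSketch}}, the {{cancelTask}} callback), not the real {{handleTaskCompletion}}:

{code:scala}
// On a FetchFailed, cancel whatever is still running in the failed stage
// *before* queueing the stage for resubmission.
import scala.collection.mutable

case class StageSketch(id: Int, runningTaskIds: mutable.Set[Long])

class FetchFailedHandlingSketch(cancelTask: Long => Unit) {
  private val failedStages = mutable.Queue[StageSketch]()

  def onFetchFailed(failedStage: StageSketch): Unit = {
    // 1. Kill the stage's remaining tasks first so they free executors immediately.
    failedStage.runningTaskIds.foreach(cancelTask)
    failedStage.runningTaskIds.clear()
    // 2. Only then add the stage to the failed-stages queue for resubmission.
    failedStages.enqueue(failedStage)
  }
}
{code}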


