Posted to issues@spark.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2014/11/06 18:34:34 UTC

[jira] [Resolved] (SPARK-644) Jobs canceled due to repeated executor failures may hang

     [ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-644.
---------------------------------
    Resolution: Fixed

> Jobs canceled due to repeated executor failures may hang
> --------------------------------------------------------
>
>                 Key: SPARK-644
>                 URL: https://issues.apache.org/jira/browse/SPARK-644
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.1
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> To prevent an infinite loop, the standalone master aborts jobs that experience more than 10 executor failures (see https://github.com/mesos/spark/pull/210).  Currently, the master crashes when aborting a job (this is the issue that uncovered SPARK-643).  Fixing the crash involves removing a {{throw}} from the actor's {{receive}} method, but once the crash is fixed, these failures can lead to a hang: the job is removed from the master's scheduler, but the upstream scheduler components aren't notified of the failure and keep waiting for the job to finish.
> I've considered fixing this by adding callbacks that propagate the failure to the higher-level schedulers.  It might be cleaner to move the decision to abort the job into the higher-level layers of the scheduler, which would then send an {{AbortJob(jobId)}} message to the Master.  The Client is already notified of executor state changes, so it may be able to make the decision to abort (or defer that decision to a higher layer).
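
The second proposal in the description above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not Spark's actual code: the `Master` and `Client` classes, the `executor_failed` and `on_job_failed` names, and the `MAX_EXECUTOR_FAILURES` constant are all hypothetical stand-ins. Only the `AbortJob(jobId)` message and the "more than 10 failures" threshold come from the issue text. The point is the ordering: the layer that decides to abort notifies the upstream scheduler itself, so nothing is left waiting on a job the master has already dropped.

```python
from dataclasses import dataclass

# Threshold taken from the issue text: abort after MORE than 10 executor failures.
MAX_EXECUTOR_FAILURES = 10


@dataclass(frozen=True)
class AbortJob:
    """Hypothetical message asking the master to drop a job."""
    job_id: str


class Master:
    """Stand-in for the standalone master. It no longer decides to abort;
    it only removes a job when it receives an AbortJob message."""

    def __init__(self):
        self.jobs = set()

    def register(self, job_id):
        self.jobs.add(job_id)

    def receive(self, message):
        if isinstance(message, AbortJob):
            # Remove the job without raising, so the message handler
            # itself cannot crash the master.
            self.jobs.discard(message.job_id)


class Client:
    """Stand-in for the layer that already sees executor state changes.
    It counts failures, decides when to abort, and tells both sides."""

    def __init__(self, master, on_job_failed):
        self.master = master
        self.on_job_failed = on_job_failed  # callback into the upstream scheduler
        self.failures = {}
        self.aborted = set()

    def executor_failed(self, job_id):
        if job_id in self.aborted:
            return  # already aborted; do not notify twice
        self.failures[job_id] = self.failures.get(job_id, 0) + 1
        if self.failures[job_id] > MAX_EXECUTOR_FAILURES:
            self.aborted.add(job_id)
            # Notify upstream first so no component keeps waiting on the job,
            # then ask the master to drop it.
            self.on_job_failed(job_id)
            self.master.receive(AbortJob(job_id))
```

In this sketch the abort decision and the failure notification live in the same place, which is the property the description argues for. Whether the real Client or an even higher scheduler layer should own that decision is left open, as in the original text.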



