Posted to issues@spark.apache.org by "Mridul Muralidharan (JIRA)" <ji...@apache.org> on 2015/06/10 22:33:00 UTC

[jira] [Commented] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN

    [ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581038#comment-14581038 ] 

Mridul Muralidharan commented on SPARK-8297:
--------------------------------------------

Spark on Mesos handles this situation by calling removeExecutor() on the scheduler backend; the YARN module does not.
I have this fixed locally, but unfortunately I do not have the bandwidth to shepherd a patch.

The fix is simple: replicate something similar to what is done in CoarseMesosSchedulerBackend.slaveLost().
Essentially:
a) maintain a mapping from container id to executor id in YarnAllocator (consistent with, and the inverse of, executorIdToContainer),
b) propagate the scheduler backend to YarnAllocator when YarnClusterScheduler.postCommitHook is called,
c) in processCompletedContainers, if the container is not in releasedContainers, invoke backend.removeExecutor(executorId, msg) to notify the backend that the executor did not exit gracefully/as expected,
d) remove the mappings from containerIdToExecutorId and executorIdToContainer in processCompletedContainers (the latter also fixes a memory leak in YarnAllocator, by the way).
A rough sketch of these steps is included below.
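
For illustration, here is a minimal, self-contained sketch of what steps (a) to (d) could look like. The SchedulerBackendStub trait, the ContainerId/ContainerStatus case classes and the YarnAllocatorSketch class are simplified stand-ins invented for this sketch, not the actual Spark or YARN types; the real change would live inside YarnAllocator itself.

import scala.collection.mutable

// Stand-in for the scheduler backend method the allocator needs to call (hypothetical).
trait SchedulerBackendStub {
  def removeExecutor(executorId: String, reason: String): Unit
}

// Minimal stand-ins for the YARN container types, for this sketch only.
case class ContainerId(value: String)
case class ContainerStatus(containerId: ContainerId, exitStatus: Int)

class YarnAllocatorSketch(backend: SchedulerBackendStub) {
  // (a) keep both directions: executor id -> container and container -> executor id
  private val executorIdToContainer = mutable.HashMap[String, ContainerId]()
  private val containerIdToExecutorId = mutable.HashMap[ContainerId, String]()
  // containers Spark itself asked YARN to release (expected exits)
  private val releasedContainers = mutable.HashSet[ContainerId]()

  def registerExecutor(executorId: String, containerId: ContainerId): Unit = {
    executorIdToContainer(executorId) = containerId
    containerIdToExecutorId(containerId) = executorId
  }

  def releaseContainer(containerId: ContainerId): Unit = {
    releasedContainers += containerId
  }

  // (c) + (d): drop both mappings and, for unexpected exits, tell the backend.
  def processCompletedContainers(statuses: Seq[ContainerStatus]): Unit = {
    for (status <- statuses) {
      val containerId = status.containerId
      val executorIdOpt = containerIdToExecutorId.remove(containerId)
      executorIdOpt.foreach(id => executorIdToContainer.remove(id))  // also fixes the leak noted in (d)
      if (!releasedContainers.remove(containerId)) {
        // The container exited without Spark asking for it (e.g. the node died):
        // notify the scheduler backend so it can clean up tasks and shuffle state.
        executorIdOpt.foreach { executorId =>
          backend.removeExecutor(executorId,
            s"Container ${containerId.value} exited unexpectedly with status ${status.exitStatus}")
        }
      }
    }
  }
}

In the real code, step (b) would hand the backend reference to YarnAllocator once the scheduler has started, so that processCompletedContainers has something to notify.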


In case no one picks this up, I can fix it later in the 1.5 release cycle.

> Scheduler backend is not notified in case node fails in YARN
> ------------------------------------------------------------
>
>                 Key: SPARK-8297
>                 URL: https://issues.apache.org/jira/browse/SPARK-8297
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.2, 1.3.1, 1.4.1
>         Environment: Spark on YARN - both client and cluster mode.
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>
> When a node crashes, YARN detects the failure and notifies Spark, but this information is not propagated to the scheduler backend (unlike in Mesos mode, for example).
> This results in repeated re-execution of stages (due to FetchFailedException on the shuffle read side), eventually causing the application to fail.


