Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/06/29 13:07:04 UTC

[jira] [Assigned] (SPARK-8374) Job frequently hangs after YARN preemption

     [ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8374:
-----------------------------------

    Assignee: Apache Spark

> Job frequently hangs after YARN preemption
> ------------------------------------------
>
>                 Key: SPARK-8374
>                 URL: https://issues.apache.org/jira/browse/SPARK-8374
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.4.0
>         Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
>            Reporter: Shay Rojansky
>            Assignee: Apache Spark
>            Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that get preempted will frequently fail to reacquire executors and therefore hang. To reproduce:
> 1. Run Spark job A so that it acquires all grid resources.
> 2. Run Spark job B in a higher-priority queue so that it acquires all grid resources; job A is fully preempted.
> 3. Kill job B, releasing all resources.
> 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario triggers the bad behavior within a few attempts.
> (See the logs at the bottom.)
> Note that SPARK-7451 was supposed to fix some Spark-on-YARN preemption issues; the work there may be related to this one.
> The preemption situation in 1.4.0 is considerably worse than in 1.3.1 (we've downgraded to 1.3.1 because of this issue alone).
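> The reproduction can be sketched roughly as follows (queue names, jar names, and the application id are placeholders; this assumes preemption is enabled for the higher-priority queue in the YARN scheduler configuration):
> {noformat}
> # Step 1: job A fills the grid from the normal queue
> spark-submit --master yarn-client --queue default job-a.jar
>
> # Step 2: job B in the higher-priority queue; YARN preempts all of job A's executors
> spark-submit --master yarn-client --queue high job-b.jar
>
> # Step 3: kill job B, releasing all grid resources back to job A
> yarn application -kill <application-id-of-job-B>
> {noformat}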
> Logs
> ------
> When job B (the preemptor) first acquires an application master, job A (the preemptee) logs the following:
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It seems wrong that plain preemption is logged as errors/warnings.)
> Later, when job B's AM starts requesting its resources, job A logs many lines like the following:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Then, when I kill job B, job A emits many lines like the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And finally, after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
> {noformat}
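> The 120000 ms figure above is the default heartbeat timeout; as far as I can tell, it is governed by these settings (1.4.0 defaults shown — raising them would only delay the symptom, not fix the hang):
> {noformat}
> # spark-defaults.conf
> spark.executor.heartbeatInterval   10s
> spark.network.timeout              120s
> {noformat}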
> At this point the job never requests/acquires more resources and hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org