Posted to issues@spark.apache.org by "Shay Rojansky (JIRA)" <ji...@apache.org> on 2015/06/15 13:16:00 UTC

[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption

Shay Rojansky created SPARK-8374:
------------------------------------

             Summary: Job frequently hangs after YARN preemption
                 Key: SPARK-8374
                 URL: https://issues.apache.org/jira/browse/SPARK-8374
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.4.0
         Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
            Reporter: Shay Rojansky
            Priority: Critical


After upgrading to Spark 1.4.0, a job that gets preempted frequently fails to reacquire its executors and therefore hangs. To reproduce:

1. Run Spark job A so that it acquires all grid resources.
2. Run Spark job B in a higher-priority queue so that it acquires all grid resources; job A is fully preempted.
3. Kill job B, releasing all resources.
4. Job A should at this point reacquire all grid resources, but occasionally it doesn't. Repeating the preemption scenario triggers the bad behavior within a few attempts.

(see logs at bottom).
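Roughly, jobs A and B in steps 1 and 2 have the following shape. This is only a minimal sketch: the queue names, parallelism and sleep times are illustrative, not our actual workload.

{code:scala}
// Minimal sketch of the jobs used in steps 1 and 2 (illustrative only;
// queue names, parallelism and sleep times are not our actual workload).
import org.apache.spark.{SparkConf, SparkContext}

object GridFiller {
  def main(args: Array[String]): Unit = {
    // Job A goes to the normal queue, job B to the higher-priority queue
    // (the queue names passed in here are hypothetical).
    val queue = if (args.nonEmpty) args(0) else "default"

    val conf = new SparkConf()
      .setAppName(s"grid-filler-$queue")
      .set("spark.yarn.queue", queue)
    val sc = new SparkContext(conf)

    // Many long-running tasks, enough to occupy every executor on the grid.
    sc.parallelize(1 to 100000, 10000)
      .map { i => Thread.sleep(60 * 1000); i }
      .count()

    sc.stop()
  }
}
{code}

Both jobs run in yarn-client mode (as the YarnClientSchedulerBackend lines below show).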

Note that SPARK-7451 was supposed to fix some Spark-on-YARN preemption issues; the work there may be related to this new issue.

Preemption behavior in 1.4.0 is considerably worse than in 1.3.1 (we have downgraded to 1.3.1 just because of this issue).

Logs
------
When job B (the preemptor) first acquires an application master, the following is logged by job A (the preemptee):

{noformat}
ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
INFO DAGScheduler: Executor lost: 447 (epoch 0)
INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
{noformat}

(It seems odd that ordinary preemption is logged as errors/warnings.)

Later, when job B's AM starts requesting its resources, I get lots of the following in job A:

{noformat}
ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
{noformat}

Finally, when I kill job B, job A emits lots of the following:

{noformat}
INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
{noformat}

Then, after some time:

{noformat}
WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 120000 ms
ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
{noformat}

At this point the job never requests/acquires more resources and hangs.
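For what it's worth, the hang is easy to observe from job A's driver. The polling loop below is just a diagnostic sketch (not part of the actual job): after job B is killed the executor count should climb back toward the original allocation, but it never recovers.

{code:scala}
// Diagnostic sketch (not part of the actual job): poll the executors
// currently registered with the driver.
while (true) {
  // getExecutorMemoryStatus also contains the driver's own block manager,
  // hence the "- 1".
  val executorCount = sc.getExecutorMemoryStatus.size - 1
  println(s"Registered executors: $executorCount")
  Thread.sleep(10 * 1000)
}
{code}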


