Posted to issues@spark.apache.org by "Adrian Tanase (JIRA)" <ji...@apache.org> on 2015/09/24 11:00:11 UTC
[jira] [Created] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart
Adrian Tanase created SPARK-10792:
-------------------------------------
Summary: Spark streaming + YARN – executor is not re-created on machine restart
Key: SPARK-10792
URL: https://issues.apache.org/jira/browse/SPARK-10792
Project: Spark
Issue Type: Bug
Components: Streaming, YARN
Affects Versions: 1.4.0
Environment: - centos7 deployed on AWS
- yarn / hadoop 2.6.0-cdh5.4.2
- spark 1.4.0 compiled with hadoop 2.6
Reporter: Adrian Tanase
We’re using Spark Streaming (1.4.0), deployed on AWS through YARN. It’s a stateful app that reads from Kafka (with the new direct API) and checkpoints to HDFS.
During some resilience testing, we restarted one of the machines and brought it back online. While it was offline, the YARN cluster did not have the resources to re-create the missing executor.
After starting all the services on the machine, it correctly re-joined the YARN cluster; however, the Spark Streaming app does not seem to notice that the resources are back and has not re-created the missing executor.
The app continues running with 6 out of 7 executors, i.e. under capacity.
If we manually kill the driver and re-submit the app to YARN, all the state is correctly recreated from the checkpoint and all 7 executors come back online – however, this seems like a brutal workaround.
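For context, the app follows the standard checkpoint-recovery pattern below. This is a minimal sketch, not our actual code; the broker, topic, checkpoint path, and batch interval are placeholders:
{code}
// Minimal sketch of the app layout (Spark 1.4 APIs); broker, topic,
// checkpoint path and batch interval below are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StatefulKafkaApp {
  val checkpointDir = "hdfs:///checkpoints/stateful-app"  // placeholder

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("stateful-kafka-app")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))  // placeholder topic

    // Stateful part: a running count per key, kept alive across batches
    // (and across driver restarts) via the checkpoint.
    stream.map { case (key, _) => (key, 1L) }
      .updateStateByKey((values: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + values.sum))
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On re-submission the driver restores all state from the HDFS checkpoint,
    // which is why the kill-and-resubmit workaround recovers cleanly.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}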
Scenarios tested to isolate the issue:
The expected outcome after a machine reboot + services coming back is that processing resumes on that node. *FAILED* below means that processing continues at reduced capacity, as the lost machine rarely re-joins as a container/executor even though YARN sees it as a healthy node.
|| No || Failure scenario || test result || data loss || Notes ||
| 1 | Single node restart | FAILED | NO | Executor NOT redeployed when machine comes back and services are restarted |
| 2 | Multi-node restart (quick succession) | FAILED | YES | If we don't restore services on the machines that are down, the app, Kafka, or ZooKeeper metadata gets corrupted; the app crashes and can't be restarted without clearing the checkpoint -> data loss. Root cause is an unhealthy cluster when too many machines are lost. |
| 3 | Multi-node restart (rolling) | FAILED | NO | Same as single node restart, driver does not crash |
| 4 | Graceful services restart | FAILED | NO | Behaves just like single node restart even if we take the time to manually stop services before machine reboot. |
| 5 | Adding nodes to an incomplete cluster | SUCCESS | NO | The Spark app will usually start even if YARN can't fulfill all the resource requests (e.g. 5 out of 7 nodes are up when the app is started). When the missing nodes are added to YARN, Spark deploys executors on them, as expected, in all scenarios. |
| 6 | Restart executor process | PARTIAL SUCCESS | NO | In 1 out of 5 attempts it behaves like a machine restart; in the rest it works as expected – container/executor are redeployed in a matter of seconds. |
| 7 | Node restart on bigger cluster | FAILED | NO | We were trying to validate whether the behavior is caused by maxing out the cluster and having no slack to redeploy a crashed node. It still behaves like a single node restart even with lots of extra capacity in YARN – nodes, cores and RAM. |
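As a stop-gap for scenarios 1, 3, 4 and 7, the under-capacity condition can at least be detected from inside the driver. A minimal sketch follows; the listener name and the expected count are ours, not a Spark facility for fixing this:
{code}
// Hedged sketch: detect the under-capacity condition after each batch.
// getExecutorMemoryStatus also contains the driver, hence the "- 1".
import org.apache.spark.SparkContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class ExecutorCountListener(sc: SparkContext, expectedExecutors: Int)  // e.g. 7
  extends StreamingListener {

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val live = sc.getExecutorMemoryStatus.size - 1  // subtract the driver entry
    if (live < expectedExecutors) {
      // Hook for alerting, or for triggering the kill-and-resubmit workaround.
      println(s"Running under capacity: $live of $expectedExecutors executors")
    }
  }
}

// Registered on the StreamingContext before start():
//   ssc.addStreamingListener(new ExecutorCountListener(ssc.sparkContext, 7))
{code}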
*Logs for Scenario 6 – correct behavior on process restart*
{noformat}
2015-09-21 11:00:11,193 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Completed container container_1442827158253_0004_01_000004 (state: COMPLETE, exit status: 137)
2015-09-21 11:00:11,193 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: container_1442827158253_0004_01_000004. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
..
(logical continuation from earlier restart attempt)
2015-09-21 10:33:20,658 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
2015-09-21 10:33:20,658 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, capability: <memory:18022, vCores:14>)
..
2015-09-21 10:33:25,663 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Launching container container_1442827158253_0004_01_000012 for on host ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.1.14:32938/user/CoarseGrainedScheduler, executorHostname: ip-10-0-1-16.ec2.internal
2015-09-21 10:33:25,664 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Received 1 containers from YARN, launching executors on 1 of them.
{noformat}
*Logs for Scenario 1 – weird resource requests / behavior on node restart*
{noformat}
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated or disconnected! Shutting down. ip-10-0-1-16.ec2.internal:34741
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-24] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 8 on ip-10-0-1-16.ec2.internal: remote Rpc client disassociated
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-24] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated or disconnected! Shutting down. ip-10-0-1-16.ec2.internal:34741
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://sparkExecutor@ip-10-0-1-16.ec2.internal:34741] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2015-09-21 10:36:57,352 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Executor lost: 8 (epoch 995)
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 8 from BlockManagerMaster.
2015-09-21 10:36:57,352 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(8, ip-10-0-1-16.ec2.internal, 35415)
2015-09-21 10:36:57,352 [dag-scheduler-event-loop] INFO org.apache.spark.storage.BlockManagerMaster - Removed 8 successfully in removeExecutor
...
2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] WARN org.apache.spark.HeartbeatReceiver - Removing executor 8 with no recent heartbeats: 168535 ms exceeds timeout 120000 ms
2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost an executor 8 (already removed): Executor heartbeat timed out after 168535 ms
2015-09-21 10:39:44,320 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to kill executor(s) 8
2015-09-21 10:39:44,320 [kill-executor-thread] WARN org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Executor to kill 8 does not exist!
2015-09-21 10:39:44,320 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.deploy.yarn.YarnAllocator - Driver requested a total number of 5 executor(s).
2015-09-21 10:39:44,321 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to kill executor(s) .
2015-09-21 10:39:45,793 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor containers
2015-09-21 10:39:45,793 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, but found none.
... every 5 seconds
2015-09-21 10:40:05,800 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor containers
2015-09-21 10:40:05,800 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, but found none.
..
2015-09-21 10:43:55,876 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor containers
2015-09-21 10:43:55,876 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, but found none.
...
2015-09-21 10:49:20,979 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Canceling requests for 0 executor containers
2015-09-21 10:49:20,979 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator - Expected to find pending requests, but found none.
2015-09-21 10:49:20,980 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Completed container container_1442827158253_0004_01_000012 (state: COMPLETE, exit status: -100)
2015-09-21 10:49:20,980 [Reporter] INFO org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: container_1442827158253_0004_01_000012. Exit status: -100. Diagnostics: Container released on a *lost* node
.. done
=======
(ANOTHER RESTART ATTEMPT – note how it’s now requesting a total of 4 executors; should be 7!)
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-19] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated or disconnected! Shutting down. ip-10-0-1-15.ec2.internal:34332
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-31] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 1 on ip-10-0-1-15.ec2.internal: remote Rpc client disassociated
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-30] WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://sparkExecutor@ip-10-0-1-15.ec2.internal:34332] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-31] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver terminated or disconnected! Shutting down. ip-10-0-1-15.ec2.internal:34332
2015-09-21 10:51:28,226 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Executor lost: 1 (epoch 1431)
2015-09-21 10:51:28,226 [sparkDriver-akka.actor.default-dispatcher-32] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove executor 1 from BlockManagerMaster.
2015-09-21 10:51:28,227 [sparkDriver-akka.actor.default-dispatcher-32] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager BlockManagerId(1, ip-10-0-1-15.ec2.internal, 36311)
2015-09-21 10:51:28,227 [dag-scheduler-event-loop] INFO org.apache.spark.storage.BlockManagerMaster - Removed 1 successfully in removeExecutor
...
2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] WARN org.apache.spark.HeartbeatReceiver - Removing executor 1 with no recent heartbeats: 140055 ms exceeds timeout 120000 ms
2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost an executor 1 (already removed): Executor heartbeat timed out after 140055 ms
2015-09-21 10:53:44,320 [kill-executor-thread] INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting to kill executor(s) 1
2015-09-21 10:53:44,320 [kill-executor-thread] WARN org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Executor to kill 1 does not exist!
2015-09-21 10:53:44,320 [sparkDriver-akka.actor.default-dispatcher-32] INFO org.apache.spark.deploy.yarn.YarnAllocator - Driver requested a total number of 4 executor(s).
2015-09-21 10:53:44,321 [sparkDriver-akka.actor.default-dispatcher-32] INFO org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested to kill executor(s) .
{noformat}
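Our reading of the Scenario 1 logs above: while YARN still counts the container on the dead node as running (it only reports it lost about 10 minutes later, with exit status -100), the heartbeat timeout makes the driver "kill" an executor that no longer exists, which silently lowers the requested total (down to 5, then 4, instead of 7) without ever scheduling a replacement. A simplified model of that suspected accounting follows; this is a reconstruction from the logs, not actual Spark source:
{code}
// Simplified model of the suspected allocator accounting (NOT Spark source);
// state transitions reconstructed from the Scenario 1 log lines above.
class AllocatorModel {
  var targetExecutors  = 6  // already below 7 after an earlier restart attempt
  var runningExecutors = 6  // YARN still counts the container on the dead node
                            // until it reports it lost (exit status -100)
  var pendingRequests  = 0

  // Heartbeat timeout: the driver "kills" an executor that is already gone
  // ("Executor to kill 8 does not exist!") and the requested total drops:
  // "Driver requested a total number of 5 executor(s)."
  def heartbeatTimeoutKill(): Unit = targetExecutors -= 1

  // Periodic allocator sync: missing is negative, so instead of requesting a
  // container it tries to cancel pending requests, of which there are none:
  // "Canceling requests for 0 executor containers" /
  // "Expected to find pending requests, but found none." (every 5 seconds)
  def sync(): Unit = {
    val missing = targetExecutors - pendingRequests - runningExecutors
    if (missing > 0) pendingRequests += missing  // would request new containers
    else if (missing < 0)
      println(s"Canceling requests for ${math.min(pendingRequests, -missing)} executor containers")
  }

  // ~10 minutes later YARN finally reports the container lost; now
  // running == target, so a replacement is never requested, and the node
  // re-joining YARN changes nothing.
  def containerLostOnYarn(): Unit = runningExecutors -= 1
}
{code}
If that reading is right, it would also explain why Scenario 5 works: when fresh nodes are added to an incomplete cluster, the target was never decremented, so the allocator still has outstanding requests to place on them.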