Posted to issues@spark.apache.org by "Jeroen Dries (Jira)" <ji...@apache.org> on 2022/07/08 07:54:00 UTC
[jira] [Created] (SPARK-39717) Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"
Jeroen Dries created SPARK-39717:
------------------------------------
Summary: Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"
Key: SPARK-39717
URL: https://issues.apache.org/jira/browse/SPARK-39717
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.1
Environment: YARN Hadoop 3.2 cluster
YARN_CONTAINER_RUNTIME_TYPE=docker
spark.speculation=true
spark.dynamicAllocation.enabled=true
Reporter: Jeroen Dries
The Spark driver continuously logs these messages for about four executors after those executors were preempted. The Spark application is still functional, and many other containers have been preempted without triggering these warnings.
Looking at the Spark code, this appears to be an unexpected state. The Spark UI also no longer shows these executors.
Restarting the application makes the warnings go away for some time, but they always come back.
{code:java}
{"lineno":69,"name":"org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Requesting driver to remove executor 304 for reason Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.106}
{"lineno":73,"name":"org.apache.spark.scheduler.cluster.YarnScheduler","process":31,"filename":"Logging.scala","levelname":"ERROR","message":"Lost executor 304 on epod062.vgt.vito.be: Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.111}
{code}
{code:java}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.857}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{code}
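For context, the repeating warnings suggest the executor-allocation logic re-evaluates the same stale executors on every tick and skips them because their resource profile is no longer known. The following is a minimal, purely illustrative Python sketch of that behavior; the function name `remove_idle_executors` and the data shapes are hypothetical and are not Spark's actual API:

```python
# Illustrative model (NOT Spark code) of the warn-and-skip loop seen in the
# logs: executors 287/295/296/304 were preempted and already dropped from the
# resource-profile tracking, yet still appear in the idle list, so each
# scheduling tick emits the same warning instead of removing them.

UNKNOWN_RESOURCE_PROFILE_ID = -1  # sentinel for "profile no longer tracked"

def remove_idle_executors(executor_profiles, idle_executors):
    """Return (removed executors, warnings); skip unknown-profile entries."""
    removed, warnings = [], []
    for exec_id in idle_executors:
        rp_id = executor_profiles.get(exec_id, UNKNOWN_RESOURCE_PROFILE_ID)
        if rp_id == UNKNOWN_RESOURCE_PROFILE_ID:
            # Stale entry: the executor is already gone (preempted), but it
            # was never purged from the idle list, so it is warned about
            # again on every tick.
            warnings.append(f"Not removing executor {exec_id} because the "
                            "ResourceProfile was UNKNOWN!")
        else:
            removed.append(exec_id)
    return removed, warnings

# Executor 101 has a known profile and is removed; the four preempted
# executors are skipped with a warning each, tick after tick.
profiles = {101: 0, 102: 0}
idle = [101, 287, 295, 296, 304]
removed, warns = remove_idle_executors(profiles, idle)
```

Under this reading, the bug would be that preemption removes the executor from profile tracking without also clearing it from the idle/removal candidate set, leaving it permanently stuck in the warn branch.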
YARN node manager logs seem fine for this container:
{code:java}
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(290)) - Initializing container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from LOCALIZING to SCHEDULED
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(503)) - Starting container [container_e5047_1656725539918_7361_01_001553]
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from SCHEDULED to RUNNING
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(943)) - Starting resource-monitoring for container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,064 INFO runtime.DockerLinuxContainerRuntime (DockerLinuxContainerRuntime.java:getIpAndHost(1098)) - Docker inspect output for container_e5047_1656725539918_7361_01_001553: ,epod062.vgt.vito.be
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,074 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:initializeProcessTrees(567)) - container_e5047_1656725539918_7361_01_001553's ip = 192.168.207.162, and hostname = epod062.vgt.vito.be
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:23,400 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(468)) - Skipping monitoring container container_e5047_1656725539918_7361_01_001553 since CPU usage is not yet available.
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from RUNNING to KILLING
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(744)) - Cleaning up container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(762)) - Marking container container_e5047_1656725539918_7361_01_001553 as inactive
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(769)) - Getting pid for container container_e5047_1656725539918_7361_01_001553 to kill from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1098)) - Accessing pid for container container_e5047_1656725539918_7361_01_001553 from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1110)) - Got pid 38072 for container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(933)) - Sending signal to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,017 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(941)) - Sent signal SIGTERM to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553, result=success
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,020 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container container_e5047_1656725539918_7361_01_001553 is : 143
{code}
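The final exit code 143 is consistent with a clean preemption kill: by POSIX shell convention, a process terminated by a signal reports 128 plus the signal number, and SIGTERM is signal 15, matching the "Sent signal SIGTERM" line just above. A one-line check:

```python
import signal

# 128 + signal number is the conventional exit code for a signal-killed
# process; SIGTERM is 15, so a SIGTERM'd container exits with 143.
exit_code = 128 + signal.SIGTERM
assert exit_code == 143
```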
--
This message was sent by Atlassian Jira
(v8.20.10#820010)