Posted to issues@spark.apache.org by "Jeroen Dries (Jira)" <ji...@apache.org> on 2022/07/08 07:54:00 UTC

[jira] [Created] (SPARK-39717) Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"

Jeroen Dries created SPARK-39717:
------------------------------------

             Summary: Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"
                 Key: SPARK-39717
                 URL: https://issues.apache.org/jira/browse/SPARK-39717
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: YARN Hadoop 3.2 cluster

YARN_CONTAINER_RUNTIME_TYPE=docker

spark.speculation=true

spark.dynamicAllocation.enabled=true

 
            Reporter: Jeroen Dries


After about 4 of its executors were preempted, the Spark driver continuously logs the messages below for those executors. The Spark application is still functional, and many other containers have been preempted without triggering these warnings.

 

Looking at the Spark code, this appears to be an unexpected state. The Spark UI also no longer shows these executors.
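
Judging by the message text, the warning is emitted from ExecutorAllocationManager.removeExecutors: an executor whose ResourceProfile id is still UNKNOWN is skipped and never actually removed. A paraphrased sketch of the relevant branch in the 3.2 sources (identifier names approximate, not a verbatim copy):
{code:scala}
// Paraphrased from ExecutorAllocationManager.removeExecutors (Spark 3.2).
executors.foreach { case (executorIdToBeRemoved, rpId) =>
  if (rpId == UNKNOWN_RESOURCE_PROFILE_ID) {
    if (testing) {
      throw new SparkException("ResourceProfile Id was UNKNOWN, this is not expected")
    }
    // The executor stays registered, so the next scheduling pass hits
    // this same branch again -- hence the endlessly repeating warning.
    logWarning(s"Not removing executor $executorIdToBeRemoved because the " +
      "ResourceProfile was UNKNOWN!")
  } else {
    // Normal path: compute the per-profile executor total and mark the
    // executor as pending removal.
  }
}
{code}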

Restarting the application makes the warnings go away for some time, but they always come back.
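
For reference, the application's session is configured roughly as follows. This is a minimal sketch: the app name is a placeholder, and spark.shuffle.service.enabled is an assumption inferred from the YarnShuffleService entries in the node manager log further down; the other settings come from the Environment section above.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the configuration under which the warnings appear.
val spark = SparkSession.builder()
  .appName("preemption-repro")                       // placeholder name
  .config("spark.speculation", "true")               // from Environment
  .config("spark.dynamicAllocation.enabled", "true") // from Environment
  .config("spark.shuffle.service.enabled", "true")   // assumed: NM logs show YarnShuffleService
  .getOrCreate()
// Submitted with --master yarn on a Hadoop 3.2 cluster, with
// YARN_CONTAINER_RUNTIME_TYPE=docker so executors run in Docker containers.
{code}
The driver log at the time of the preemption: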

 
{code:java}
{"lineno":69,"name":"org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Requesting driver to remove executor 304 for reason Container cont
ainer_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.106}

{"lineno":73,"name":"org.apache.spark.scheduler.cluster.YarnScheduler","process":31,"filename":"Logging.scala","levelname":"ERROR","message":"Lost executor 304 on epod062.vgt.vito.be: Container container_e5047_1656725539918_7361_01_0015
53 on host: epod062.vgt.vito.be was preempted.","created":1657198251.111}


{code}
{code:java}
 {"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.857} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.858} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.858} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.858} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.959} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.959} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.959} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.959} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.059} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.059} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.059} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198314.059} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.160} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.160} 
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{code}
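Note the cadence: the created timestamps repeat roughly every 100 ms (…313.857, …313.959, …314.059, …314.160), which matches the ExecutorAllocationManager's 100 ms schedule interval, so the warning fires on every scheduling pass.
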
The YARN node manager logs look fine for this container:

{code:java}
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(290)) - Initializing container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from LOCALIZING to SCHEDULED

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(503)) - Starting container [container_e5047_1656725539918_7361_01_001553]

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from SCHEDULED to RUNNING

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(943)) - Starting resource-monitoring for container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,064 INFO  runtime.DockerLinuxContainerRuntime (DockerLinuxContainerRuntime.java:getIpAndHost(1098)) - Docker inspect output for container_e5047_1656725539918_7361_01_001553: ,epod062.vgt.vito.be

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,074 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:initializeProcessTrees(567)) - container_e5047_1656725539918_7361_01_001553's ip = 192.168.207.162, and hostname = epod062.vgt.vito.be

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:23,400 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(468)) - Skipping monitoring container container_e5047_1656725539918_7361_01_001553 since CPU usage is not yet available.

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from RUNNING to KILLING

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(744)) - Cleaning up container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(762)) - Marking container container_e5047_1656725539918_7361_01_001553 as inactive

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(769)) - Getting pid for container container_e5047_1656725539918_7361_01_001553 to kill from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1098)) - Accessing pid for container container_e5047_1656725539918_7361_01_001553 from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1110)) - Got pid 38072 for container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(933)) - Sending signal to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,017 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(941)) - Sent signal SIGTERM to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553, result=success

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,020 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container container_e5047_1656725539918_7361_01_001553 is : 143
{code}
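Nothing on the YARN side looks abnormal: exit code 143 is 128 + 15 (SIGTERM), i.e. the container shut down cleanly after the preemption signal. The problem appears to be purely on the driver side, where the executor's ResourceProfile id is never resolved and the removal is retried forever.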


