Posted to issues@spark.apache.org by "t oo (Jira)" <ji...@apache.org> on 2020/06/20 20:59:00 UTC
[jira] [Created] (SPARK-32040) Idle cores not being allocated

t oo created SPARK-32040:
----------------------------

             Summary: Idle cores not being allocated
                 Key: SPARK-32040
                 URL: https://issues.apache.org/jira/browse/SPARK-32040
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 2.4.5
            Reporter: t oo


Background: 
I have a Spark 2.4.5 cluster in standalone mode, orchestrated by Nomad jobs
running on EC2. We deploy a Scala web server as a long-running jar via
`spark-submit` in client mode. Sometimes an application ends up with 0 cores
because our in-house autoscaler scales down and kills workers without
checking whether any of their cores are allocated to existing applications.
These applications then stay at 0 cores, even though there are healthy
workers in the cluster.
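
For context, the check our autoscaler skips could look roughly like the
sketch below. It assumes the worker list has already been fetched as parsed
JSON (for example from the standalone master's `/json` endpoint) and that
each worker entry carries `id`, `state` and `coresused` fields. Those field
names and the helper `idle_workers` are assumptions for illustration and have
not been verified against 2.4.5.

```python
def idle_workers(master_state):
    """Return ids of live workers that have no cores allocated to any app.

    `master_state` is assumed to be a dict with a "workers" list; the field
    names "id", "state" and "coresused" are assumptions, not a verified API.
    """
    return [
        w["id"]
        for w in master_state.get("workers", [])
        # A worker with a missing "coresused" field is treated as busy.
        if w.get("state") == "ALIVE" and w.get("coresused") == 0
    ]


# Hand-written sample data, not real master output.
sample = {
    "workers": [
        {"id": "worker-a", "state": "ALIVE", "coresused": 2},
        {"id": "worker-b", "state": "ALIVE", "coresused": 0},
        {"id": "worker-c", "state": "DEAD", "coresused": 0},
    ]
}

print(idle_workers(sample))  # ['worker-b']
```

Only workers returned by such a check would be safe to kill without taking
cores away from a running application.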

However, the master only reallocates cores to the application once I submit
a new application or register a new worker in the cluster. This is
problematic, because until then the long-running 0-core application is
stuck.

Could this be because `schedule()` is only triggered when a new worker or a
new application registers, as the comment here says?
[https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724]

If that is the case, should the master also call `schedule()` after it
removes workers in `timeOutWorkers()`?
[https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417]
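
To make the suspected behaviour concrete, here is a toy model in plain
Python. It is not Spark's code: the class `ToyMaster`, its methods and the
`reschedule_on_timeout` flag are hypothetical names made up for this sketch.
The model assumes scheduling runs only when a worker or an application
registers, and the flag adds the extra `schedule()` call asked about above.

```python
class ToyMaster:
    """Toy stand-in for the standalone master; not Spark's real logic."""

    def __init__(self, reschedule_on_timeout=False):
        self.reschedule_on_timeout = reschedule_on_timeout
        self.free_cores = {}  # worker id -> unallocated cores
        self.granted = {}     # app id -> {worker id: cores granted}
        self.wanted = {}      # app id -> cores requested

    def schedule(self):
        # Hand free cores to any app that holds fewer than it asked for.
        for app, want in self.wanted.items():
            for worker in self.free_cores:
                missing = want - sum(self.granted[app].values())
                take = min(missing, self.free_cores[worker])
                if take > 0:
                    self.free_cores[worker] -= take
                    held = self.granted[app].get(worker, 0)
                    self.granted[app][worker] = held + take

    def register_worker(self, worker, cores):
        self.free_cores[worker] = cores
        self.schedule()  # assumed trigger: new worker

    def register_app(self, app, cores):
        self.wanted[app] = cores
        self.granted[app] = {}
        self.schedule()  # assumed trigger: new application

    def time_out_worker(self, worker):
        # Drop the worker and every core it had granted to applications.
        del self.free_cores[worker]
        for grants in self.granted.values():
            grants.pop(worker, None)
        if self.reschedule_on_timeout:
            self.schedule()  # the extra call proposed in this issue

    def cores_of(self, app):
        return sum(self.granted[app].values())


def run(reschedule_on_timeout):
    master = ToyMaster(reschedule_on_timeout)
    master.register_worker("w1", 2)
    master.register_app("web", 2)    # all of web's cores land on w1
    master.register_worker("w2", 2)  # healthy worker with idle cores
    master.time_out_worker("w1")     # autoscaler killed w1
    return master.cores_of("web")


print(run(reschedule_on_timeout=False))  # 0: app is stuck, w2 sits idle
print(run(reschedule_on_timeout=True))   # 2: cores move over to w2
```

In the stuck case, registering one more worker or application in the toy
model triggers `schedule()` and the application gets its cores back, which
matches what I observe on the real cluster.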

The downscaling produces the following in my logs, so I am fairly certain
`timeOutWorkers()` is being called:
```
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-<IP_ADDRESS>-7077
```


