Posted by krchia on 2020/06/15 12:01:41 UTC

[2.4.5 Standalone Master]: Idle cores not being allocated

I have a Spark 2.4.5 cluster in standalone mode, orchestrated by Nomad jobs
running on EC2. We deploy a Scala web server as a long-running jar via
`spark-submit` in client mode. Sometimes our in-house autoscaler scales down
and kills workers without checking whether any of their cores are allocated
to running applications. Those applications then end up with 0 cores, even
though there are healthy workers in the cluster.
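
A minimal sketch of the check the autoscaler is missing, in Python. It assumes
the autoscaler has already fetched and parsed the standalone master's JSON
status, and that each worker entry carries `id`, `state` and `coresused`
fields. Those field names, the helper `workers_safe_to_kill` and the sample
worker ids are assumptions for illustration and should be verified against
the output of your own master.

```python
def workers_safe_to_kill(master_state):
    """Return ids of ALIVE workers that report zero cores in use.

    `master_state` is a dict parsed from the master's JSON status.
    The keys "workers", "id", "state" and "coresused" are assumed names.
    """
    safe = []
    for worker in master_state.get("workers", []):
        is_alive = worker.get("state") == "ALIVE"
        # Treat a missing "coresused" as busy, so we never kill on bad data.
        is_idle = worker.get("coresused", 1) == 0
        if is_alive and is_idle:
            safe.append(worker["id"])
    return safe


# Hypothetical sample state: one busy worker, one idle worker, one dead worker.
sample_state = {
    "workers": [
        {"id": "worker-a", "state": "ALIVE", "coresused": 2},
        {"id": "worker-b", "state": "ALIVE", "coresused": 0},
        {"id": "worker-c", "state": "DEAD", "coresused": 0},
    ]
}
```

With a filter like this, scale-down would only pick workers from the returned
list, which avoids stranding an application in the first place but does not
address the master-side reallocation question.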

The master only reallocates cores to the stuck application once I submit a
new application or register a new worker in the cluster. This is
problematic, because until then the long-running application stays at 0
cores.

Could this be related to the fact that `schedule()` is only triggered by new
workers / new applications as commented here?

If that is the case, should the master be calling `schedule()` when
removing workers after calling `timeOutWorkers()`?
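
To make the hypothesis concrete, here is a toy Python model of it. This is
not Spark's actual code: `ToyMaster`, `ToyWorker`, `ToyApp`, `run_scenario`
and the `reschedule_on_worker_loss` flag are all hypothetical names. The
model only assumes that scheduling runs when a worker or an application
registers, and optionally also when a worker is removed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyApp:
    name: str
    cores_wanted: int
    cores_granted: int = 0


@dataclass
class ToyWorker:
    worker_id: str
    cores_free: int
    # app name -> cores this worker has handed to that app
    allocations: dict = field(default_factory=dict)


class ToyMaster:
    """Toy stand-in for a standalone master; not Spark's implementation."""

    def __init__(self, reschedule_on_worker_loss):
        self.reschedule_on_worker_loss = reschedule_on_worker_loss
        self.workers = {}
        self.apps = {}

    def schedule(self):
        # Hand free cores to every app that still wants more.
        for app in self.apps.values():
            for worker in self.workers.values():
                missing = app.cores_wanted - app.cores_granted
                if missing <= 0:
                    break
                grant = min(missing, worker.cores_free)
                if grant > 0:
                    worker.cores_free -= grant
                    held = worker.allocations.get(app.name, 0)
                    worker.allocations[app.name] = held + grant
                    app.cores_granted += grant

    def register_worker(self, worker):
        self.workers[worker.worker_id] = worker
        self.schedule()

    def register_app(self, app):
        self.apps[app.name] = app
        self.schedule()

    def remove_worker(self, worker_id):
        # The app loses whatever cores the dead worker had given it.
        worker = self.workers.pop(worker_id)
        for app_name, cores in worker.allocations.items():
            self.apps[app_name].cores_granted -= cores
        if self.reschedule_on_worker_loss:
            self.schedule()


def run_scenario(reschedule_on_worker_loss):
    master = ToyMaster(reschedule_on_worker_loss)
    master.register_worker(ToyWorker("w1", cores_free=2))
    app = ToyApp("web-server", cores_wanted=2)
    master.register_app(app)  # app receives both cores from w1
    master.register_worker(ToyWorker("w2", cores_free=2))  # w2 stays idle
    master.remove_worker("w1")  # simulated scale-down kills w1
    return app.cores_granted
```

In this model `run_scenario(False)` leaves the application at 0 cores while
`w2` sits idle, and `run_scenario(True)` moves it onto `w2` straight away,
which is the behaviour change the question above is asking about.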

The downscaling causes me to see this in my logs, so I am fairly certain
`timeOutWorkers()` is being called:
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker:
