
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:12:34 UTC

[jira] [Resolved] (SPARK-20054) [Mesos] Detectability for resource starvation

     [ https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20054.
----------------------------------
    Resolution: Incomplete

> [Mesos] Detectability for resource starvation
> ---------------------------------------------
>
>                 Key: SPARK-20054
>                 URL: https://issues.apache.org/jira/browse/SPARK-20054
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos, Scheduler
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>            Reporter: Kamal Gurala
>            Priority: Minor
>              Labels: bulk-closed
>
> We currently run Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We recently had a production issue in which our Spark frameworks accepted resources from the Mesos master, so executors were started and the Spark driver was aware of them, but the driver did not schedule any tasks. Nothing happened for a long time because the minimum registered resources threshold was not met. The cluster is usually under-provisioned because not all jobs need to run at the same time. The held resources were never offered back to the master for re-allocation, which brought the entire cluster to a halt until we intervened manually.
> We use DRF for Mesos and FIFO for Spark. At any point in time there can be 10-15 Spark frameworks running on Mesos on the under-provisioned cluster.
> The ask is for better recoverability or detectability in the scenario where individual Spark frameworks hold onto resources but never launch any tasks, or for these frameworks to release the held resources after a fixed amount of time.
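
The detectability part of the request above can be illustrated with a small sketch. It is not Spark or Mesos code: `FrameworkState`, `is_starved`, `starved_frameworks`, and the 600-second timeout are hypothetical names and values chosen for illustration, and the snapshot data is assumed to come from whatever monitoring is already in place. The idea is only to flag a framework that holds resources yet has launched no task for longer than a fixed timeout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FrameworkState:
    """Hypothetical snapshot of one Spark framework (illustration only)."""
    name: str
    held_cpus: float            # CPUs accepted from the master and still held
    resources_since: float      # time in seconds when it started holding them
    last_task_launch: Optional[float] = None  # time of the latest task launch, if any


def is_starved(fw: FrameworkState, now: float, timeout_s: float) -> bool:
    """Return True if fw holds resources but launched no task for timeout_s."""
    if fw.held_cpus <= 0:
        return False  # holds nothing, so it cannot be blocking other frameworks
    # Idle time counts from the last launch, or from acquisition if none happened.
    if fw.last_task_launch is None:
        idle_since = fw.resources_since
    else:
        idle_since = fw.last_task_launch
    return now - idle_since >= timeout_s


def starved_frameworks(frameworks: List[FrameworkState], now: float,
                       timeout_s: float = 600.0) -> List[str]:
    """Names of frameworks that look starved at time `now` (assumed timeout)."""
    return [fw.name for fw in frameworks if is_starved(fw, now, timeout_s)]
```

A check like this could feed an alert, or, for the recoverability half of the request, trigger releasing the held resources; how that release would be done is outside this sketch.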


