Posted to issues@spark.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2018/10/30 22:02:00 UTC

[jira] [Commented] (SPARK-25889) Dynamic allocation load-aware ramp up

    [ https://issues.apache.org/jira/browse/SPARK-25889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669384#comment-16669384 ] 

DB Tsai commented on SPARK-25889:
---------------------------------

The problem sounds valid, and the solution could work. Since job scheduling isn't my area of expertise, pinging [~vanzin] for expert input. Thanks for the write-up.

> Dynamic allocation load-aware ramp up
> -------------------------------------
>
>                 Key: SPARK-25889
>                 URL: https://issues.apache.org/jira/browse/SPARK-25889
>             Project: Spark
>          Issue Type: New Feature
>          Components: Scheduler, YARN
>    Affects Versions: 2.3.2
>            Reporter: Adam Kennedy
>            Priority: Major
>
> The time-based exponential ramp-up behavior for dynamic allocation is naive and destructive, making it very difficult to run very large jobs.
> On a large (64,000 core) YARN cluster with a high number of input partitions (200,000+) the default dynamic allocation approach of requesting containers in waves, doubling exponentially once per second, results in 50% of the entire cluster being requested in the final 1 second wave.
> This can easily overwhelm RPC processing, or cause expensive Executor startup steps to break systems. With the interval so short, many additional containers may be requested beyond what is actually needed and then complete very little work before sitting around waiting to be deallocated.
> Delaying the time between these fixed doublings has only limited impact. Setting the doubling interval to once per minute would result in a very slow ramp-up, at the end of which we still face large, potentially crippling waves of executor startup.
> An alternative approach to spooling up large jobs appears to be needed, one that is still relatively simple but more adaptable to different cluster sizes and to differing cluster and job performance.
> I would like to propose a few different approaches based around the general idea of controlling outstanding requests for new containers based on the number of executors that are currently running, for some definition of "running".
> One example might be to limit requests to one new executor for every existing executor that currently has an active task, or some ratio of that, to allow for more or less aggressive spool up. A lower ratio would let us approximate something like Fibonacci ramp-up; a higher ratio of, say, 2x would spool up quickly while staying aligned with the rate at which broadcast blocks can be distributed to new members.
>  
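A minimal sketch of the ratio-based cap described above, in Scala. This is illustrative only, not Spark's actual ExecutorAllocationManager API; the names (requestsToAdd, rampUpRatio, minRequests) and their defaults are assumptions for the sake of the example:

    object LoadAwareRampUp {
      // Cap outstanding container requests based on how many executors
      // currently have at least one active task, rather than on wall-clock time.
      //   activeExecutors - executors currently running at least one task
      //   pendingRequests - containers already requested but not yet running
      //   neededExecutors - executors needed to cover the pending task backlog
      //   rampUpRatio     - new requests allowed per active executor (e.g. 1.0 or 2.0)
      //   minRequests     - floor so ramp-up can start when nothing is running yet
      def requestsToAdd(
          activeExecutors: Int,
          pendingRequests: Int,
          neededExecutors: Int,
          rampUpRatio: Double = 1.0,
          minRequests: Int = 2): Int = {
        // Total outstanding requests allowed in this scheduling round.
        val allowedOutstanding =
          math.max(minRequests, (activeExecutors * rampUpRatio).toInt)
        // Room left under the cap, never exceeding what is actually still needed.
        val headroom = math.max(0, allowedOutstanding - pendingRequests)
        math.min(headroom, math.max(0, neededExecutors - pendingRequests))
      }
    }

With rampUpRatio = 1.0, the request count can roughly double only as previously requested executors begin running tasks, so the ramp-up rate tracks how quickly the cluster absorbs new executors rather than a fixed time interval.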



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org