Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/06/19 02:56:00 UTC

[jira] [Assigned] (SPARK-24591) Number of cores and executors in the cluster

     [ https://issues.apache.org/jira/browse/SPARK-24591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24591:
------------------------------------

    Assignee: Apache Spark

> Number of cores and executors in the cluster
> --------------------------------------------
>
>                 Key: SPARK-24591
>                 URL: https://issues.apache.org/jira/browse/SPARK-24591
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.1
>            Reporter: Maxim Gekk
>            Assignee: Apache Spark
>            Priority: Minor
>
> We need to add two new methods: the first should return the total number of CPU cores of all executors in the cluster, and the second should return the current number of executors registered in the cluster.
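>
> A minimal sketch of how the two methods might look (the names _coreCount_ and _executorsCount_ below come from this description and are placeholders, not a final API):
> {code:scala}
> // Hypothetical API sketch only, not a committed design:
> trait ClusterInfo {
>   /** Total number of CPU cores of all executors registered in the cluster. */
>   def coreCount: Int
>
>   /** Current number of executors registered in the cluster. */
>   def executorsCount: Int
> }
> {code}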
> The main motivations for adding these methods:
> 1. It is a best practice to manage job parallelism relative to the number of available cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an anti-pattern to leave a bunch of cores on large clusters twiddling their thumbs, doing nothing. Usually, users pass predefined constants to _repartition()_ and _coalesce()_, and the constant is chosen based on the current cluster size. If the code runs on another cluster and/or on a resized cluster, the constant has to be modified each time. This happens frequently when a job that normally runs on, say, an hour of data on a small cluster needs to run on a week of data on a much larger cluster.
> 2. *spark.default.parallelism* can be used to get the total number of cores in the cluster, but it can be redefined by the user. The info can also be obtained by registering a listener (see the sketch after this list), but repeating the same boilerplate in every application looks ugly. We should follow the DRY principle.
> 3. Regarding _executorsCount()_, some jobs, e.g., local-node ML training, use a lot of parallelism. It is a common practice to distribute such jobs so that there is one partition for each executor.
> 4. In some places, users collect this info, together with other settings and job timings (at the app level), for analysis. E.g., ML can be used to determine the optimal cluster size given different objectives, e.g., fastest throughput vs. lowest cost per unit of processing.
> 5. The simpler argument is that basic cluster properties should be easily discoverable via APIs.
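>
> For item 2, a rough sketch of the listener-based workaround using the existing _SparkListener_ events ( _ExecutorCoreListener_ is just an illustrative name):
> {code:scala}
> import scala.collection.concurrent.TrieMap
> import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}
>
> // Tracks currently registered executors and their cores from
> // executor add/remove events.
> class ExecutorCoreListener extends SparkListener {
>   private val coresByExecutor = TrieMap.empty[String, Int]
>
>   override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = {
>     coresByExecutor.put(e.executorId, e.executorInfo.totalCores)
>   }
>
>   override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = {
>     coresByExecutor.remove(e.executorId)
>   }
>
>   def executorsCount: Int = coresByExecutor.size
>   def coreCount: Int = coresByExecutor.values.sum
> }
>
> val listener = new ExecutorCoreListener
> sc.addSparkListener(listener)  // sc is an existing SparkContext
> {code}
> Every application that needs these counts has to carry such boilerplate today, which is exactly what the proposed methods would remove.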



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org