Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/07/16 16:42:14 UTC

[jira] [Updated] (SPARK-26104) make pci devices visible to task scheduler

     [ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26104:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                       3.0.0

> make pci devices visible to task scheduler
> ------------------------------------------
>
>                 Key: SPARK-26104
>                 URL: https://issues.apache.org/jira/browse/SPARK-26104
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Chen Qin
>            Priority: Major
>              Labels: Hydrogen
>
> Spark task scheduling has long considered CPU only: depending on how many vcores each executor has at a given moment, tasks are scheduled as soon as enough vcores become available.
> For deep learning use cases, the fundamental processing unit shifts from CPU alone to GPU/FPGA plus CPU, with the CPU moving data in and out of GPU memory.
> Deep learning frameworks built on top of GPU fleets require pinning a task to a fixed number of GPUs, which Spark does not yet support. E.g. a Horovod task requires 2 GPUs, held uninterrupted until it finishes, regardless of CPU availability in the executor. In Uber's Peloton executor scheduler, the number of cores available can exceed what the user asked for, because the executor may be over-provisioned.
> Without exclusive ownership of the PCI devices (/gpu1, /gpu2), such workloads may run into unexpected states.
>  
> Related JIRAs cover allocating executor containers with GPU resources and serve as the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> The existing SPIP, Accelerator-Aware Task Scheduling for Spark (SPARK-24615), is compatible with this design. Its approach is slightly different in that it tracks utilization of PCI devices, so a customized task scheduler could either fall back to a "best to have" approach or implement the "must have" approach stated above.
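
The "must have" semantics described above roughly map onto the resource
scheduling that the SPARK-24615 SPIP eventually delivered in Spark 3.0. A
minimal sketch of that configuration (property names per the Spark 3.0
resource scheduling docs; the discovery-script path is a placeholder for
whatever script prints the executor's GPU addresses):

```shell
# Sketch: give each executor 2 GPUs and pin each task to 2 GPUs,
# so a task holds its devices exclusively until it finishes
# (the "must have" behavior, rather than scheduling on vcores alone).
spark-submit \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=/path/to/getGpusResources.sh \
  --class com.example.HorovodJob \
  my-job.jar
```

With this in place, a task can look up exactly which device addresses it was
assigned via TaskContext.get().resources()("gpu").addresses, instead of
guessing which /gpuN it may touch.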



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org