Posted to issues@spark.apache.org by "Xudingyu (Jira)" <ji...@apache.org> on 2020/12/24 03:12:00 UTC

[jira] [Updated] (SPARK-33896) Make Spark DAGScheduler datasource cache aware when scheduling tasks

     [ https://issues.apache.org/jira/browse/SPARK-33896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudingyu updated SPARK-33896:
-----------------------------
    Description: 
Goals:
• Make the Spark 3.0 scheduler datasource-cache-aware in a multi-replication HDFS cluster
• Achieve a performance gain in end-to-end (E2E) workloads when this feature is enabled

Problem Statement:
Spark’s DAGScheduler currently schedules tasks according to each RDD’s preferredLocations, which for HDFS-backed data respect the HDFS BlockLocation metadata. In a multi-replication cluster, HDFS returns the block locations as an Array[BlockLocation], and Spark chooses one of those locations to run the task on. However, tasks can run faster if they are scheduled onto nodes that already hold the datasource cache they need. Today Spark has no datasource cache locality provisioning mechanism, even when nodes in the cluster hold cached data.
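
As a rough illustration of the current behavior, here is a simplified sketch (not the actual DAGScheduler code) of how preferred hosts fall out of the HDFS block metadata: every replica host is reported, and the scheduler picks among them with no notion of cache.

{code:scala}
import org.apache.hadoop.fs.{BlockLocation, FileSystem, Path}

// Simplified sketch: derive candidate hosts for a file split from HDFS
// block metadata, roughly the way Spark's Hadoop-based RDDs do today.
def preferredHosts(fs: FileSystem, path: Path, offset: Long, length: Long): Seq[String] = {
  val status = fs.getFileStatus(path)
  // In a multi-replication cluster each block has several replica hosts.
  val blocks: Array[BlockLocation] = fs.getFileBlockLocations(status, offset, length)
  // All replica hosts are surfaced as equally good preferred locations.
  blocks.flatMap(_.getHosts).distinct.toSeq
}
{code}
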
This project aims to add a cache-locality-aware mechanism so that the Spark DAGScheduler can schedule tasks onto the nodes holding the relevant datasource cache, according to cache locality, in a multi-replication HDFS cluster.
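
A minimal sketch of the proposed preference, assuming a hypothetical cacheHosts lookup (no such API exists in Spark today) that reports which nodes hold the datasource cache for a given split:

{code:scala}
// Hypothetical sketch of cache-aware host selection: prefer the replica hosts
// that already hold the datasource cache for this split, and fall back to
// plain HDFS block locality when none of them do.
def cacheAwareHosts(replicaHosts: Seq[String], cacheHosts: Set[String]): Seq[String] = {
  val cached = replicaHosts.filter(cacheHosts.contains)
  if (cached.nonEmpty) cached else replicaHosts
}
{code}

Because the result stays a subset of the HDFS replica set, the fallback preserves today's block locality whenever no replica holds the cache.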


> Make Spark DAGScheduler datasource cache aware when scheduling tasks
> --------------------------------------------------------------------
>
>                 Key: SPARK-33896
>                 URL: https://issues.apache.org/jira/browse/SPARK-33896
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xudingyu
>            Priority: Major
>



