You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Cheng Pei (JIRA)" <ji...@apache.org> on 2016/03/31 14:39:25 UTC

[jira] [Updated] (SPARK-14293) Improve shuffle load balancing and minmize network data transmission

     [ https://issues.apache.org/jira/browse/SPARK-14293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Pei updated SPARK-14293:
------------------------------
    Target Version/s:   (was: 1.6.1)
       Fix Version/s:     (was: 2.0.0)

> Improve shuffle load balancing and minmize network data transmission
> --------------------------------------------------------------------
>
>                 Key: SPARK-14293
>                 URL: https://issues.apache.org/jira/browse/SPARK-14293
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Cheng Pei
>              Labels: performance
>
> Currently Spark provides the mechanism to set preferred location for reduce task. When the fraction of total map output at a location is equal or greater than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a preferred location. But this does not consider load balancing and network transmission. Based on the map output sizes and locations tracked by MapOutputTracker, we can obtain a better load balancing, which has been tested in the some real applications, such as wordcount and pagerank.
>  
> This patch proposes a strategy to set preferred locations for each reduce task, which could firstly keep each executor process almost the same amount of intermediate data and secondly minimize the network data transmission. This can benefit some conditions:
> 1.	REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the largest output. If there exists data skew in the map outputs. It could cause some executors that have large of map outputs become busy. Our method could avoid this case and minimize the network data transmission. 
> 2.	When there are large of reduce tasks in the job, it helps each executor processes almost the same data and keeps load balancing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org