You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/03/31 15:43:25 UTC

[jira] [Assigned] (SPARK-14293) Improve shuffle load balancing and minmize network data transmission

     [ https://issues.apache.org/jira/browse/SPARK-14293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14293:
------------------------------------

    Assignee:     (was: Apache Spark)

> Improve shuffle load balancing and minmize network data transmission
> --------------------------------------------------------------------
>
>                 Key: SPARK-14293
>                 URL: https://issues.apache.org/jira/browse/SPARK-14293
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Cheng Pei
>              Labels: performance
>
> Currently Spark provides the mechanism to set preferred location for reduce task. When the fraction of total map output at a location is equal or greater than the  parameter REDUCER_PREF_LOCS_FRACTION, the reduce task will get a preferred location. But this does not consider load balancing and network transmission. Based on the map output sizes and locations tracked by MapOutputTracker, we can obtain a better load balancing.
>  
> This patch proposes a strategy to set preferred locations for each reduce task, which could firstly keep each executor process almost the same amount of intermediate data and secondly minimize the network data transmission. This can benefit some conditions:
> 1．	REDUCER_PREF_LOCS_FRACTION tries to place the reduce tasks close to the largest output. If there exists data skew in the map outputs, it could cause some executors that have large of map outputs become busy. Our method could avoid this case and minimize the network data transmission. 
> 2．	When there are large of reduce tasks in the job, it helps each executor processes almost the same data and keeps load balancing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org