Posted to issues@spark.apache.org by "Zhang, Liye (JIRA)" <ji...@apache.org> on 2015/09/15 08:02:46 UTC

[jira] [Created] (SPARK-10608) turn off reduce tasks locality as default to avoid bad cases

Zhang, Liye created SPARK-10608:
-----------------------------------

             Summary: turn off reduce tasks locality as default to avoid bad cases
                 Key: SPARK-10608
                 URL: https://issues.apache.org/jira/browse/SPARK-10608
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 1.5.0
            Reporter: Zhang, Liye
            Priority: Critical


After [SPARK-2774|https://issues.apache.org/jira/browse/SPARK-2774], which aims to reduce network transfer, reduce tasks have their own locality preferences instead of simply following the map-side locality. This leads to bad cases when data skew happens: tasks can keep being scheduled on a few nodes and never become evenly distributed.
e.g. if we do not set *spark.scheduler.minRegisteredExecutorsRatio*, the input data may be loaded on only part of the nodes, say 4 out of 10. The first batch of reduce tasks will then run on those 4 nodes, with many pending tasks waiting to be scheduled. This might be fine if the tasks run for a long time. But if the tasks finish quickly, for example in less than *spark.locality.wait*, the locality level never falls back to a lower level, so the following batches of tasks still run on the same 4 nodes. The result is that all subsequent tasks run on the 4 nodes instead of 10. Even if the tasks become evenly distributed after several stages, the unbalanced distribution at the beginning exhausts resources on some nodes first and causes more frequent GC, which leads to bad performance.
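Assuming the reduce-side locality from SPARK-2774 is gated by the *spark.shuffle.reduceLocality.enabled* flag (an assumption about the config name; check your Spark version's documentation), a workaround sketch in spark-defaults.conf would be to turn it off, and optionally shorten the locality wait so tasks fall back to other nodes faster:

{code}
# Sketch of a workaround, assuming this flag controls the
# reduce-task locality behavior introduced by SPARK-2774.
spark.shuffle.reduceLocality.enabled  false

# Optional alternative: keep locality on, but lower the wait
# (default 3s) so short-lived tasks spill to other nodes sooner.
spark.locality.wait                   500ms
{code}

The same settings can be passed per job via {{spark-submit --conf spark.shuffle.reduceLocality.enabled=false}}.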



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org