You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/12/03 21:29:11 UTC

[jira] [Resolved] (SPARK-5081) Shuffle write increases

     [ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-5081.
------------------------------
    Resolution: Cannot Reproduce

OK will treat it as almost certainly improved along the way, somewhere

> Shuffle write increases
> -----------------------
>
>                 Key: SPARK-5081
>                 URL: https://issues.apache.org/jira/browse/SPARK-5081
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>            Reporter: Kevin Jung
>            Priority: Critical
>         Attachments: Spark_Debug.pdf, diff.txt
>
>
> The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. 
> At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. 
> I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1.
> It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. 
> In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
> spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org