Posted to issues@spark.apache.org by "Kevin Jung (JIRA)" <ji...@apache.org> on 2015/01/05 03:42:35 UTC

[jira] [Created] (SPARK-5081) Shuffle write increases

Kevin Jung created SPARK-5081:
---------------------------------

             Summary: Shuffle write increases
                 Key: SPARK-5081
                 URL: https://issues.apache.org/jira/browse/SPARK-5081
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.2.0
            Reporter: Kevin Jung


The size of shuffle write shown in the Spark web UI differs significantly when I execute the same Spark job with the same input data on both Spark 1.1 and Spark 1.2.
At the sortBy stage, the shuffle write size is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2.
I set the spark.shuffle.manager option to hash because its default value changed in 1.2, but Spark 1.2 still writes more shuffle output than Spark 1.1.
The disk I/O overhead grows disproportionately as the input file gets bigger: with about 100GB of input, for example, the shuffle write size is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
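For reference, a minimal sketch of the kind of job described above (the actual job code is not included in this report, so the input paths and the use of identity as the sort key are assumptions). The spark.shuffle.manager setting forces the hash-based shuffle, since Spark 1.2 changed the default to sort:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleWriteRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-5081 repro")
      // Force the hash shuffle manager, since Spark 1.2
      // changed the default from "hash" to "sort".
      .set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)

    // Read the input and sort it; the "Shuffle Write" column for
    // this stage appears on the web UI's stage page.
    val lines = sc.textFile(args(0))
    val sorted = lines.sortBy(identity)
    sorted.saveAsTextFile(args(1))

    sc.stop()
  }
}
```

Running the same jar against both a 1.1 and a 1.2 cluster and comparing the stage's shuffle write metric in the UI would reproduce the numbers quoted above.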



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org