Posted to issues@spark.apache.org by "Jem Tucker (JIRA)" <ji...@apache.org> on 2015/07/24 14:05:04 UTC

[jira] [Created] (SPARK-9310) Spark shuffle performance degrades significantly with an increased number of tasks

Jem Tucker created SPARK-9310:
---------------------------------

             Summary: Spark shuffle performance degrades significantly with an increased number of tasks
                 Key: SPARK-9310
                 URL: https://issues.apache.org/jira/browse/SPARK-9310
             Project: Spark
          Issue Type: Bug
          Components: Shuffle
    Affects Versions: 1.2.0
         Environment: 2 node cluster - CDH 5.3.2 on CentOS 
            Reporter: Jem Tucker


When running a large number of complex stages over high volumes of data, shuffle duration increased by a factor of 3 when the parallelism was increased by a factor of 5, from 2000 to 10000 partitions.
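
(For reference, a minimal sketch of how parallelism is set at this level; the report does not show how the job was configured, so the config key, input path, and job shape below are assumptions for illustration only.)

{code}
// Hypothetical reproduction sketch (Spark 1.2 RDD API). The config key,
// input path, and job shape are assumptions; the original job is not shown.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-parallelism-test")
  // Default partition count used by shuffles (groupByKey, reduceByKey, ...)
  // when no explicit count is passed: 2000 in the fast run, 10000 in the slow one.
  .set("spark.default.parallelism", "2000")

val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///path/to/input") // placeholder path
  .map(line => (math.abs(line.hashCode) % 1000, line))

pairs.groupByKey().count() // the shuffle whose duration degrades as described
{code}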

In both cases tasks run for over a minute (each processing approximately 2MB of data at the initial parallelisation), so I have ruled out per-task overhead as the cause.

Monitoring IO and network traffic during shuffles showed that neither was at more than 10% of its potential maximum, CPU utilization seemed worryingly low as well, and we are not experiencing a concerning level of garbage collection.

Is shuffle performance expected to be so heavily influenced by the number of tasks? If so, is there an effective way to tune the number of partitions at run-time for different inputs?
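
For illustration, one way we could pick the partition count at run-time is sketched below (reusing sc from the snippet above); the 128MB-per-partition target and the overall approach are assumptions on my part rather than an established tuning rule. Would something like this be a sensible direction?

{code}
// Hypothetical run-time tuning sketch (Spark 1.2 RDD API): scale the
// shuffle partition count with input size instead of hard-coding it.
// The 128MB target per partition is an illustrative assumption.
import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "hdfs:///path/to/input" // placeholder path
val targetBytesPerPartition = 128L * 1024 * 1024

val fs = FileSystem.get(sc.hadoopConfiguration)
val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength

// Derive a count from input size, never dropping below the default parallelism.
val numPartitions =
  math.max((inputBytes / targetBytesPerPartition).toInt, sc.defaultParallelism)

val tuned = sc.textFile(inputPath)
  .map(line => (math.abs(line.hashCode) % 1000, line))
  .groupByKey(numPartitions) // pass the computed count directly to the shuffle
{code}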




