You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jem Tucker (JIRA)" <ji...@apache.org> on 2015/07/24 14:05:04 UTC
[jira] [Created] (SPARK-9310) Spark shuffle performance degrades
significantly with an increased number of tasks
Jem Tucker created SPARK-9310:
---------------------------------
Summary: Spark shuffle performance degrades significantly with an increased number of tasks
Key: SPARK-9310
URL: https://issues.apache.org/jira/browse/SPARK-9310
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Environment: 2 node cluster - CDH 5.3.2 on CentOS
Reporter: Jem Tucker
When running a large number of complex stages on high volumes of data shuffle duration increased by a factor of 3 when the parallelism was increased by a factor of 5 from 2000 to 10000.
In both cases tasks run for over a minute (to process approximately 2MB of data with initial parallelisation) so I ruled out any task overhead that could be causing this.
Monitoring IO and network traffic showed that neither were at more than 10% of their potential max during shuffles and CPU utilization seemed worryingly low as well, neither are we experiencing a concerning level of garbage collection.
Is performance of shuffles expected to be so heavily influenced by the number of tasks? If so, is there an effective way to tune the number of partitions at run-time for different inputs?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org