Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2014/09/05 21:01:28 UTC
[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
[ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390 ]
Andrew Ash commented on SPARK-3280:
-----------------------------------
[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 3200 partitions in your chart? My interpretation is that both shuffle implementations behave similarly in this scenario up to ~1600 partitions, after which the hash-based one starts falling behind; then there's another step at 3200 where it hits a severe dropoff. I'm mostly interested in the right third of the chart.
A couple of theories:
- more partitions = more data in memory concurrently = GC pressure. Sort-based can stream and do a merge sort, but hash-based needs to build the hash table all at once and then spill it
- more partitions = more concurrent spills = disk thrashing from writing to lots of files concurrently, exacerbated if the test ran on spinning disks instead of SSDs. Maybe sort-based merges spills while writing to disk, so it ends up writing fewer spill files concurrently.
Also, the chart is a little unclear: is the y-axis time in seconds?
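To put rough numbers on the second theory above, here is a back-of-the-envelope sketch in Python. It assumes the "partitions" axis in the chart is the number of reduce partitions, that hash-based shuffle (without file consolidation) keeps one open writer per reduce partition in each running map task, and that sort-based shuffle writes a single output file per map task (temporary spill files are ignored). The function names and the figure of 8 concurrent map tasks per machine are made up for illustration, not taken from the benchmark.

```python
def hash_shuffle_open_files(concurrent_map_tasks: int, reduce_partitions: int) -> int:
    """Assumed model: every running map task holds one writer per reduce partition."""
    return concurrent_map_tasks * reduce_partitions


def sort_shuffle_open_files(concurrent_map_tasks: int) -> int:
    """Assumed model: every running map task writes one sorted output file
    (temporary spill files are not counted here)."""
    return concurrent_map_tasks


if __name__ == "__main__":
    concurrent_tasks = 8  # hypothetical number of map tasks running at once on a machine
    for partitions in (1600, 2800, 3200):
        print(
            partitions,
            hash_shuffle_open_files(concurrent_tasks, partitions),
            sort_shuffle_open_files(concurrent_tasks),
        )
```

Under these assumptions the number of files being written at once grows linearly with the partition count for hash-based but stays flat for sort-based, which would fit the disk-thrashing theory. It does not by itself explain why there would be a sharp step at 3200 rather than a gradual decline.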
> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing.