Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2014/09/05 21:01:28 UTC

[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

    [ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123390#comment-14123390 ] 

Andrew Ash commented on SPARK-3280:
-----------------------------------

[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 3200 partitions in your chart?  My interpretation is that both shuffle implementations behave similarly in this scenario up to ~1600 partitions, after which the hash-based one starts falling behind, and then there's another step at 3200 where it hits a severe dropoff.  I'm interested in the right third of the chart.

A couple of theories:
- more partitions = more stuff in memory concurrently = GC pressure.  Sort-based can stream and do a merge sort, but hash-based needs to build the whole hash table at once and then spill it.
- more partitions = more concurrent spills = disk thrashing while writing to lots of files concurrently, exacerbated if the test was run on spinning disks instead of SSDs.  Maybe the sort-based implementation merges spills while writing to disk, so it ends up writing fewer spill files concurrently.
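
To put rough numbers on the second theory, here is a back-of-the-envelope sketch. It assumes the hash-based shuffle (without file consolidation) keeps one open writer per reduce partition in each running map task, while the sort-based shuffle writes a single final data file per map task. The function names and the core count of 8 are made up for illustration and are not taken from the benchmark. Spill files are not modeled.

```python
def hash_shuffle_open_files(concurrent_map_tasks, reduce_partitions):
    """Files open at once, assuming each running map task keeps
    one writer per reduce partition (no file consolidation)."""
    return concurrent_map_tasks * reduce_partitions


def sort_shuffle_output_files(concurrent_map_tasks):
    """Final data files being written at once, assuming each running
    map task writes a single output file (index files and
    intermediate spill files are not counted)."""
    return concurrent_map_tasks


if __name__ == "__main__":
    cores = 8  # hypothetical number of concurrent map tasks on one machine
    for partitions in (1600, 2800, 3200):  # partition counts discussed above
        print(
            partitions,
            hash_shuffle_open_files(cores, partitions),
            sort_shuffle_output_files(cores),
        )
```

This only shows how the number of concurrently written files grows with the partition count under those assumptions. It doesn't by itself explain why there would be a step at 3200.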

Also, the chart is a little unclear: is the y-axis time in seconds?

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>         Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org