You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2014/09/11 18:10:35 UTC

[jira] [Updated] (SPARK-2045) Sort-based shuffle implementation

     [ https://issues.apache.org/jira/browse/SPARK-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ash updated SPARK-2045:
------------------------------
    Component/s: Shuffle

> Sort-based shuffle implementation
> ---------------------------------
>
>                 Key: SPARK-2045
>                 URL: https://issues.apache.org/jira/browse/SPARK-2045
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 1.1.0
>
>         Attachments: Sort-basedshuffledesign.pdf
>
>
> Building on the pluggability in SPARK-2044, a sort-based shuffle implementation that takes advantage of an Ordering for keys (or just sorts by hashcode for keys that don't have it) would likely improve performance and memory usage in very large shuffles. Our current hash-based shuffle needs an open file for each reduce task, which can fill up a lot of memory for compression buffers and cause inefficient IO. This would avoid both of those issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org