You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Aaron Davidson (JIRA)" <ji...@apache.org> on 2014/05/22 20:31:02 UTC

[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

    [ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006262#comment-14006262 ] 

Aaron Davidson commented on SPARK-983:
--------------------------------------

[~aash] Note that the current implementation actually just sorts within each partition, using an initial pass from a RangePartitioner.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>            Assignee: Aaron Davidson
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a buffer to hold the entire partition, then sorts it. This will cause an OOM if an entire partition cannot fit in memory, which is especially problematic for skewed data. Rather than OOMing, the behavior should be similar to the [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], where we fallback to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)