
Posted to issues@spark.apache.org by "Patrick Wendell (JIRA)" <ji...@apache.org> on 2014/11/22 02:16:34 UTC

[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

    [ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221690#comment-14221690 ] 

Patrick Wendell commented on SPARK-4550:
----------------------------------------

I'm not an expert on the internals of this component, but do we need a way of ordering or comparing serialized objects for this to work?

> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>
> One drawback of sort-based shuffle compared to hash-based shuffle is that it ends up storing many more Java objects in memory. If Spark could store map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are independent of each other and occupy contiguous segments of memory. For example, when Kryo reference tracking is left on, objects may contain pointers to objects farther back in the stream, which means that the sort can't relocate objects without corrupting them.
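
A rough sketch of the idea in the description, for illustration only. It is written in Python with the standard pickle module rather than Spark's Scala code or Kryo, and every name in it (partition_of, write_records, sort_by_partition, read_partition, shared_stream_chunks, decodes_alone) is hypothetical. It assumes the sort only needs to order records by partition ID, which is kept in a side index next to each record's offset and length, so the serialized bytes themselves are never compared. The second half uses a single shared Pickler as a loose analogy for reference tracking: its memo spans records, so a later record can refer back to an earlier one and no longer decodes once separated from it.

```python
import io
import pickle
import zlib


def partition_of(key, num_partitions):
    # Deterministic stand-in for a partitioner (hypothetical helper).
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def write_records(records, num_partitions):
    """Serialize each record independently into one contiguous buffer."""
    buf = bytearray()
    index = []  # entries are (partition_id, offset, length)
    for key, value in records:
        # A fresh pickle.dumps call per record, so no record refers
        # back to bytes written for an earlier record.
        payload = pickle.dumps((key, value))
        index.append((partition_of(key, num_partitions), len(buf), len(payload)))
        buf += payload
    return buf, index


def sort_by_partition(buf, index):
    """Relocate serialized records into partition order without decoding them."""
    out = bytearray()
    new_index = []
    # Only the partition ID is compared; the payload bytes are never inspected.
    for partition_id, offset, length in sorted(index, key=lambda e: e[0]):
        new_index.append((partition_id, len(out), length))
        out += buf[offset:offset + length]
    return out, new_index


def read_partition(buf, index, partition_id):
    """Decode only the records that belong to one partition."""
    return [pickle.loads(bytes(buf[o:o + n])) for p, o, n in index if p == partition_id]


def shared_stream_chunks(objs):
    """Serialize objects through ONE Pickler, whose memo spans all records."""
    stream = io.BytesIO()
    pickler = pickle.Pickler(stream)
    chunks, start = [], 0
    for obj in objs:
        pickler.dump(obj)
        end = stream.tell()
        chunks.append(stream.getvalue()[start:end])
        start = end
    return chunks


def decodes_alone(chunk):
    """Report whether a serialized chunk can be decoded in isolation."""
    try:
        pickle.loads(chunk)
        return True
    except Exception:
        return False
```

With independently serialized records, sort_by_partition can move the byte ranges freely and read_partition still recovers every record. With shared_stream_chunks, dumping the same object twice makes the second chunk a back-reference into the first, so it fails to decode on its own, which is the kind of corruption the description warns about when records are relocated.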


