Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2014/12/19 01:34:14 UTC

[jira] [Updated] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey

     [ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-4743:
------------------------------
    Assignee: Ivan Vergiliev

> Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-4743
>                 URL: https://issues.apache.org/jira/browse/SPARK-4743
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Ivan Vergiliev
>            Assignee: Ivan Vergiliev
>              Labels: performance
>             Fix For: 1.3.0
>
>
> aggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial (zero) value. This means the Java serializer is always used, which can be very expensive when there is a large number of groups. Calling combineByKey manually and using the normal serializer instead of the closure one improved performance on the dataset I'm testing with by about 30-35%.
> I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and it only serializes a single value of type U, which should be serializable by the default serializer since it can be the output of a job. Let me know if I'm missing anything.
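> Roughly, the change in the title would look something like the sketch below. This is only an illustration of the idea, not an actual patch: the method signature is abbreviated, and the helper names (zeroBuffer, cachedSerializer, createZero) are placeholders inside PairRDDFunctions[K, V].
>
> import java.nio.ByteBuffer
> import scala.reflect.ClassTag
> import org.apache.spark.SparkEnv
>
> def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)
>     (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] = {
>   // Serialize the zero value once on the driver with the generic serializer
>   // (Kryo when configured) instead of the Java-only closure serializer.
>   val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
>   val zeroArray = new Array[Byte](zeroBuffer.limit)
>   zeroBuffer.get(zeroArray)
>
>   // Deserialize a fresh copy of the zero value for every key, since seqOp may mutate it.
>   lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
>   val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))
>
>   combineByKey[U]((v: V) => seqOp(createZero(), v), seqOp, combOp, partitioner)
> }
>
> foldByKey would get the same treatment, since it serializes its zero value the same way.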



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org