Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2014/12/04 13:20:12 UTC

[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey

    [ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234165#comment-14234165 ] 

Apache Spark commented on SPARK-4743:
-------------------------------------

User 'IvanVergiliev' has created a pull request for this issue:
https://github.com/apache/spark/pull/3605

> Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-4743
>                 URL: https://issues.apache.org/jira/browse/SPARK-4743
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Ivan Vergiliev
>              Labels: performance
>
> aggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial (zero) value. This means the Java serializer is always used, which can be very expensive when there are many groups. Calling combineByKey manually and using the normal serializer instead of the closure serializer improved performance on the dataset I'm testing with by about 30-35%.
> I'm not familiar enough with the codebase to be certain that replacing the serializer here is safe, but it works correctly in my tests, and only a single value of type U is being serialized; that type should be handled by the default serializer, since it can also be the output of a job. Let me know if I'm missing anything.
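
For context, below is a minimal sketch of the user-level workaround described in the quoted report: calling combineByKey directly and cloning the zero value per key with SparkEnv.get.serializer (Kryo, if configured) instead of the closure serializer (always Java). The object name, local-mode setup, and sample data are illustrative, not taken from the ticket or the pull request.

    import java.nio.ByteBuffer

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    object ManualCombineByKeySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("manual-combineByKey").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Zero value and aggregation functions, as they would be passed to aggregateByKey.
        val zeroValue: Int = 0
        val seqOp: (Int, Int) => Int = _ + _
        val combOp: (Int, Int) => Int = _ + _

        // Serialize the zero value once on the driver with the data serializer
        // (SparkEnv.get.serializer) rather than the closure serializer. Only the
        // resulting byte array is captured by the closure below.
        val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
        val zeroBytes = new Array[Byte](zeroBuffer.limit)
        zeroBuffer.get(zeroBytes)

        // Deserialize a fresh copy of the zero value for each key; SparkEnv.get is
        // evaluated at call time, so on executors it resolves to the executor's env.
        def createZero(): Int =
          SparkEnv.get.serializer.newInstance().deserialize[Int](ByteBuffer.wrap(zeroBytes))

        val sums = pairs.combineByKey(
          (v: Int) => seqOp(createZero(), v), // createCombiner
          seqOp,                              // mergeValue
          combOp)                             // mergeCombiners
        sums.collect().foreach(println)
        sc.stop()
      }
    }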



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org