Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2015/05/21 23:34:17 UTC

[jira] [Resolved] (SPARK-7718) Speed up data source partitioning by avoiding cleaning closures

     [ https://issues.apache.org/jira/browse/SPARK-7718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-7718.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

Issue resolved by pull request 6256
[https://github.com/apache/spark/pull/6256]

> Speed up data source partitioning by avoiding cleaning closures
> ---------------------------------------------------------------
>
>                 Key: SPARK-7718
>                 URL: https://issues.apache.org/jira/browse/SPARK-7718
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> The new partitioning support strategy creates many RDDs (one per partition, potentially several thousand), then calls `mapPartitions` on every single one of them. This causes us to clean the same closure many times. Since the closure is provided by Spark itself, we know for sure it is serializable, so we can bypass the cleaning for performance.
> According to [~yhuai], cleaning 5000 closures takes up to 6-7 seconds of a 12-second job that involves data source partitioning.
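
A minimal sketch of the idea in Scala, assuming a hypothetical internal variant of `mapPartitions` that skips the ClosureCleaner. The names below are illustrative only; the actual change is in the pull request linked above:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object ClosureCleaningSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("closure-cleaning-sketch").setMaster("local[*]"))

        // One RDD per partition directory, as the data source partitioning
        // strategy does; in practice this can be several thousand RDDs.
        val perPartitionRdds: Seq[RDD[Int]] =
          (1 to 1000).map(i => sc.parallelize(Seq(i)))

        // The closure is defined once, inside Spark, and is known to be
        // serializable, so cleaning it once per RDD is pure overhead.
        val f = (iter: Iterator[Int]) => iter.map(_ * 2)

        // Before the fix: each mapPartitions call runs the ClosureCleaner on `f`.
        val mapped = perPartitionRdds.map(_.mapPartitions(f))

        // After the fix (conceptually): an internal variant hands `f` straight
        // to the RDD without cleaning it. Such a method is private to Spark,
        // so the call below is illustrative only and will not compile in
        // user code:
        // val fast = perPartitionRdds.map(_.mapPartitionsInternal(f))

        sc.stop()
      }
    }

Skipping the cleaner is safe here only because the closure is defined inside Spark and known to be serializable; user-supplied closures still need cleaning.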



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org