Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2015/07/19 20:58:04 UTC

[jira] [Resolved] (SPARK-9021) Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()

     [ https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-9021.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0
                   1.4.2

>  Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-9021
>                 URL: https://issues.apache.org/jira/browse/SPARK-9021
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04 LTS
>            Reporter: Nicholas Hwang
>             Fix For: 1.4.2, 1.5.0
>
>
> Please see the pull request for more information.
> Currently, PySpark runs an unnecessary comboOp on each partition, combining zeroValue with the result of mapPartitions. Since the zeroValue used in this comboOp is the same reference as the zeroValue used by mapPartitions in each partition, unexpected behavior can occur if zeroValue is a mutable object.
> Instead, RDD.aggregate() should perform a reduction over the results of the mapPartitions tasks. This removes the unnecessary initial comboOp on each partition and also corrects the unexpected behavior for mutable zeroValues.
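The reference-sharing problem described above can be sketched in plain Python, with no Spark required. Note that `old_aggregate` and `new_aggregate` below are illustrative stand-ins for the driver-side logic, not Spark's actual implementation; the list-based `seqOp`/`combOp` are hypothetical user functions that mutate their accumulator, which is what triggers the bug.

```python
from copy import deepcopy
from functools import reduce

def seqOp(acc, x):
    # A user-supplied seqOp that mutates its accumulator in place.
    acc.append(x)
    return acc

def combOp(a, b):
    # A user-supplied combOp that also mutates in place.
    a.extend(b)
    return a

def old_aggregate(partitions, zeroValue, seqOp, combOp):
    # Simplified model of the old behavior: each partition folds from a
    # snapshot of zeroValue, but the extra per-partition combOp uses the
    # driver's zeroValue -- the SAME object every time -- so a mutable
    # zeroValue accumulates state across partitions.
    snapshot = deepcopy(zeroValue)
    results = []
    for part in partitions:
        part_result = reduce(seqOp, part, deepcopy(snapshot))
        results.append(combOp(zeroValue, part_result))  # shared reference!
    return reduce(combOp, results)

def new_aggregate(partitions, zeroValue, seqOp, combOp):
    # Fixed shape, reduce(mapPartitions()): fold each partition from its
    # own copy of zeroValue, then reduce over the partition results
    # without ever touching the caller's zeroValue again.
    results = [reduce(seqOp, part, deepcopy(zeroValue)) for part in partitions]
    return reduce(combOp, results, deepcopy(zeroValue))
```

With `partitions = [[1, 2], [3, 4]]` and `zeroValue = []`, the old path duplicates elements (and mutates the caller's zeroValue as a side effect), while the fixed path yields the expected `[1, 2, 3, 4]`.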



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org