Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/06/10 18:45:21 UTC

[jira] [Assigned] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter

     [ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21039:
------------------------------------

    Assignee: Apache Spark

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> --------------------------------------------------------------------
>
>                 Key: SPARK-21039
>                 URL: https://issues.apache.org/jira/browse/SPARK-21039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.1
>            Reporter: Lovasoa
>            Assignee: Apache Spark
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that the Bloom filters built for each partition of the data are all sent to the driver and merged there one at a time. This merge can be very expensive when the Bloom filters are large. It would be better to use RDD.treeAggregate instead, so that most of the merging runs in parallel on the executors and the driver only has to combine a few partial results.
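
The difference described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's implementation: Bloom filters are modeled as Python integers used as bitsets, merging is a bitwise OR, and the names flat_aggregate, tree_aggregate and fan_in are hypothetical (Spark's RDD.treeAggregate takes a depth parameter instead). The model only counts how many merges end up on the driver.

```python
from functools import reduce


def merge(a: int, b: int) -> int:
    """Merge two toy Bloom filters (integer bitsets) with a bitwise OR."""
    return a | b


def flat_aggregate(partials):
    """Toy model of aggregate: every per-partition filter is merged on the driver."""
    result = 0
    driver_merges = 0
    for partial in partials:
        result = merge(result, partial)
        driver_merges += 1
    return result, driver_merges


def tree_aggregate(partials, fan_in=4):
    """Toy model of treeAggregate: merge in groups away from the driver
    until at most fan_in partial results remain, then finish on the driver."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    # Intermediate levels: in a cluster, each group could be merged by a
    # different executor in parallel. Here they simply run in a loop.
    while len(level) > fan_in:
        level = [
            reduce(merge, level[i:i + fan_in], 0)
            for i in range(0, len(level), fan_in)
        ]
    # Final level: only the few remaining partial results reach the driver.
    result = 0
    driver_merges = 0
    for partial in level:
        result = merge(result, partial)
        driver_merges += 1
    return result, driver_merges
```

For 64 single-bit filters and fan_in=4, both functions return the same merged filter, but the flat version performs 64 merges on the driver while the tree version performs 4. With large filters, each driver-side merge also implies transferring one full filter to the driver, which is the cost the issue is concerned with.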


