Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/08/04 20:12:00 UTC

[jira] [Commented] (SPARK-36419) Move final aggregation in RDD.treeAggregate to executor

    [ https://issues.apache.org/jira/browse/SPARK-36419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393429#comment-17393429 ] 

Apache Spark commented on SPARK-36419:
--------------------------------------

User 'akpatnam25' has created a pull request for this issue:
https://github.com/apache/spark/pull/33643

> Move final aggregation in RDD.treeAggregate to executor
> -------------------------------------------------------
>
>                 Key: SPARK-36419
>                 URL: https://issues.apache.org/jira/browse/SPARK-36419
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Tests
>    Affects Versions: 3.0.0
>            Reporter: Aravind Patnam
>            Priority: Minor
>
> For the last iteration in RDD.treeAggregate, Spark relies on RDD.fold as an implementation detail.
> RDD.fold pulls all shuffle partitions to the driver and merges the results there (sketched below).
> There are two concerns with this:
> a) The shuffle machinery at the executors is much more robust and fault tolerant than fetching results to the driver.
> b) The driver is a single point of failure in a Spark application. If pulling partitions to the driver causes a nontrivial increase in memory pressure, or if computing the aggregated state (in user code) increases memory usage, this can lead to driver failures.
> For treeAggregate, instead of relying on fold for the last iteration, we should (optionally) perform the final computation at a single reducer and fetch only the merged result to the driver (see the second sketch below).
> Additional cost: one extra stage with a single (resulting) partition.
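
For context, here is a minimal, self-contained sketch of the call path described above. The data, partition count, and depth are illustrative, but the treeAggregate API shown is the real one: the tree levels shrink the partition count executor-side, and today the surviving partitions are merged on the driver (via RDD.fold, per the description).

    import org.apache.spark.{SparkConf, SparkContext}

    object TreeAggregateToday {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("tree-agg-sketch").setMaster("local[4]"))

        // 64 partitions of input feed the aggregation tree.
        val data = sc.parallelize(1L to 1000000L, numSlices = 64)

        // treeAggregate combines partitions in `depth` tree levels on the
        // executors; the final merge of the remaining partitions currently
        // happens driver-side.
        val sum = data.treeAggregate(0L)(
          seqOp = (acc, x) => acc + x, // fold an element into a partial sum
          combOp = (a, b) => a + b,    // merge two partial sums
          depth = 2)

        println(s"sum = $sum")
        sc.stop()
      }
    }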
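And a hedged sketch of the proposed shape, not the actual patch in the pull request above: push the final combOp into one extra single-partition stage, so that only one already-merged record is fetched to the driver. The helper name and structure here are hypothetical.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    object ExecutorSideFinalMerge {
      // Merge the partially aggregated partitions at a single reducer,
      // then bring exactly one result record back to the driver.
      def finalMergeOnExecutor[U: ClassTag](
          partiallyAggregated: RDD[U],
          zeroValue: U,
          combOp: (U, U) => U): U = {
        partiallyAggregated
          .coalesce(numPartitions = 1, shuffle = true) // the one extra stage
          .mapPartitions { iter =>                     // runs on an executor
            Iterator.single(iter.foldLeft(zeroValue)(combOp))
          }
          .collect()                                   // fetches a single record
          .headOption
          .getOrElse(zeroValue)
      }
    }

The trade-off matches the description: one extra shuffle stage with a single resulting partition, in exchange for keeping the heavy merge off the driver and behind Spark's fault-tolerant shuffle path.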



