You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Jason Altekruse (Jira)" <ji...@apache.org> on 2020/01/15 22:40:00 UTC

[jira] [Created] (SPARK-30523) Collapse back to back aggregations into a single aggregate to reduce the number of shuffles

Jason Altekruse created SPARK-30523:
---------------------------------------

             Summary: Collapse back to back aggregations into a single aggregate to reduce the number of shuffles
                 Key: SPARK-30523
                 URL: https://issues.apache.org/jira/browse/SPARK-30523
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 3.0.0
            Reporter: Jason Altekruse


Queries containing nested aggregate operations can in some cases be computable with a single phase of aggregation. This Jira seeks to introduce a new optimizer rule to identify some of those cases and rewrite plans to avoid needlessly re-shuffling and generating the aggregation hash table data twice.

Some examples of collapsible aggregates:
{code:java}
SELECT sum(sumAgg) as a, year from (
      select sum(1) as sumAgg, course, year FROM courseSales GROUP BY course, year
) group by year

// can be collapsed to
SELECT sum(1) as `a`, year from courseSales group by year
{code}
{code}
SELECT sum(agg), min(a), b from (
     select sum(1) as agg, a, b FROM testData2 GROUP BY a, b
     ) group by b

// can be collapsed to
SELECT sum(1) as `sum(agg)`, min(a) as `min(a)`, b from testData2 group by b
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org