Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/12/07 16:10:00 UTC

[jira] [Updated] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

     [ https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen updated SPARK-25914:
---------------------------------
    Target Version/s:   (was: 3.0.0)

> Separate projection from grouping and aggregate in logical Aggregate
> --------------------------------------------------------------------
>
>                 Key: SPARK-25914
>                 URL: https://issues.apache.org/jira/browse/SPARK-25914
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wei Xue
>            Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: {{groupingExpressions}} and {{aggregateExpressions}}, in which {{aggregateExpressions}} is actually the result expressions, or in other words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} flattens nested "concat" calls, which causes the query to fail because the expression {{concat('x', a, 's')}} in the SELECT clause references neither a grouping expression nor an aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, in one field: the group-and-aggregate operation and the project operation. There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, the same way we do for physical aggregate classes (e.g., {{HashAggregateExec}} or {{SortAggregateExec}}). Thus, {{groupingExpressions}} would still be the expressions from the GROUP BY clause (as before), but {{aggregateExpressions}} would contain aggregate functions only, and {{resultExpressions}} would be the project list in the SELECT clause, holding references to either {{groupingExpressions}} or {{aggregateExpressions}}.
>   
>  I would say option 1 is the clearer of the two, but it would be more likely to break the pattern matching in existing optimization rules and thus require more changes in the compiler. So we should probably go with option 2. That said, I suggest we reach this goal in two incremental steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as {{groupingExpressions}} and {{aggregateExpressions}}, but change the semantics of {{aggregateExpressions}} by replacing each grouping expression that appears in it with a reference to the corresponding expression in {{groupingExpressions}}. The aggregate expressions in {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only aggregate expressions in {{aggregateExpressions}}.
>   
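The failure described above, and the effect of the Phase 1 change, can be illustrated with a small toy model. This is only a sketch in Python, not Spark's Catalyst code: the classes Col, Lit, Concat and GroupRef and the functions combine_concats, is_valid and bind_grouping are hypothetical stand-ins invented for this illustration. It assumes that an Aggregate's result expression is valid only if it is built from grouping expressions, literals, or references to grouping expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: str


@dataclass(frozen=True)
class Concat:
    children: tuple


@dataclass(frozen=True)
class GroupRef:
    """Hypothetical reference to the i-th grouping expression."""
    index: int


def combine_concats(e):
    """Toy stand-in for CombineConcats: flatten nested Concat nodes."""
    if not isinstance(e, Concat):
        return e
    flat = []
    for child in e.children:
        child = combine_concats(child)
        if isinstance(child, Concat):
            flat.extend(child.children)
        else:
            flat.append(child)
    return Concat(tuple(flat))


def is_valid(e, grouping):
    """True if e uses only grouping expressions, literals, or group references."""
    if e in grouping or isinstance(e, (Lit, GroupRef)):
        return True
    if isinstance(e, Concat):
        return all(is_valid(c, grouping) for c in e.children)
    return False  # e.g. a bare column that is not grouped


def bind_grouping(e, grouping):
    """Phase 1 idea: replace grouping expressions with references to them."""
    if e in grouping:
        return GroupRef(grouping.index(e))
    if isinstance(e, Concat):
        return Concat(tuple(bind_grouping(c, grouping) for c in e.children))
    return e


# SELECT concat('x', concat(a, 's')) ... GROUP BY concat(a, 's')
group_key = Concat((Col("a"), Lit("s")))
select_expr = Concat((Lit("x"), group_key))
grouping = [group_key]

# Today: flattening erases the sub-expression that matched the grouping key.
flattened = combine_concats(select_expr)

# Phase 1: bind first, so the optimizer sees an opaque reference instead.
bound = bind_grouping(select_expr, grouping)
bound_then_flattened = combine_concats(bound)
```

In this model the original select expression is valid, the flattened one is not (the bare column a is no longer wrapped in the grouping expression), and the bound expression stays valid after flattening because the rewrite rule has no nested Concat left to merge.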


