You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "pan3793 (via GitHub)" <gi...@apache.org> on 2023/12/14 10:19:31 UTC

[PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

pan3793 opened a new pull request, #44352:
URL: https://github.com/apache/spark/pull/44352

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR enhanced the analyzer to handle the following pattern properly.
   
   ```
   Sort
    - Filter
      - Aggregate
   ```
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   ```
   spark-sql (default)> CREATE TABLE t1 (flag BOOLEAN, dt STRING);
   
   spark-sql (default)>   SELECT LENGTH(dt),
                      >          COUNT(t1.flag)
                      >     FROM t1
                      > GROUP BY LENGTH(dt)
                      >   HAVING COUNT(t1.flag) > 1
                      > ORDER BY LENGTH(dt);
   [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `dt` cannot be resolved. Did you mean one of the following? [`length(dt)`, `count(flag)`].; line 6 pos 16;
   'Sort ['LENGTH('dt) ASC NULLS FIRST], true
   +- Filter (count(flag)#60L > cast(1 as bigint))
      +- Aggregate [length(dt#9)], [length(dt#9) AS length(dt)#59, count(flag#8) AS count(flag)#60L]
         +- SubqueryAlias spark_catalog.default.t1
            +- Relation spark_catalog.default.t1[flag#8,dt#9] parquet
   ```
   
   The above case demonstrates the failure case, the query failed during the analysis phase when both `HAVING` and `ORDER BY` clauses are present, but successful if only one is present.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   Yes.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   New UTs are added
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1428490859


##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,6 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING
+SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)

Review Comment:
   does it work if we order by a different aggregate function (different from HAVING)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1431354583


##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,27 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column presents on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY k;
+SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column does not present on SELECT list
+SELECT length(k) FROM hav GROUP BY k ORDER BY k;
+SELECT length(k) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
+
+-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function presents on SELECT list
+SELECT length(k) FROM hav GROUP BY k ORDER BY length(k);
+SELECT length(k) FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);
+
+-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function does not present on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY length(k);
+SELECT k FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);

Review Comment:
   ditto for these two tests



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1428769800


##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,6 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING
+SELECT k, sum(v) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY sum(v)

Review Comment:
   Good question!



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveReferencesInSort.scala:
##########
@@ -28,10 +28,11 @@ import org.apache.spark.sql.connector.catalog.CatalogManager
  *    includes metadata columns as well.
  * 2. Resolves the column to a literal function which is allowed to be invoked without braces, e.g.
  *    `SELECT col, current_date FROM t`.
- * 3. If the child plan is Aggregate, resolves the column to [[TempResolvedColumn]] with the output
- *    of Aggregate's child plan. This is to allow Sort to host grouping expressions and aggregate
- *    functions, which can be pushed down to the Aggregate later. For example,
- *    `SELECT max(a) FROM t GROUP BY b ORDER BY min(a)`.

Review Comment:
   Shall we obtain the example?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1860806255

   > It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with AttributReference. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent?
   
   The plan change log shows it indeed caused by `ResolveReferences`(which invokes the virtual rule `ResolveReferencesInSort`), thus I suppose this answer is valid https://github.com/apache/spark/pull/44352#discussion_r1427566432. it should be rule-order independent
   
   <details>
   <summary>master plan change log</summary>
   
   ```
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                 +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                            'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                         +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], [unresolvedalias('u...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                          'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                       +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], [unresolvedalias...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                             +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                           'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                        +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggre...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf('b) = 3)
       +- Aggregate [b#178, c#179], [c...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)                                                                                +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3)                                                    +- Filter (udf(b)#181 = 3)
       +- Aggregate [b#178, ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Project [udf(b)#181, udf(c)#182]
   !+- Filter (udf(b)#181 = 3)                                                                                                                          +- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true
   !   +- Aggregate ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
   !'Project [udf(b)#181, udf(c)#182]                                                                                                                                    Project [udf(b)#181, udf(c)#182]
   !+- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true                                                                                            +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    Project [udf(b)#181, udf(c)#182]                                                                                                                                     Project [udf(b)#181, udf(c)#182]
    +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true                                  +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cas...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Result of Batch Resolution ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    Project [udf(b)#181, udf(c)#182]
   !+- 'UnresolvedHaving ('udf('b) = 3)                                                 +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- Filter (udf(b)#181 = 3)
   !      +- 'UnresolvedRelat...
   ```
   </details>
   
   <details>
   <summary>this patch plan change log</summary>
   
   ```
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                 +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                            'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                         +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], [unresolvedalias('u...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                          'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                       +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], [unresolvedalias...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                             +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                           'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                        +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggre...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf('b) = 3)
       +- Aggregate [b#178, c#179], [c...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                                                                                                 +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)                                                                                +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3)                                                    +- Filter (udf(b)#181 = 3)
       +- Aggregate [b#178, ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                                                                                    'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                                                                                    ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
   !'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true                  Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                           ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
   !Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true   Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                                                                                                ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Result of Batch Resolution ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true                    Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                                 +- Filter (udf(b)#181 = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))]      +- Aggregate [b#178, c#179], [cast(udf(cast(b#178 as string)) as int) AS udf(b)#181, cast(udf(cast(c#179 as string)) as str...
   ```
   </details>
   
   > I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.
   
   BTW, I think this may be out scope of this PR. The key point of this PR is, `ORDER BY` should have the same resolve result whether `HAVING` clause is present or not.
   
   ```patch
     SELECT xxx
     FROM xxx
     GROUP BY xxx
   + HAVING xxx
     ORDER BY xxx
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1860450330

   > I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.
   
   @cloud-fan I believe Spark already supports it when HVAING is absent, but does not work if HAVING is present, that why I say "maybe we can call it a bugfix"
   
   ```patch
     SELECT xxx
     FROM xxx
     GROUP BY xxx
   + HAVING xxx
     ORDER BY xxx
   ```
   
   >  It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with `AttributReference`. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent?
   
   Let me try.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1860426324

   I think we should dig into https://github.com/apache/spark/pull/44352#discussion_r1427552666 more. It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with `AttributReference`. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1431429678


##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,27 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column presents on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY k;
+SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;

Review Comment:
   yes, should I only retain the fixed SQL？



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427596518


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   i think so, the plan shows it eliminates some unnecessary column propagation across operators



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427566432


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   Though this query has both `HAVING` and `ORDER BY` clauses, but only scalar function is present in `ORDER BY` clause, I think the previous resolution matches item 4 of `ResolveReferencesInSort` comments.
   
   > 4. Resolves the column to [[AttributeReference]] with the output of a descendant plan node.
   >    Spark will propagate the missing attributes from the descendant plan node to the Sort node.
   >    This is to allow users to ORDER BY columns that are not in the SELECT clause, which is
   >    widely supported in other SQL dialects. For example, `SELECT a FROM t ORDER BY b`.
   
   With this patch, it should match item 3
   
   > 3. If the child plan is Aggregate or Filter(_, Aggregate), resolves the column to
   >    [[TempResolvedColumn]] with the output of Aggregate's child plan.
   >    This is to allow Sort to host grouping expressions and aggregate functions, which can
   >    be pushed down to the Aggregate later. For example,
   >    `SELECT max(a) FROM t GROUP BY b HAVING max(a) > 1 ORDER BY min(a)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1431351538


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:
##########
@@ -1415,6 +1415,20 @@ class AnalysisSuite extends AnalysisTest with Matchers {
     )
   }
 
+  test("SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING") {
+    assertAnalysisSuccess(
+      parsePlan(
+        """
+          |WITH t1 as (SELECT 1 id, 'one' name)

Review Comment:
   Since this just checks analysis success, we can remove it if the test is already in golden file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.

yaooqinn closed pull request #44352: [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING
URL: https://github.com/apache/spark/pull/44352


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "yaooqinn (via GitHub)" <gi...@apache.org>.

yaooqinn commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1864082194

   Thanks, merged to master. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1862624266

   CI failure seems irrelevant
   https://github.com/pan3793/spark/actions/runs/7259161309/job/19780922398
   ```
   Notice:  A new release of pip is available: 23.3.1 -> 23.3.2
   Notice:  To update, run: python3.9 -m pip install --upgrade pip
   ERROR:  Error installing bundler:
   	The last version of bundler (>= 0) to support your Ruby & RubyGems was 2.4.22. Try installing it with `gem install bundler -v 2.4.22`
   	bundler requires Ruby version >= 3.0.0. The current ruby version is 2.7.0.0.
   Error: Process completed with exit code 1.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427800437


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   both Analyzed plan and Optimized plan are changed :)
   
   before:
   ```
   == Analyzed Logical Plan ==
   udf(b): int, udf(c): double
   Project [udf(b)#24, udf(c)#25]
   +- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
      +- Filter (udf(b)#24 = 3)
         +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
            +- SubqueryAlias spark_catalog.default.test_having
               +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet
   
   == Optimized Logical Plan ==
   Project [udf(b)#24, udf(c)#25]
   +- Sort [udf(b#21) ASC NULLS FIRST, udf(cast(c#22 as double)) ASC NULLS FIRST], true
      +- Aggregate [b#21, c#22], [udf(b#21) AS udf(b)#24, udf(cast(c#22 as double)) AS udf(c)#25, b#21, c#22]
         +- Project [b#21, c#22]
            +- Filter (isnotnull(b#21) AND (udf(b#21) = 3))
               +- Relation spark_catalog.default.test_having[a#20,b#21,c#22,d#23] parquet
   ```
   
   after:
   ```
   == Analyzed Logical Plan ==
   udf(b): int, udf(c): double
   Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
   +- Filter (udf(b)#9 = 3)
      +- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
         +- SubqueryAlias spark_catalog.default.test_having
            +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet
   
   == Optimized Logical Plan ==
   Sort [udf(b)#9 ASC NULLS FIRST, udf(c)#10 ASC NULLS FIRST], true
   +- Aggregate [b#6, c#7], [udf(b#6) AS udf(b)#9, udf(cast(c#7 as double)) AS udf(c)#10]
      +- Project [b#6, c#7]
         +- Filter (isnotnull(b#6) AND (udf(b#6) = 3))
            +- Relation spark_catalog.default.test_having[a#5,b#6,c#7,d#8] parquet
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427477125


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -323,12 +323,14 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
   }
 
   // Resolves `UnresolvedAttribute` to `TempResolvedColumn` via `plan.child.output` if plan is an
-  // `Aggregate`. If `TempResolvedColumn` doesn't end up as aggregate function input or grouping
-  // column, we will undo the column resolution later to avoid confusing error message. E,g,, if
+  // `Aggregate` or `Filter(_, Aggregate)`.
+  // If `TempResolvedColumn` doesn't end up as aggregate function input or grouping
+  // column, we will undo the column resolution later to avoid confusing error message. E.g., if
   // a table `t` has columns `c1` and `c2`, for query `SELECT ... FROM t GROUP BY c1 HAVING c2 = 0`,
   // even though we can resolve column `c2` here, we should undo it and fail with
   // "Column c2 not found".
   protected def resolveColWithAgg(e: Expression, plan: LogicalPlan): Expression = plan match {
+    case Filter(_, agg: Aggregate) => resolveColWithAgg(e, agg)

Review Comment:
   Cann we move this into `ResolveReferencesInSort`. I think it is more clearer to make the call side resolve aggregate through filter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427591478


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   oh so the plan is actually more efficient?



##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   oh so the plan is actually more efficient now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1855834233

   cc @cloud-fan @wangyum @yaooqinn @ulysses-you 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1428490301


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:
##########
@@ -1415,6 +1415,20 @@ class AnalysisSuite extends AnalysisTest with Matchers {
     )
   }
 
+  test("SPARK-28386: Cannot resolve ORDER BY columns with GROUP BY and HAVING") {
+    assertAnalysisSuccess(
+      parsePlan(
+        """
+          |WITH t1 as (SELECT 1 id, 'one' name)

Review Comment:
   we can put this in golden file test as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1858818617

   > does this query work in other databases?
   
   Actually, it was reported by my colleague that the same SQL works on Impala but not Spark. I will investigate other popular RDBMS.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.

ulysses-you commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427717512


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   it is an analyzed plan, the optimized plan should be same with pr ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1858490137

   does this query work in other databases?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427525939


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -323,12 +323,14 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
   }
 
   // Resolves `UnresolvedAttribute` to `TempResolvedColumn` via `plan.child.output` if plan is an
-  // `Aggregate`. If `TempResolvedColumn` doesn't end up as aggregate function input or grouping
-  // column, we will undo the column resolution later to avoid confusing error message. E,g,, if
+  // `Aggregate` or `Filter(_, Aggregate)`.
+  // If `TempResolvedColumn` doesn't end up as aggregate function input or grouping
+  // column, we will undo the column resolution later to avoid confusing error message. E.g., if
   // a table `t` has columns `c1` and `c2`, for query `SELECT ... FROM t GROUP BY c1 HAVING c2 = 0`,
   // even though we can resolve column `c2` here, we should undo it and fail with
   // "Column c2 not found".
   protected def resolveColWithAgg(e: Expression, plan: LogicalPlan): Expression = plan match {
+    case Filter(_, agg: Aggregate) => resolveColWithAgg(e, agg)

Review Comment:
   Thanks, changed as suggested



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427552666


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   do you know why the plan is changed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1432220550


##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,27 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column presents on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY k;
+SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;

Review Comment:
   updated, only retain the fixed SQL now



##########
sql/core/src/test/resources/sql-tests/inputs/having.sql:
##########
@@ -33,3 +33,27 @@ SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY GROUPING SETS(t.c1) HAVING t.
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY CUBE(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY ROLLUP(t.c1) HAVING t.c1 = 1;
 SELECT c1 FROM VALUES (1, 2) as t(c1, c2) GROUP BY t.c1 HAVING t.c1 = 1;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column presents on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY k;
+SELECT k FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
+
+-- SPARK-28386: Resolve ORDER BY column with/without HAVING clause, while the column does not present on SELECT list
+SELECT length(k) FROM hav GROUP BY k ORDER BY k;
+SELECT length(k) FROM hav GROUP BY k HAVING sum(v) > 2 ORDER BY k;
+
+-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function presents on SELECT list
+SELECT length(k) FROM hav GROUP BY k ORDER BY length(k);
+SELECT length(k) FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);
+
+-- SPARK-28386: Resolve ORDER BY scalar function with/without HAVING clause, while the scalar function does not present on SELECT list
+SELECT k FROM hav GROUP BY k ORDER BY length(k);
+SELECT k FROM hav GROUP BY k HAVING max(v) > 2 ORDER BY length(k);

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1862325392

   @cloud-fan @beliefer I have updated the test cases as requested, please take another look when you have time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on code in PR #44352:
URL: https://github.com/apache/spark/pull/44352#discussion_r1427566432


##########
sql/core/src/test/resources/sql-tests/analyzer-results/udf/postgreSQL/udf-select_having.sql.out:
##########
@@ -102,12 +102,11 @@ Project [udf(b)#x, udf(c)#x]
 SELECT udf(b), udf(c) FROM test_having
 	GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
 -- !query analysis
-Project [udf(b)#x, udf(c)#x]

Review Comment:
   I think the previous resolution matches item 4 of `ResolveReferencesInSort` comments.
   
   > 4. Resolves the column to [[AttributeReference]] with the output of a descendant plan node.
   >    Spark will propagate the missing attributes from the descendant plan node to the Sort node.
   >    This is to allow users to ORDER BY columns that are not in the SELECT clause, which is
   >    widely supported in other SQL dialects. For example, `SELECT a FROM t ORDER BY b`.
   
   With this patch, it should match item 3
   
   > 3. If the child plan is Aggregate or Filter(_, Aggregate), resolves the column to
   >    [[TempResolvedColumn]] with the output of Aggregate's child plan.
   >    This is to allow Sort to host grouping expressions and aggregate functions, which can
   >    be pushed down to the Aggregate later. For example,
   >    `SELECT max(a) FROM t GROUP BY b HAVING max(a) > 1 ORDER BY min(a)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1860415356

   In the SQL standard, ORDER BY can only reference columns in the SELECT list, but many databases extend it to support other cases. I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org