You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/12 20:31:31 UTC

[GitHub] [spark] tanelk opened a new pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

tanelk opened a new pull request #30018:
URL: https://github.com/apache/spark/pull/30018


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values. 
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   Performance improvements - few TPCDS queries have these kinds of duplicate aggregates.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   UT


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r510364986



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.MultiAlias
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}
+
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       I extracted this to #30134




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r510364986



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.MultiAlias
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}
+
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       I extracted this to ##30134




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714279677


   **[Test build #130124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130124/testReport)** for PR 30018 at commit [`fab0427`](https://github.com/apache/spark/commit/fab04270ff669c7956788522bdaaced4bac25bd3).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717601024


   **[Test build #130339 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130339/testReport)** for PR 30018 at commit [`12d1bf4`](https://github.com/apache/spark/commit/12d1bf4a7e0a333e8d5eb19e9d12d9c6f489f5fb).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714234038


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34732/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790799426


   **[Test build #135757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135757/testReport)** for PR 30018 at commit [`37dc4b1`](https://github.com/apache/spark/commit/37dc4b11447155c8ac6a2d855e5e23ffdefda74c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-803838732


   > The affected part of the query plan for TPCDS q87:
   
   Why is the golden file of TPCDS q87 not updated in this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802577269


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136235/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707332154


   I'll try do get the actual performance change soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707360803






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708601830


   **[Test build #129753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129753/testReport)** for PR 30018 at commit [`29701dc`](https://github.com/apache/spark/commit/29701dc197fc552c43fe60476b6fa5ed2fef0a95).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713366637






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717459807


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130336/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708677789


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34359/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748215362


   **[Test build #133028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133028/testReport)** for PR 30018 at commit [`6d68718`](https://github.com/apache/spark/commit/6d687181170acdc03cf46ed8182a81a88245d9c5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717601699






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-749001447


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133141/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710140964






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717455802


   **[Test build #130336 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130336/testReport)** for PR 30018 at commit [`67861f9`](https://github.com/apache/spark/commit/67861f9b6f709a4828adf0260acbfe5641203194).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714226171


   **[Test build #130115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130115/testReport)** for PR 30018 at commit [`a82699e`](https://github.com/apache/spark/commit/a82699eedcf1f72d89ba9fbdade6d1a411a14c4d).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708718920


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758266008


   **[Test build #133944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133944/testReport)** for PR 30018 at commit [`b415194`](https://github.com/apache/spark/commit/b41519448e1db46d6c11aeb56396c4af4b0e1c51).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509146082



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##########
@@ -97,7 +97,7 @@ object Predicate extends CodeGeneratorWithInterpretedFallback[Expression, BasePr
   }
 }
 
-trait PredicateHelper extends Logging {
+trait PredicateHelper extends Logging with AliasHelper {

Review comment:
       `extends AliasHelper with Logging`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748421658


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37647/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758270122


   **[Test build #133945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133945/testReport)** for PR 30018 at commit [`797d48f`](https://github.com/apache/spark/commit/797d48fcd76269c06f0dc832d2019bf9619bcbf2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710110026


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748872080


   **[Test build #133141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133141/testReport)** for PR 30018 at commit [`57af005`](https://github.com/apache/spark/commit/57af00504c7e985e71a60960d45c8b08f76407a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748872080


   **[Test build #133141 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133141/testReport)** for PR 30018 at commit [`57af005`](https://github.com/apache/spark/commit/57af00504c7e985e71a60960d45c8b08f76407a7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717474049


   **[Test build #130339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130339/testReport)** for PR 30018 at commit [`12d1bf4`](https://github.com/apache/spark/commit/12d1bf4a7e0a333e8d5eb19e9d12d9c6f489f5fb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758270122


   **[Test build #133945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133945/testReport)** for PR 30018 at commit [`797d48f`](https://github.com/apache/spark/commit/797d48fcd76269c06f0dc832d2019bf9619bcbf2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801086833


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801064063


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136159/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748469656


   **[Test build #133056 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133056/testReport)** for PR 30018 at commit [`33d6072`](https://github.com/apache/spark/commit/33d60727d50d8c0809305068cde2e20a62e916c4).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748474080


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133056/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748241441


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37627/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754566866


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38246/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761644397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38730/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r503518310



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -563,7 +563,7 @@ case class Range(
  *
  * @param groupingExpressions expressions for grouping keys
  * @param aggregateExpressions expressions for a project list, which could contain
- *                             [[AggregateFunction]]s.
+ *                             [[AggregateExpression]]s.

Review comment:
       This caused some confusion while making this PR




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509152188



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)
+
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    isDeterministic && upperReferencesOnlyGrouping && upperHasNoAggregateExpressions
+  }
+
+  private def isAggregate(expr: Expression): Boolean = {
+    expr.find(e => e.isInstanceOf[AggregateExpression] ||
+      PythonUDF.isGroupedAggPandasUDF(e)).isDefined
+  }
+
+  /**
+   * Replace all attributes, that reference an alias, with the aliased expression,
+   * but keep the name of the outmost attribute.
+   */
+  private def replaceAliasButKeepName(

Review comment:
       Since this function is only used in the single place, could you inline it?
   ```
       case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
         val aliasMap = getAliasMap(lower)
   
          // Replace all attributes, that reference an alias, with the aliased expression,
          // but keep the name of the outmost attribute.
         val newAggExprs = upper.aggregateExpressions.map {
           case a: Attribute if aliasMap.contains(a) =>
             Alias(replaceAlias(a, aliasMap), a.name)(a.exprId, a.qualifier)
           case expr =>
             replaceAlias(expr, aliasMap).asInstanceOf[NamedExpression]
         }
   
         upper.copy(
           child = lower.child,
           groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
           aggregateExpressions = newAggExprs)
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707388581


   Could you describe more about the "redundant" case in the PR description? e.g., plan changes before/after this PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758371312


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133945/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717474049


   **[Test build #130339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130339/testReport)** for PR 30018 at commit [`12d1bf4`](https://github.com/apache/spark/commit/12d1bf4a7e0a333e8d5eb19e9d12d9c6f489f5fb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717461074


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509003317



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       For example the UT `Remove 2 redundant aggregates`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754536440


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38246/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r505256449



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       `transformUp` -> `transform`? Seems like this transformation does not depend on the order.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r506346341



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)

Review comment:
       I think that they are now handled correctly




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r505242243



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references
+      .subsetOf(AttributeSet(lower.groupingExpressions))

Review comment:
       I think `CleanupAliases` removes aliases in `groupingExpressions` in the analyzer phase.
   
   ```
       val upperReferencesOnlyGrouping = upper.references
         .subsetOf(AttributeSet(lower.groupingExpressions))
   ```
   How about checking if upper references are a subset of non-aggregate expressions in a lower node instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710024768


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34500/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707428786


   **[Test build #129705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129705/testReport)** for PR 30018 at commit [`14f3033`](https://github.com/apache/spark/commit/14f30332656010284798e7c1de0256f10532fafa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801047743


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713328345


   **[Test build #130074 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130074/testReport)** for PR 30018 at commit [`6cdc43a`](https://github.com/apache/spark/commit/6cdc43a5c4ed6de961ed40cd0784178465d13232).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717459794






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790618359


   **[Test build #135757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135757/testReport)** for PR 30018 at commit [`37dc4b1`](https://github.com/apache/spark/commit/37dc4b11447155c8ac6a2d855e5e23ffdefda74c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748441047


   **[Test build #133056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133056/testReport)** for PR 30018 at commit [`33d6072`](https://github.com/apache/spark/commit/33d60727d50d8c0809305068cde2e20a62e916c4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790599592


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546399201



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,42 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])

Review comment:
       Yes, `Remove redundant aggregate with non-deterministic lower` covers this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748418285


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37647/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-743417809


   **[Test build #132674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132674/testReport)** for PR 30018 at commit [`12d1bf4`](https://github.com/apache/spark/commit/12d1bf4a7e0a333e8d5eb19e9d12d9c6f489f5fb).
    * This patch passes all tests.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708667178


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34359/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708696740


   **[Test build #129769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129769/testReport)** for PR 30018 at commit [`4ce0644`](https://github.com/apache/spark/commit/4ce064480be88119f3cfe880b965f1665af67bae).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710006244


   **[Test build #129895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129895/testReport)** for PR 30018 at commit [`ef64abf`](https://github.com/apache/spark/commit/ef64abfdd8916504f25e9ff5f2c7f13bcb50752d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801093557


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509003784



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)
+
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    isDeterministic && upperReferencesOnlyGrouping && upperHasNoAggregateExpressions
+  }
+
+  private def isAggregate(expr: Expression): Boolean = {
+    expr.find(e => e.isInstanceOf[AggregateExpression] ||
+      PythonUDF.isGroupedAggPandasUDF(e)).isDefined
+  }
+
+  /**
+   * Replace all attributes, that reference an alias, with the aliased expression,
+   * but keep the name of the name of the outmost attribute.

Review comment:
       nit: `the name of the name`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-767752896


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39100/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802561775


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40817/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802577269


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136235/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714186192






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790641828


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40340/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708718920






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708718591


   **[Test build #129753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129753/testReport)** for PR 30018 at commit [`29701dc`](https://github.com/apache/spark/commit/29701dc197fc552c43fe60476b6fa5ed2fef0a95).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761644397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38730/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r558952225



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.api.python.PythonEvalType
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.IntegerType
+
+class RemoveRedundantAggregatesSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches = Batch("RemoveRedundantAggregates", FixedPoint(10),
+      RemoveRedundantAggregates) :: Nil
+  }
+
+  private def aggregates(e: Expression): Seq[Expression] = {
+    Seq(
+      count(e),
+      PythonUDF("pyUDF", null, IntegerType, Seq(e),
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF, udfDeterministic = true)
+    )
+  }
+
+  test("Remove redundant aggregate") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove 2 redundant aggregates") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with different grouping") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a, 'b)('a)
+      .groupBy('a)('a)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with aliases") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c, agg)
+        .groupBy('c)('c)
+        .analyze
+      val expected = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with non-deterministic upper") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a)
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with non-deterministic lower") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .groupBy('a, 'c)('a, 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a, 'c)('a, rand(0) as 'c)

Review comment:
       LGTM now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708807514






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761697297


   **[Test build #134147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134147/testReport)** for PR 30018 at commit [`e202987`](https://github.com/apache/spark/commit/e202987ed5d6b78aa0a91264232a9a42ca0c3dbe).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748641635


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37709/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761616013


   **[Test build #134147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134147/testReport)** for PR 30018 at commit [`e202987`](https://github.com/apache/spark/commit/e202987ed5d6b78aa0a91264232a9a42ca0c3dbe).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r505638309



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       There is a slight difference - `transformUp` can handle more than two consecutive aggregates in a single iteration.
   If there are 3 aggregates `A1(A2(A3(...)))`, where `A2` and  `A3` are redundant, then with `transformDown` on first iteration `A2` is removed and then `A3` on the second iteration. With `transformUp` it removes `A3` and then `A2` on the same iteration.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508964832



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       I meant that `lowerIsRedundant(A1, A2)` would return true, If it would return false, then A2 would not be removed at all.
   
   With `transformDown` it would check `A1` and `A2` and remove `A2`, but then it would not check `A1` and `A3` on the same iteration.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710140964


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717459765


   **[Test build #130336 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130336/testReport)** for PR 30018 at commit [`67861f9`](https://github.com/apache/spark/commit/67861f9b6f709a4828adf0260acbfe5641203194).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r505252465



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)

Review comment:
       If `lower.aggregateExpressions` has aliases, I think we need to rewrite `upper.aggregateExpressions`/`upper.groupingExpressions` accordingly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509021645



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       > As a dummy example - one that counts the number of times it has been called and returns that count. Even if it is not referenced, evaluating it changes future results.
   
   I think `ColumnPruning` removes exprs not referenced by a upper node even in the case above, so we can ignore it.
   ```
   // v3.0.1
   scala> sql("select a, b from (select a, b, rand() from t group by a, b) group by a, b").explain(true)
   == Analyzed Logical Plan ==
   a: bigint, b: bigint
   Aggregate [a#0L, b#1L], [a#0L, b#1L]
   +- SubqueryAlias __auto_generated_subquery_name
      +- Aggregate [a#0L, b#1L], [a#0L, b#1L, rand(-8085205063662585668) AS rand(-8085205063662585668)#31]
         +- SubqueryAlias spark_catalog.default.t
            +- Relation[a#0L,b#1L] parquet
   
   == Optimized Logical Plan ==
   Aggregate [a#0L, b#1L], [a#0L, b#1L]
   +- Aggregate [a#0L, b#1L], [a#0L, b#1L]
      +- Relation[a#0L,b#1L] parquet
   ```
   ```
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710110045


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34505/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790641828


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40340/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509004882



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)
+
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    isDeterministic && upperReferencesOnlyGrouping && upperHasNoAggregateExpressions
+  }
+
+  private def isAggregate(expr: Expression): Boolean = {
+    expr.find(e => e.isInstanceOf[AggregateExpression] ||
+      PythonUDF.isGroupedAggPandasUDF(e)).isDefined
+  }
+
+  /**
+   * Replace all attributes, that reference an alias, with the aliased expression,
+   * but keep the name of the name of the outmost attribute.
+   */
+  private def replaceAliasButKeepOuter(

Review comment:
       `Outer`->`Name`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717508202


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34940/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754445761


   **[Test build #133657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133657/testReport)** for PR 30018 at commit [`c21dd52`](https://github.com/apache/spark/commit/c21dd5269c11d537ef2ca464d6a69bf44fdcbc29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717455802


   **[Test build #130336 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130336/testReport)** for PR 30018 at commit [`67861f9`](https://github.com/apache/spark/commit/67861f9b6f709a4828adf0260acbfe5641203194).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-767773608


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39100/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802711037


   @maropu , the checks did pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714242183


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34732/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710141010


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129899/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509025650



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       Probably, we can re-use `CollapseProject#haveCommonNonDeterministicOutput`?  Seems like the similar pre-condition to collapse aggregates, I think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714150588


   **[Test build #130116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130116/testReport)** for PR 30018 at commit [`38d7007`](https://github.com/apache/spark/commit/38d700751e6689d272129ee2922f357025289910).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714241351


   **[Test build #130116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130116/testReport)** for PR 30018 at commit [`38d7007`](https://github.com/apache/spark/commit/38d700751e6689d272129ee2922f357025289910).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r506348894



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {

Review comment:
       I think that the current one is more explicit. With your proposition it is not obvious which one is redundant. Perhaps there are situations where `upperIsRedundant`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748215362


   **[Test build #133028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133028/testReport)** for PR 30018 at commit [`6d68718`](https://github.com/apache/spark/commit/6d687181170acdc03cf46ed8182a81a88245d9c5).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546475732



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,46 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    upperReferencesOnlyGrouping && upperHasNoAggregateExpressions

Review comment:
       nit: to avoid unnecessary object creation (e.g,. `AttributeSet`), how about checking `upperHasNoAggregateExpressions` first like this?
   ```
     private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
       val upperHasNoAggregateExpressions = upper.aggregateExpressions
         .forall(_.find(isAggregate).isEmpty)
       lazy val upperRefsHaveGroupingOnly = upper.references.subsetOf(AttributeSet(
         lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
       upperHasNoAggregateExpressions && upperRefsHaveGroupingOnly 
     }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r555315461



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.api.python.PythonEvalType
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.IntegerType
+
+class RemoveRedundantAggregatesSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches = Batch("RemoveRedundantAggregates", FixedPoint(10),
+      RemoveRedundantAggregates) :: Nil
+  }
+
+  private def aggregates(e: Expression): Seq[Expression] = {
+    Seq(
+      count(e),
+      PythonUDF("pyUDF", null, IntegerType, Seq(e),
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF, udfDeterministic = true)
+    )
+  }
+
+  test("Remove redundant aggregate") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove 2 redundant aggregates") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with different grouping") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a, 'b)('a)
+      .groupBy('a)('a)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with aliases") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c, agg)
+        .groupBy('c)('c)
+        .analyze
+      val expected = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with non-deterministic upper") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a)
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with non-deterministic lower") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .groupBy('a, 'c)('a, 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a, 'c)('a, rand(0) as 'c)

Review comment:
       Hmm, shouldn't this `expected` be `.groupBy('a)('a, rand(0) as 'c)`? If this test passes currently then this PR would introduce a correctness issue.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508993694



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       > transformUp can handle more than two consecutive aggregates in a single iteration.
   
   You meant that `transformUp` can transform `A1(A2(A3(...))) => A1()` in a single iteration? Probably, I miss some conditions for the case you described above, so could you give me a concrete query example for it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509003317



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       For example the UT `Remove 2 redundant aggregates`.
   With `transformUp` it runs 2 iterations (1 extra to detect that all changes are done).
   With `transformDown` it runs 3 iterations




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713357312






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710110026






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r504889356



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references
+      .subsetOf(AttributeSet(lower.groupingExpressions))
+    val upperHasAggregateExpressions = upper.aggregateExpressions
+      .exists(_.find(_.isInstanceOf[AggregateExpression]).nonEmpty)

Review comment:
       This condition could be relaxed later on. It can be the same as `EliminateDistinct`, that currently allows `Max` and `Min` aggregates because their results are not changed if `distinct` is used on the inputs. Also there are some more aggregates, that can be added to `EliminateDistinct`: `BitAndAgg`, `BitOrAgg`, `CollectSet`, and perhaps some more.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508964832



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       I meant that `lowerIsRedundant(A1, A2)` would return true,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717461081


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34938/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546371103



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -3858,3 +3808,55 @@ object UpdateOuterReferences extends Rule[LogicalPlan] {
     }
   }
 }
+
+/**
+ * Pulls out nondeterministic expressions from LogicalPlan which is not Project or Filter,
+ * put them into an inner Project and finally project them away at the outer Project.
+ */
+object PullOutNondeterministic extends Rule[LogicalPlan] {

Review comment:
       Could you make a dedicated file for this rule?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748440412


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133047/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748448429


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37656/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713294084


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34675/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r513473158



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.MultiAlias
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}
+
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748412005


   **[Test build #133047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133047/testReport)** for PR 30018 at commit [`0d86060`](https://github.com/apache/spark/commit/0d86060acab636bd2f19dbdec4c504924b4335fa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710039315


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r555380820



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.api.python.PythonEvalType
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.IntegerType
+
+class RemoveRedundantAggregatesSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches = Batch("RemoveRedundantAggregates", FixedPoint(10),
+      RemoveRedundantAggregates) :: Nil
+  }
+
+  private def aggregates(e: Expression): Seq[Expression] = {
+    Seq(
+      count(e),
+      PythonUDF("pyUDF", null, IntegerType, Seq(e),
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF, udfDeterministic = true)
+    )
+  }
+
+  test("Remove redundant aggregate") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove 2 redundant aggregates") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with different grouping") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a, 'b)('a)
+      .groupBy('a)('a)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with aliases") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c, agg)
+        .groupBy('c)('c)
+        .analyze
+      val expected = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with non-deterministic upper") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a)
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with non-deterministic lower") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .groupBy('a, 'c)('a, 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a, 'c)('a, rand(0) as 'c)

Review comment:
       Thank you, good catch. I fixed this test, added another to check this case and added an extra condition to the optimizer. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707360803






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708890845






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801064063


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136159/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598160545



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression

Review comment:
       `introduces` -> `introduced`?
   `expression` -> `expressions`.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression

Review comment:
       - `introduces` -> `introduced`?
   - `expression` -> `expressions`.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression

Review comment:
       - `introduces` -> `introduced`
   - `expression` -> `expressions`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713366608


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34683/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598160545



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression

Review comment:
       `introduces` -> `introduced`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509010483



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       I was thinking about some stateful UDF. As a dummy example - one that counts the number of times it has been called and returns that count. Even if it is not referenced, evaluating it changes future results. 
   Or perhaps I'm overthinking it and we can ignore it? 
   Taking it a step further - perhaps only upper has to be deterministic?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu closed pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu closed pull request #30018:
URL: https://github.com/apache/spark/pull/30018


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-743420191


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132674/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790809496


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135757/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802561775


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40817/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761620276


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38730/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713357118






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748323915


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133028/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748260619


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37627/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-743420191


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132674/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714181257






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508952294



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
##########
@@ -249,6 +237,41 @@ trait PredicateHelper extends Logging {
   }
 }
 
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       How about moving this trait into a new file like `AliasHelper.scala`? It looks more general now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r512713449



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.MultiAlias
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}
+
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       okay, I've merged #30134, so please rebase it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714174981


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34723/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708580082


   Added a changed query plan sample and some TPCDS results. The change is not remarkable, but for bigger datasets it can add up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754566866


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38246/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-803503800


   Thanks for the reviews, @dongjoon-hyun ! Please open a new follow-up PR to address them, @tanelk .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758311652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714181257






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761699360


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134147/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710054486


   > **[Test build #129895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129895/testReport)** for PR 30018 at commit [`ef64abf`](https://github.com/apache/spark/commit/ef64abfdd8916504f25e9ff5f2c7f13bcb50752d).
   > 
   > * This patch **fails Spark unit tests**.
   > * This patch merges cleanly.
   > * This patch adds the following public classes _(experimental)_:
   > * `trait PredicateHelper extends Logging with AliasHelper `
   > * `trait AliasHelper `
   
   The `org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql` seems to failing on other PRs also. For example https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129890/


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714150588


   **[Test build #130116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130116/testReport)** for PR 30018 at commit [`38d7007`](https://github.com/apache/spark/commit/38d700751e6689d272129ee2922f357025289910).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714177676


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34724/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754596010


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133657/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708677807






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754445761


   **[Test build #133657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133657/testReport)** for PR 30018 at commit [`c21dd52`](https://github.com/apache/spark/commit/c21dd5269c11d537ef2ca464d6a69bf44fdcbc29).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748307853


   **[Test build #133028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133028/testReport)** for PR 30018 at commit [`6d68718`](https://github.com/apache/spark/commit/6d687181170acdc03cf46ed8182a81a88245d9c5).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713272634






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546474613



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,46 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    upperReferencesOnlyGrouping && upperHasNoAggregateExpressions
+  }
+
+  private def isAggregate(expr: Expression): Boolean = {
+    expr.find(e => e.isInstanceOf[AggregateExpression] ||
+      PythonUDF.isGroupedAggPandasUDF(e)).isDefined

Review comment:
       Could you add test cases for `PythonUDF.isGroupedAggPandasUDF(e)`, too?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,46 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(

Review comment:
       nit: how about `upperReferencesOnlyGrouping` -> `upper[Refs|References]HaveGroupingOnly`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,46 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references.subsetOf(AttributeSet(
+      lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
+
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions
+      .forall(_.find(isAggregate).isEmpty)
+
+    upperReferencesOnlyGrouping && upperHasNoAggregateExpressions

Review comment:
       nit: to avoid unnecessary object creation (e.g,. `AttributeSet`) how about reorganizing it like this?
   ```
     private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
       val upperHasNoAggregateExpressions = upper.aggregateExpressions
         .forall(_.find(isAggregate).isEmpty)
       lazy val upperRefsHaveGroupingOnly = upper.references.subsetOf(AttributeSet(
         lower.aggregateExpressions.filter(!isAggregate(_)).map(_.toAttribute)))
       upperHasNoAggregateExpressions && upperRefsHaveGroupingOnly 
     }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546220365



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -2940,56 +2940,6 @@ class Analyzer(override val catalogManager: CatalogManager)
     }
   }
 
-  /**

Review comment:
       Just moved it outside of Analyzer, so it would be accessible from outside




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717492617


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34940/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r505255176



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {

Review comment:
       nit: `lowerIsRedundant` -> `hasRedundantAggregate`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598162929



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperHasNoAggregateExpressions = !upper.aggregateExpressions.exists(isAggregate)
+
+    lazy val upperRefsOnlyDeterministicNonAgg = upper.references.subsetOf(AttributeSet(
+      lower
+        .aggregateExpressions
+        .filter(_.deterministic)
+        .filter(!isAggregate(_))

Review comment:
       ```scala
   - .filter(!isAggregate(_))
   + .filterNot(isAggregate)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r558967818



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -487,6 +488,51 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperHasNoAggregateExpressions = upper.aggregateExpressions

Review comment:
       Hm, one more thing: `val upperHasNoAggregateExpressions = !upper.aggregateExpressions.exists(isAggregate)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708718931


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129753/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801093557


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710036771


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34500/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-803227752


   okay, we have much time until the next release, so I'll merge this for now. If there are more comments, please feel free to leave them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708696740


   **[Test build #129769 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129769/testReport)** for PR 30018 at commit [`4ce0644`](https://github.com/apache/spark/commit/4ce064480be88119f3cfe880b965f1665af67bae).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748637589


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37709/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707429378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-803227950


   Thanks! Merged to master.  cc: @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598161195



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {

Review comment:
       Could you move this optimizer into a new file please, @tanelk ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754595173


   **[Test build #133657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133657/testReport)** for PR 30018 at commit [`c21dd52`](https://github.com/apache/spark/commit/c21dd5269c11d537ef2ca464d6a69bf44fdcbc29).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-767707162






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748448424


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37656/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598161683



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala
##########
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.api.python.PythonEvalType
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.IntegerType
+
+class RemoveRedundantAggregatesSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches = Batch("RemoveRedundantAggregates", FixedPoint(10),
+      RemoveRedundantAggregates) :: Nil
+  }
+
+  private def aggregates(e: Expression): Seq[Expression] = {
+    Seq(
+      count(e),
+      PythonUDF("pyUDF", null, IntegerType, Seq(e),
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF, udfDeterministic = true)
+    )
+  }
+
+  test("Remove redundant aggregate") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove 2 redundant aggregates") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with different grouping") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a, 'b)('a)
+      .groupBy('a)('a)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with aliases") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c, agg)
+        .groupBy('c)('c)
+        .analyze
+      val expected = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with non-deterministic upper") {

Review comment:
       Is this test case name, `upper`, consistent with the optimizer code in this PR?
   ```
   case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748323915


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133028/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-802450993


   hm, Jenkins still looks unstable. Could you add an empty commit to invoke GA, @tanelk?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707333076


   **[Test build #129705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129705/testReport)** for PR 30018 at commit [`14f3033`](https://github.com/apache/spark/commit/14f30332656010284798e7c1de0256f10532fafa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748474080


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133056/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509021645



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       > As a dummy example - one that counts the number of times it has been called and returns that count. Even if it is not referenced, evaluating it changes future results.
   
   For example, I think `ColumnPruning` removes exprs not referenced by a upper node even in the case above, so we can ignore it.
   ```
   // v3.0.1
   scala> sql("select a, b from (select a, b, rand() from t group by a, b) group by a, b").explain(true)
   == Analyzed Logical Plan ==
   a: bigint, b: bigint
   Aggregate [a#0L, b#1L], [a#0L, b#1L]
   +- SubqueryAlias __auto_generated_subquery_name
      +- Aggregate [a#0L, b#1L], [a#0L, b#1L, rand(-8085205063662585668) AS rand(-8085205063662585668)#31]
         +- SubqueryAlias spark_catalog.default.t
            +- Relation[a#0L,b#1L] parquet
   
   == Optimized Logical Plan ==
   Aggregate [a#0L, b#1L], [a#0L, b#1L]
   +- Aggregate [a#0L, b#1L], [a#0L, b#1L]
      +- Relation[a#0L,b#1L] parquet
   ```
   ```
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710075408


   **[Test build #129899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129899/testReport)** for PR 30018 at commit [`4bf08bb`](https://github.com/apache/spark/commit/4bf08bbaba3dc8c7431c405d705629a580da673e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748273708


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37627/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707360789


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34311/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708601830


   **[Test build #129753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129753/testReport)** for PR 30018 at commit [`29701dc`](https://github.com/apache/spark/commit/29701dc197fc552c43fe60476b6fa5ed2fef0a95).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r506346510



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references
+      .subsetOf(AttributeSet(lower.groupingExpressions))

Review comment:
       Done
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790618359


   **[Test build #135757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135757/testReport)** for PR 30018 at commit [`37dc4b1`](https://github.com/apache/spark/commit/37dc4b11447155c8ac6a2d855e5e23ffdefda74c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714242195






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509084441



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       Ah, ok. I got your point. Looks okay as it is.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748441047


   **[Test build #133056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133056/testReport)** for PR 30018 at commit [`33d6072`](https://github.com/apache/spark/commit/33d60727d50d8c0809305068cde2e20a62e916c4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717508219






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk edited a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk edited a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707332154


   I'll try do get the actual performance change for the TPCDS queries soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713272634


   **[Test build #130066 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130066/testReport)** for PR 30018 at commit [`832ff02`](https://github.com/apache/spark/commit/832ff021d8ddd2c9de7edd019c022cea53876dfc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708773480


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34375/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714241846


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714226557






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748440412


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133047/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r598160288



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -488,6 +489,50 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
+        PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
+      } else {
+        newAggregate
+      }
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {

Review comment:
       nit. Usually, `isXXX` is better and consistent with Apache Spark convention.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508964832



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {

Review comment:
       I meant that `lowerIsRedundant(A1, A2)` would return true, If it would return false, then A2 would not be removed at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710036785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714186192






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-749001447


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133141/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r558952225



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala
##########
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.api.python.PythonEvalType
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.IntegerType
+
+class RemoveRedundantAggregatesSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches = Batch("RemoveRedundantAggregates", FixedPoint(10),
+      RemoveRedundantAggregates) :: Nil
+  }
+
+  private def aggregates(e: Expression): Seq[Expression] = {
+    Seq(
+      count(e),
+      PythonUDF("pyUDF", null, IntegerType, Seq(e),
+        PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF, udfDeterministic = true)
+    )
+  }
+
+  test("Remove redundant aggregate") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove 2 redundant aggregates") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a)('a, agg)
+        .groupBy('a)('a)
+        .groupBy('a)('a)
+        .analyze
+      val expected = relation
+        .groupBy('a)('a)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with different grouping") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a, 'b)('a)
+      .groupBy('a)('a)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with aliases") {
+    val relation = LocalRelation('a.int, 'b.int)
+    for (agg <- aggregates('b)) {
+      val query = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c, agg)
+        .groupBy('c)('c)
+        .analyze
+      val expected = relation
+        .groupBy('a + 'b)(('a + 'b) as 'c)
+        .analyze
+      val optimized = Optimize.execute(query)
+      comparePlans(optimized, expected)
+    }
+  }
+
+  test("Remove redundant aggregate with non-deterministic upper") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a)
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .analyze
+    val optimized = Optimize.execute(query)
+    comparePlans(optimized, expected)
+  }
+
+  test("Remove redundant aggregate with non-deterministic lower") {
+    val relation = LocalRelation('a.int, 'b.int)
+    val query = relation
+      .groupBy('a)('a, rand(0) as 'c)
+      .groupBy('a, 'c)('a, 'c)
+      .analyze
+    val expected = relation
+      .groupBy('a, 'c)('a, rand(0) as 'c)

Review comment:
       LGTM now, conflicts need to be resolved though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714279919


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130124/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708807438


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34375/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758365346


   **[Test build #133945 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133945/testReport)** for PR 30018 at commit [`797d48f`](https://github.com/apache/spark/commit/797d48fcd76269c06f0dc832d2019bf9619bcbf2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-743344216


   **[Test build #132674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132674/testReport)** for PR 30018 at commit [`12d1bf4`](https://github.com/apache/spark/commit/12d1bf4a7e0a333e8d5eb19e9d12d9c6f489f5fb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708807514






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758266008


   **[Test build #133944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133944/testReport)** for PR 30018 at commit [`b415194`](https://github.com/apache/spark/commit/b41519448e1db46d6c11aeb56396c4af4b0e1c51).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710039315






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748436394


   **[Test build #133047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133047/testReport)** for PR 30018 at commit [`0d86060`](https://github.com/apache/spark/commit/0d86060acab636bd2f19dbdec4c504924b4335fa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710006244


   **[Test build #129895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129895/testReport)** for PR 30018 at commit [`ef64abf`](https://github.com/apache/spark/commit/ef64abfdd8916504f25e9ff5f2c7f13bcb50752d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r509023102



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       > Taking it a step further - perhaps only upper has to be deterministic?
   
   How about the case below? Seems like we can remove the lower agg?
   ```
   scala> sql("select a, b, rand() from (select a, b from t group by a, b) group by a, b").explain(true)
   == Optimized Logical Plan ==
   Aggregate [a#0L, b#1L], [a#0L, b#1L, rand(3153957949044950464) AS rand(3153957949044950464)#2]
   +- Aggregate [a#0L, b#1L], [a#0L, b#1L]
      +- Relation[a#0L,b#1L] parquet
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717461074






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748897008


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-790634788


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40340/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748448429


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37656/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748898804


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748894800


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748635973






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-717601699






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714279911


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r504939808



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -477,6 +478,25 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      upper.copy(child = lower.child)
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val upperReferencesOnlyGrouping = upper.references
+      .subsetOf(AttributeSet(lower.groupingExpressions))

Review comment:
       I do not have a strong understanding of aliases.
   
   Is there a way this can be improved to also handle aliases:
   ```
       val query = relation
         .groupBy(('a + 'b) as 'c)(('a + 'b) as 'c, count('b))
         .groupBy('c)('c)
         .analyze
       val expected = relation
         .groupBy(('a + 'b) as 'c)(('a + 'b) as 'c)
         .analyze
   ```
   
   Perhaps @maropu or @cloud-fan could advice me here 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-754554747


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38246/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710139858


   **[Test build #129899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129899/testReport)** for PR 30018 at commit [`4bf08bb`](https://github.com/apache/spark/commit/4bf08bbaba3dc8c7431c405d705629a580da673e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-801093045


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40740/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707333076


   **[Test build #129705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129705/testReport)** for PR 30018 at commit [`14f3033`](https://github.com/apache/spark/commit/14f30332656010284798e7c1de0256f10532fafa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761623667


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38730/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758371312


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133945/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748273708


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37627/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707350433


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34311/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-708890845






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-803845936


   > > The affected part of the query plan for TPCDS q87:
   > 
   > Why is the golden file of TPCDS q87 not updated in this PR?
   
   The `LeftSemi`/`LeftAnti` pushdown rule was changed while this PR was in review and the situation where this rule applied did not occure any more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713294092






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-800875597


   Anyone could check this? @cloud-fan @viirya @dongjoon-hyun @HyukjinKwon  If no one has more comments, I'll merge this into master in a few days.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-761616013


   **[Test build #134147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134147/testReport)** for PR 30018 at commit [`e202987`](https://github.com/apache/spark/commit/e202987ed5d6b78aa0a91264232a9a42ca0c3dbe).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] tanelk commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
tanelk commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-800864646


   @maropu , this has been aproved for a while now, any change, that we can merge this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758295406


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r546371763



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -485,6 +486,42 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+
+      val newAggregate = upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepName(_, aliasMap))
+      )
+
+      // We might have introduces non-deterministic grouping expression
+      PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])

Review comment:
       To avoid unnecessary checks, how about this?
   ```
   if (newAggregate.groupingExpressions.exists(!_.deterministic)) {
     PullOutNondeterministic.applyLocally.applyOrElse(newAggregate, identity[LogicalPlan])
   }
   ```
   >> // We might have introduces non-deterministic grouping expression
   
   This PR already has tests for this case?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714186177


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34724/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748412005


   **[Test build #133047 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133047/testReport)** for PR 30018 at commit [`0d86060`](https://github.com/apache/spark/commit/0d86060acab636bd2f19dbdec4c504924b4335fa).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710075408


   **[Test build #129899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129899/testReport)** for PR 30018 at commit [`4bf08bb`](https://github.com/apache/spark/commit/4bf08bbaba3dc8c7431c405d705629a580da673e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714215491


   **[Test build #130124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130124/testReport)** for PR 30018 at commit [`fab0427`](https://github.com/apache/spark/commit/fab04270ff669c7956788522bdaaced4bac25bd3).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-714226557


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713294092






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758285636


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-758311652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/38532/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-713286485


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34675/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748717762


   Looks fine. Could you review this? @cloud-fan @viirya 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r508999675



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       This check looks strict, so could you make it looser? e.g., how about the case where `lower.aggregateExpressions` has non-determinisitc exprs, but a upper node does not reference them.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748635973


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/133110/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748642990


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37709/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-710036785






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-767707162


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134515/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r510098048



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -475,6 +476,58 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
 }
 
+/**
+ * Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose
+ * only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
+ */
+object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case upper @ Aggregate(_, _, lower: Aggregate) if lowerIsRedundant(upper, lower) =>
+      val aliasMap = getAliasMap(lower)
+      upper.copy(
+        child = lower.child,
+        groupingExpressions = upper.groupingExpressions.map(replaceAlias(_, aliasMap)),
+        aggregateExpressions = upper.aggregateExpressions.map(
+          replaceAliasButKeepOuter(_, aliasMap))
+      )
+  }
+
+  private def lowerIsRedundant(upper: Aggregate, lower: Aggregate): Boolean = {
+    val isDeterministic = upper.aggregateExpressions.forall(_.deterministic) &&
+      lower.aggregateExpressions.forall(_.deterministic)

Review comment:
       Ah, yes, you're right. I've checked the related code again and I think we don't need to check if exprs are deterministic or not because of `PullOutNondeterministic`. Seems like it is okay just to check if grouping exprs in a upper node is a subset of grouping exprs in a lower node.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-707429378






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30018:
URL: https://github.com/apache/spark/pull/30018#issuecomment-748421658


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/37647/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #30018: [SPARK-33122][SQL] Remove redundant aggregates in the Optimzier

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #30018:
URL: https://github.com/apache/spark/pull/30018#discussion_r510101973



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
##########
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.analysis.MultiAlias
+import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}
+
+/**
+ * Helper methods for collecting and replacing aliases.
+ */
+trait AliasHelper {

Review comment:
       hm, how about working on this refactoring in a separate PR first? This current PR has a mix of the improvement fixes and refactoring ones. cc: @cloud-fan 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org