
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/31 11:08:55 UTC

[GitHub] [spark] beliefer opened a new pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

beliefer opened a new pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058
 
 
   ### What changes were proposed in this pull request?
   This PR is related to https://github.com/apache/spark/pull/26656.
   https://github.com/apache/spark/pull/26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
   This PR extends that feature: when one or more DISTINCT aggregate expressions operate on the same field, those DISTINCT aggregate expressions may also use the FILTER clause.
   For example:
   ```
   select sum(distinct id) filter (where sex = 'man') from student;
   select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
   select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
   select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
   select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
   select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
   select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
   ```
   This PR does not support the following:
   ```
   select class_id, count(distinct sex), sum(distinct id) filter (where sex = 'man') from student group by class_id;
   select class_id, count(distinct sex) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
   ```
   These cases are left for later work in https://issues.apache.org/jira/browse/SPARK-30396.
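
To make the intended semantics of the supported queries concrete, here is a minimal sketch in plain Scala collections, not Spark code. It assumes standard SQL behaviour: the FILTER predicate selects the rows fed to the aggregate, DISTINCT is applied afterwards, and SUM over no rows yields NULL (modelled as `None`). The `Student` case class and `sumDistinctIdOfMen` are hypothetical names introduced only for illustration.

```scala
object FilteredDistinctSketch {
  // Hypothetical in-memory stand-in for the `student` table used in the examples.
  final case class Student(id: Int, sex: String, classId: Int)

  // Models: select class_id, sum(distinct id) filter (where sex = 'man')
  //         from student group by class_id
  def sumDistinctIdOfMen(students: Seq[Student]): Map[Int, Option[Long]] =
    students.groupBy(_.classId).map { case (classId, group) =>
      // FILTER first, then DISTINCT, then SUM.
      val ids = group.filter(_.sex == "man").map(_.id.toLong).distinct
      // SUM over an empty input is NULL in SQL; None stands in for NULL here.
      classId -> (if (ids.isEmpty) None else Some(ids.sum))
    }
}
```

A class with no matching rows still appears in the result, with `None` as its sum, because the filter restricts only the input of that one aggregate and not the grouping itself.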
   
   ### Why are the changes needed?
   No
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   New UT
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-569942377
 
 
   **[Test build #115987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115987/testReport)** for PR 27058 at commit [`0008bae`](https://github.com/apache/spark/commit/0008bae5ff27a6766b9ceb3bb067eac2e8e322dc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576748173
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364234459
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,29 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val aggExpressions = aggregateExprs.flatMap { e =>
+              e.collect {
+                case ae: AggregateExpression => ae
+              }
+            }
+            val aggWithDistinctAndFilters = aggExpressions.filter(e => e.isDistinct)
+            val aggGroups = aggWithDistinctAndFilters.groupBy { e =>
 
 Review comment:
   nit: `aggGroups` -> `distinctAggGroups`


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368805334
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,66 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat1_cnt2, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat1, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule serves two purposes:
+ * 1. Expand distinct aggregates which exists filter clause.
+ * 2. Rewrite when aggregate exists at least two distinct aggregates.
+ *
+ * The first child rule does the following things here:
+ * 1. Guaranteed to compute filter clause locally.
 
 Review comment:
   What does `compute filter clause locally` mean?


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570140108
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573077780
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116490/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570184658
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20825/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579133604
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22231/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577519811
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22041/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508203
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21680/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572864751
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578698892
 
 
   **[Test build #117436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117436/testReport)** for PR 27058 at commit [`7a74aae`](https://github.com/apache/spark/commit/7a74aae09f8f696102c5b92b850d572d64fd9cb1).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520166
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577670226
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367752087
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
 
 Review comment:
   How about `extractFiltersInDistinctAggregate`?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575129915
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116827/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633620
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117434/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581008795
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22475/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577519805
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369315269
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,64 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat2, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule consists of the two phases as follows.
+ *
+ * In the first phase, expands distinct aggregates which exists filter clause:
+ * 1. Guaranteed to compute filter clauses in the first aggregate locally.
+ * 2. The attributes referenced by different distinct aggregate expressions are likely to overlap,
+ *    and if no additional processing is performed, data loss will occur. To prevent this, we
+ *    generate new attributes and replace the original ones.
+ * 3. If we apply the first phase to distinct aggregate expressions which exists filter
+ *    clause, the aggregate after expand may have at least two distinct aggregates, so we need to
+ *    apply the second phase too. Please refer to the second phase for more details.
+ *
+ * In the second phase, rewrite when aggregate exists at least two distinct aggregates:
  * 1. Expand the data. There are three aggregation groups in this query:
 
 Review comment:
   It seems this statement refers to the second example? If so, the current position looks weird...
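
For readers following the two phases described in the quoted Scaladoc, the sketch below mimics the third example's pseudo plan with plain Scala collections. It is only an illustration under stated assumptions, not the rule's implementation: `Row`, `Expanded`, `Result`, `direct` and `rewritten` are hypothetical names, `None` stands in for SQL NULL, and the input columns are assumed to be non-null.

```scala
object DistinctFilterRewriteSketch {
  // Hypothetical input row; column names follow the example query in the diff.
  final case class Row(key: String, id: Int, cat1: String, cat2: String, value: Long)

  // Row shape after both Expand operators; None stands in for SQL NULL.
  final case class Expanded(
      key: String, d1: Option[String], d2: Option[String], gid: Int, value: Option[Long])

  final case class Result(key: String, cat1Cnt: Int, cat2Cnt: Int, total: Long)

  // Direct evaluation of the original query:
  //   COUNT(DISTINCT cat1) FILTER (WHERE id > 1), COUNT(DISTINCT cat2), SUM(value)
  def direct(rows: Seq[Row]): Set[Result] =
    rows.groupBy(_.key).map { case (k, g) =>
      Result(
        k,
        g.filter(_.id > 1).map(_.cat1).distinct.size,
        g.map(_.cat2).distinct.size,
        g.map(_.value).sum)
    }.toSet

  // Evaluation that follows the shape of the rewritten pseudo plan.
  def rewritten(rows: Seq[Row]): Set[Result] = {
    val expanded = rows.flatMap { r =>
      // First phase: the filter becomes a projection, if (id > 1) cat1 else null.
      val d1 = if (r.id > 1) Some(r.cat1) else None
      val d2 = Some(r.cat2)
      // Second phase: one output row per aggregation group, tagged with gid.
      Seq(
        Expanded(r.key, None, None, 0, Some(r.value)),
        Expanded(r.key, d1, None, 1, None),
        Expanded(r.key, None, d2, 2, None))
    }
    // First aggregate: group by (key, d1, d2, gid) and sum the regular column.
    val partial = expanded.groupBy(e => (e.key, e.d1, e.d2, e.gid)).map {
      case ((k, d1, d2, gid), g) => (k, d1, d2, gid, g.flatMap(_.value).sum)
    }.toSeq
    // Second aggregate: group by key; count non-null distinct values per gid
    // and pick the total from the gid = 0 group.
    partial.groupBy(_._1).map { case (k, g) =>
      Result(
        k,
        g.count(p => p._4 == 1 && p._2.isDefined),
        g.count(p => p._4 == 2 && p._3.isDefined),
        g.collectFirst { case (_, _, _, 0, total) => total }.getOrElse(0L))
    }.toSet
  }
}
```

Because the first aggregate groups by the generated columns, each distinct value survives exactly once per `gid`, so a plain count of non-null values in the second aggregate gives the distinct count. Rows rejected by the filter contribute only a null, which the count ignores.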


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578758706
 
 
   **[Test build #117442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117442/testReport)** for PR 27058 at commit [`5c80418`](https://github.com/apache/spark/commit/5c804182b498c0e804cb0c3157d1dbd31d182e24).


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r366171011
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
         }.asInstanceOf[NamedExpression]
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
+    } else if (distinctAggGroups.size == 1) {
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            // Why do we need to construct the phantom id?
+            // First, to reduce costs, it is better to handle the filter clause locally.
+            // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate the expression
+            // If(id > 1) 'a else null first, and use the result as output.
+            // Second, if more than one DISTINCT aggregate expression uses the same column,
+            // we need to construct phantom attributes so that the output is not lost.
+            // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+            // attribute '_gen_distinct_1 and attribute '_gen_distinct_2 instead of two 'a.
+            // Note: Only the expressions with a filter clause need this substitution.
+            // The phantom attributes may cause the distinct aggregations to use
+            // different columns, so we still need to call `rewrite`.
+            val phantomId = NamedExpression.newExprId.id
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"_gen_distinct_$phantomId", e.dataType, nullable = true)())
+            }
+            val exprAttrLookup = exprAttrs.toMap
+            val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = None)
+            // Expand projection
+            val projection = unfoldableChildren.map {
+              case e if filter.isDefined => If(filter.get, e, nullify(e))
+              case e => e
+            }
+            (projection, exprAttrs, (ae, aggExpr))
+        }
+        val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.flatMap(_._2)
+        val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2)
+        val allAggAttrs = regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs
+        // Construct the aggregate input projection.
+        val rewriteDistinctProjections = rewriteDistinctOperatorMap.flatMap(_._1)
+        val rewriteAggProjections =
+          Seq((a.groupingExpressions ++ regularAggChildren ++ rewriteDistinctProjections))
+        val groupByMap = a.groupingExpressions.collect {
+          case ne: NamedExpression => ne -> ne.toAttribute
+          case e => e -> AttributeReference(e.sql, e.dataType, e.nullable)()
+        }
+        val groupByAttrs = groupByMap.map(_._2)
+        // Construct the expand operator.
+        val expand = Expand(rewriteAggProjections, groupByAttrs ++ allAggAttrs, a.child)
+        val rewriteAggExprLookup =
+          (rewriteDistinctOperatorMap.map(_._3) ++ regularOperatorMap).toMap
+        val patchedAggExpressions = a.aggregateExpressions.map { e =>
+          e.transformDown {
+            case ae: AggregateExpression => rewriteAggExprLookup.getOrElse(ae, ae)
+          }.asInstanceOf[NamedExpression]
+        }
+        val expandAggregate = Aggregate(groupByAttrs, patchedAggExpressions, expand)
+        if (distinctAggExpressions.size > 1) {
+          rewrite(expandAggregate)
 
 Review comment:
   Thanks!
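
For readers following the hunk above, here is a minimal sketch of the phantom-attribute idea using a toy model rather than Catalyst classes. `Attr`, `DistinctAgg`, and the string-based projections are hypothetical stand-ins for `AttributeReference`, `AggregateExpression`, and `If(filter, e, null)`; the counter stands in for `NamedExpression.newExprId`.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy stand-ins for Catalyst types (assumptions, not Spark APIs).
final case class Attr(name: String)
final case class DistinctAgg(fn: String, child: Attr, filter: Option[String])

object PhantomAttrs {
  private val nextId = new AtomicLong(0)

  // For each distinct aggregate: allocate a fresh attribute, fold the filter
  // into the projection that feeds it, and drop the filter from the aggregate.
  def rewrite(aggs: Seq[DistinctAgg]): Seq[(DistinctAgg, Attr, String)] =
    aggs.map { agg =>
      val phantom = Attr(s"_gen_distinct_${nextId.incrementAndGet()}")
      val projection = agg.filter match {
        case Some(cond) => s"if ($cond) ${agg.child.name} else null"
        case None       => agg.child.name
      }
      (agg.copy(child = phantom, filter = None), phantom, projection)
    }
}
```

With `SUM(DISTINCT a)` and `COUNT(DISTINCT a) FILTER (WHERE id > 1)` as input, the two aggregates end up reading two different generated attributes even though both originally referenced `a`, which is the situation the comment in the hunk describes.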


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577519805
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572820000
 
 
   retest this please


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573040868
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574943784
 
 
   **[Test build #116803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116803/testReport)** for PR 27058 at commit [`c6392db`](https://github.com/apache/spark/commit/c6392db3fbedd7442d27d263b052e0fd940f6d32).


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r366171505
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,6 +118,52 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) AS cat1_cnt1,
+ *     COUNT(DISTINCT cat1) AS cat1_cnt2,
+ *     SUM(value) AS total
+ *   FROM
+ *     data
+ *   GROUP BY
+ *     key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
 
 Review comment:
   Probably, this is related to [the comment](https://github.com/apache/spark/pull/27058#discussion_r366139379). If we avoid the recursive call, I think we can have a chance to merge them.
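
The rewritten plan in the third example can be traced with ordinary collections. The sketch below is an illustration only, assuming hypothetical `In`, `Expanded`, and `Partial` types and sample data; it folds the inner Expand (the `if ('id > 1) 'cat1 else null` projection) and the outer Expand (the three `gid` projections) into one function, then applies the two aggregates in order.

```scala
// Hypothetical row types; None models SQL null.
final case class In(key: String, id: Int, cat1: String, value: Long)
final case class Expanded(key: String, g1: Option[String], g2: Option[String],
                          gid: Int, value: Option[Long])
final case class Partial(key: String, g1: Option[String], g2: Option[String],
                         gid: Int, total: Long)

object ExpandSketch {
  val data: Seq[In] = Seq(
    In("k", 1, "a", 10L), In("k", 2, "a", 20L),
    In("k", 3, "b", 30L), In("k", 0, "c", 40L))

  // One output row per projection: gid 0 carries 'value, gid 1 and 2 carry
  // the two generated distinct attributes.
  def expand(rows: Seq[In]): Seq[Expanded] = rows.flatMap { r =>
    val g1 = if (r.id > 1) Some(r.cat1) else None // filter evaluated locally
    val g2 = Some(r.cat1)
    Seq(
      Expanded(r.key, None, None, 0, Some(r.value)),
      Expanded(r.key, g1, None, 1, None),
      Expanded(r.key, None, g2, 2, None))
  }

  // Returns key -> (cat1_cnt1, cat1_cnt2, total).
  def aggregate(rows: Seq[In]): Map[String, (Long, Long, Long)] = {
    // First aggregate: group by (key, g1, g2, gid) and sum 'value.
    val first: Seq[Partial] = expand(rows)
      .groupBy(e => (e.key, e.g1, e.g2, e.gid)).toSeq
      .map { case ((key, g1, g2, gid), es) =>
        Partial(key, g1, g2, gid, es.flatMap(_.value).sum)
      }
    // Second aggregate: group by key; count non-null generated attributes per gid.
    first.groupBy(_.key).map { case (key, ps) =>
      val cnt1 = ps.count(p => p.gid == 1 && p.g1.isDefined).toLong
      val cnt2 = ps.count(p => p.gid == 2 && p.g2.isDefined).toLong
      val total = ps.collectFirst { case p if p.gid == 0 => p.total }.getOrElse(0L)
      key -> ((cnt1, cnt2, total))
    }
  }
}
```

Grouping by the generated attributes in the first aggregate is what deduplicates the values, so the second aggregate can use plain counts.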


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575146628
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633612
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572445726
 
 
   **[Test build #116370 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116370/testReport)** for PR 27058 at commit [`bcca92d`](https://github.com/apache/spark/commit/bcca92d14d13531bd02ab9f9eb19afe001f8754e).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520076
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116907/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572543838
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116370/
   Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572542938
 
 
   **[Test build #116370 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116370/testReport)** for PR 27058 at commit [`bcca92d`](https://github.com/apache/spark/commit/bcca92d14d13531bd02ab9f9eb19afe001f8754e).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575031809
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575034151
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21599/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575031741
 
 
   **[Test build #116825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116825/testReport)** for PR 27058 at commit [`a83efcf`](https://github.com/apache/spark/commit/a83efcf57021167bf9829f9f1ee2039ea9e86213).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class CSVFilters(filters: Seq[sources.Filter], requiredSchema: StructType) `


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573140754
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576645591
 
 
   **[Test build #117179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117179/testReport)** for PR 27058 at commit [`7a6b019`](https://github.com/apache/spark/commit/7a6b0197c1f32f6ad0d843b688b8abcc97344935).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577670237
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117291/
   Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579116033
 
 
   **[Test build #117467 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117467/testReport)** for PR 27058 at commit [`8f9626b`](https://github.com/apache/spark/commit/8f9626be355d52f0ec6e0ed6750d438f9b0b0866).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573077769
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573043743
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] beliefer commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579032752
 
 
   retest this please


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578645101
 
 
   **[Test build #117436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117436/testReport)** for PR 27058 at commit [`7a74aae`](https://github.com/apache/spark/commit/7a74aae09f8f696102c5b92b850d572d64fd9cb1).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620963
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21553/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-569912064
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633674
 
 
   Build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369536759
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,64 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat2, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
 + * The rule consists of the following two phases.
 + *
 + * In the first phase, expand distinct aggregates that have a filter clause:
 + * 1. Filter clauses are guaranteed to be computed locally in the first aggregate.
 + * 2. The attributes referenced by different distinct aggregate expressions are likely to overlap,
 + *    and if no additional processing is performed, data loss will occur. To prevent this, we
 + *    generate new attributes and replace the original ones.
 + * 3. If we apply the first phase to distinct aggregate expressions that have a filter
 + *    clause, the aggregate after the expand may have at least two distinct aggregates, so we
 + *    need to apply the second phase too. Please refer to the second phase for more details.
 + *
 + * In the second phase, rewrite when the aggregate has at least two distinct aggregates:
  * 1. Expand the data. There are three aggregation groups in this query:
 
 Review comment:
   Really? I think the statements below point to the second query: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala#L78-L119
   ```
    *    i. the non-distinct group;
    *    ii. the distinct 'cat1 group;
    *    iii. the distinct 'cat2 group.
   ```
   
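For readers following the third example in the doc comment, the effect of the first phase can be sketched with plain Scala collections instead of Catalyst classes. This is only an illustration under stated assumptions: the `Row` case class, the object name, and the sample data are hypothetical. The only idea taken from the diff is projecting `if ('id > 1) 'cat1 else null` before running an unfiltered distinct count.

```scala
object FilteredDistinctSketch {
  // Hypothetical input row; column names follow the doc comment's example.
  final case class Row(key: String, id: Int, cat1: String)

  // COUNT(DISTINCT cat1) FILTER (WHERE id > 1) ... GROUP BY key, computed directly.
  def direct(rows: Seq[Row]): Map[String, Int] =
    rows.groupBy(_.key).map { case (k, rs) =>
      k -> rs.filter(_.id > 1).map(_.cat1).distinct.size
    }

  // Same result via the rewrite idea: fold the filter into the projected column
  // (None plays the role of null), then run an unfiltered COUNT(DISTINCT ...),
  // which skips nulls.
  def rewritten(rows: Seq[Row]): Map[String, Int] =
    rows
      .map(r => (r.key, if (r.id > 1) Some(r.cat1) else None))
      .groupBy(_._1)
      .map { case (k, rs) => k -> rs.flatMap(_._2).distinct.size }

  // Hypothetical sample data.
  val sample: Seq[Row] = Seq(
    Row("a", 1, "x"), Row("a", 2, "y"), Row("a", 3, "y"), Row("b", 1, "z"))
}
```

Both functions should agree on any input, which is the property the first phase relies on: once the filter is folded into the projection, the distinct aggregate no longer needs its own FILTER clause.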


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575028437
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575034151
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21599/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581022469
 
 
   Build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367727323
 
 

 ##########
 File path: sql/core/src/test/resources/sql-tests/results/postgreSQL/groupingsets.sql.out
 ##########
 @@ -444,10 +444,29 @@ NULL	1
 
 
 -- !query 34
-select count(*) from gstest4 group by rollup(unhashable_col,unsortable_col)
+select ten, sum(distinct four) filter (where string(four) like '123') from onek a
+group by rollup(ten)
 -- !query 34 schema
-struct<count(1):bigint>
+struct<ten:int,sum(DISTINCT four) FILTER (WHERE CAST(four AS STRING) LIKE 123):bigint>
 -- !query 34 output
 
 Review comment:
   Checked, too: https://github.com/postgres/postgres/blob/0fca3d0a4ec297bff5a5cb01dfe345e0f63d7d63/src/test/regress/expected/groupingsets.out#L822


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364580011
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
 
 Review comment:
   The phantom mechanism may result in multiple distinct aggregations that use different columns, so we still need to call `rewrite` (e.g., `count(distinct phantom1-a)` and `count(distinct phantom2-a)`).
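
A minimal sketch of why the phantom renaming can turn one distinct group into several, assuming a simplified model of a distinct aggregate. `DistinctAgg`, `phantomize`, and `distinctGroups` are hypothetical names introduced only for illustration; the `phantom<id>-<child>` naming with ids starting at 1 follows the diff above.

```scala
object PhantomGroupsSketch {
  // Hypothetical, simplified distinct aggregate: function name, child column, optional filter.
  final case class DistinctAgg(fn: String, child: String, filter: Option[String])

  // Give every distinct aggregate its own generated child attribute and drop the
  // filter, which is assumed to have been folded into the Expand projection.
  def phantomize(aggs: Seq[DistinctAgg]): Seq[DistinctAgg] =
    aggs.zipWithIndex.map { case (agg, i) =>
      agg.copy(child = s"phantom${i + 1}-${agg.child}", filter = None)
    }

  // Number of distinct groups, modelled as the number of different child columns.
  def distinctGroups(aggs: Seq[DistinctAgg]): Int = aggs.map(_.child).distinct.size
}
```

With `count(distinct a)` and `count(distinct a) filter (where id > 1)`, the input has one distinct group, but after `phantomize` the children are `phantom1-a` and `phantom2-a`, so there are two groups and the multi-distinct rewrite is still required.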


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364515324
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
+            }
+            val exprAttrLookup = exprAttrs.toMap
+            val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = None)
+            // Expand projection
+            val projection = unfoldableChildren.map {
+              case e if filter.isDefined => If(filter.get, e, nullify(e))
+              case e => e
+            }
+            (projection, exprAttrs, (ae, aggExpr))
+        }
+        val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.map(_._2).flatten
+        val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2)
+        val allAggAttrs = (regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs)
+        // Construct the aggregate input projection.
+        val rewriteDistinctProjections = rewriteDistinctOperatorMap.map(_._1).flatten
 
 Review comment:
   ditto: `.flatMap(_._1)`
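
The suggestion is to collapse a `.map(...).flatten` chain into a single `flatMap`. A small sketch of the equivalence, using a hypothetical stand-in for `rewriteDistinctOperatorMap` in which the projection and attribute elements are simplified to strings:

```scala
object FlatMapSketch {
  // Hypothetical stand-in: each element is a (projection, attributes) pair,
  // simplified to sequences of strings.
  val operatorMap: Seq[(Seq[String], Seq[String])] = Seq(
    (Seq("if (f1) a else null"), Seq("phantom1-a")),
    (Seq("if (f2) b else null"), Seq("phantom2-b")))

  // Style used in the diff.
  val projectionsViaFlatten: Seq[String] = operatorMap.map(_._1).flatten
  val attrsViaFlatten: Seq[String] = operatorMap.map(_._2).flatten

  // Equivalent single-pass form suggested in the review.
  val projectionsViaFlatMap: Seq[String] = operatorMap.flatMap(_._1)
  val attrsViaFlatMap: Seq[String] = operatorMap.flatMap(_._2)
}
```

Both forms yield the same sequence; `flatMap` simply avoids building the intermediate nested collection.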


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620963
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21553/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581008791
 
 
   Build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367741733
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
 
 Review comment:
   Can you describe more here? Something like `Replaces the existing references with new ones to avoid collisions between distinct and regular aggregate children...`


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578758706
 
 
   **[Test build #117442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117442/testReport)** for PR 27058 at commit [`5c80418`](https://github.com/apache/spark/commit/5c804182b498c0e804cb0c3157d1dbd31d182e24).


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578645101
 
 
   **[Test build #117436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117436/testReport)** for PR 27058 at commit [`7a74aae`](https://github.com/apache/spark/commit/7a74aae09f8f696102c5b92b850d572d64fd9cb1).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620954
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574963605
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369338549
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +205,102 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // We need at least two distinct aggregates or a single distinct aggregate with a filter for
+    // this rule because aggregation strategy can handle a single distinct group without a filter.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregates(a)
+      rewriteDistinctAggregates(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregates(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Construct pairs of old and new expressions for the regular aggregates. Because we
+      // will construct a new aggregate, the children of the distinct aggregates will be
+      // replaced with generated ones, so we need to create new references to avoid collisions
+      // between distinct and regular aggregate children.
 
 Review comment:
   OK
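
The condition described in the updated code comment above (at least two distinct aggregates, or a single distinct aggregate with a filter) can be sketched in plain Scala as follows. `AggExpr` and `mayNeedToRewrite` are hypothetical stand-ins for illustration, not Catalyst classes. Note that the quoted hunk tests `distinctAggs.size >= 1`, which already holds whenever the `exists` clause does; the sketch follows the wording of the comment and uses `> 1`, which is an assumption about the intent.

```scala
object MayNeedRewriteSketch {
  // Hypothetical stand-in for Catalyst's AggregateExpression; only the
  // properties the check looks at are modelled here.
  final case class AggExpr(fn: String, isDistinct: Boolean, filter: Option[String])

  // Rewrite when there are at least two distinct aggregates, or when any
  // distinct aggregate carries a FILTER clause (assumed intent of the comment).
  def mayNeedToRewrite(exprs: Seq[AggExpr]): Boolean = {
    val distinctAggs = exprs.filter(_.isDistinct)
    distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefined)
  }
}
```

Under this reading, a lone `sum(distinct id)` is left to the regular aggregation strategy, while `sum(distinct id) filter (where sex = 'man')` triggers the rule.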


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369340309
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+          val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+          val filterOpt = filter.map(_.transform {
+            case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+          })
+          val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+          (ae, aggExpr)
+      }
+
+      // Setup expand for the distinct aggregate expressions.
+      val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val rewriteDistinctOperatorMap = distinctAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          // Why do we need to construct the phantom id?
+          // First, to reduce cost, it is better to handle the filter clause locally.
+          // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
+          // If(id > 1) 'a else null, and use the result as output.
+          // Second, if more than one DISTINCT aggregate expression uses the same column,
+          // we need to construct phantom attributes so that no output is lost.
+          // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+          // attribute '_gen_distinct-1 and attribute '_gen_distinct-2 instead of two 'a.
+          // Note: only the expressions with a filter clause need phantom attributes.
+          // The phantom attributes may cause multiple distinct aggregates to use
+          // different columns, so we still need to call `rewrite`.
 
 Review comment:
   OK. I will update it.
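
The idea in the code comment above, evaluating `If(id > 1) 'a else null` first and aggregating the result, can be illustrated with ordinary Scala collections. This is only a sketch under stated assumptions: `Row`, `countDistinctWithFilter`, and `countDistinctViaProjection` are hypothetical names, and `None` is used to model SQL NULL, which `COUNT(DISTINCT ...)` ignores.

```scala
object FilterAsProjectionSketch {
  // Hypothetical row with the two columns used in the comment's example.
  final case class Row(id: Int, a: Option[Int])

  // COUNT(DISTINCT a) FILTER (WHERE id > 1), evaluated directly:
  // drop rows failing the predicate, drop NULLs, count distinct values.
  def countDistinctWithFilter(rows: Seq[Row]): Int =
    rows.filter(_.id > 1).flatMap(_.a).distinct.size

  // The same aggregate after folding the filter into a projection:
  // rows failing the predicate contribute NULL (None) instead of `a`.
  def countDistinctViaProjection(rows: Seq[Row]): Int = {
    val projected: Seq[Option[Int]] = rows.map(r => if (r.id > 1) r.a else None)
    projected.flatten.distinct.size
  }
}
```

Both functions agree because the rows rejected by the filter become NULL and are then skipped by the distinct count, which is why the filter can be handled locally before the aggregate.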


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577519423
 
 
   **[Test build #117281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117281/testReport)** for PR 27058 at commit [`4fa0dd8`](https://github.com/apache/spark/commit/4fa0dd89ed36eefe8c7f12e39b23eb666a351673).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570250683
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578620862
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572920132
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570117792
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572386876
 
 
   **[Test build #116345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116345/testReport)** for PR 27058 at commit [`bcca92d`](https://github.com/apache/spark/commit/bcca92d14d13531bd02ab9f9eb19afe001f8754e).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576646057
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508198
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367724105
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+          val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+          val filterOpt = filter.map(_.transform {
+            case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+          })
+          val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+          (ae, aggExpr)
+      }
+
+      // Setup expand for the distinct aggregate expressions.
+      val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val rewriteDistinctOperatorMap = distinctAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          // Why do we need to construct the phantom id?
+          // First, to reduce cost, it is better to handle the filter clause locally.
+          // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
+          // If(id > 1) 'a else null, and use the result as output.
+          // Second, if more than one DISTINCT aggregate expression uses the same column,
+          // we need to construct phantom attributes so that no output is lost.
+          // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+          // attribute '_gen_distinct-1 and attribute '_gen_distinct-2 instead of two 'a.
+          // Note: only the expressions with a filter clause need phantom attributes.
+          // The phantom attributes may cause multiple distinct aggregates to use
+          // different columns, so we still need to call `rewrite`.
+          val phantomId = NamedExpression.newExprId.id
+          val unfoldableChildren = af.children.filter(!_.foldable)
+          val exprAttrs = unfoldableChildren.map { e =>
+            (e, AttributeReference(s"_gen_distinct_$phantomId", e.dataType, nullable = true)())
 
 Review comment:
   nit: `phantomId` -> `exprId`
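
For context on the naming nit, the hunk builds one generated attribute per fresh id so that two distinct aggregates over the same column do not collide. A minimal, self-contained sketch of that pattern follows; `freshId`, `Attr`, and `genDistinctAttr` are hypothetical stand-ins (the counter only imitates what `NamedExpression.newExprId` provides in the quoted code), and the local is named `exprId` as the nit suggests.

```scala
import java.util.concurrent.atomic.AtomicLong

object GeneratedAttrSketch {
  // Hypothetical id source; stands in for NamedExpression.newExprId.
  private val counter = new AtomicLong(0L)
  def freshId(): Long = counter.getAndIncrement()

  // Hypothetical, simplified attribute: just a name and the id it was built from.
  final case class Attr(name: String, id: Long)

  // Each call yields a distinct `_gen_distinct_<id>` attribute, so two
  // aggregates over the same source column get different generated names.
  def genDistinctAttr(): Attr = {
    val exprId = freshId()
    Attr(s"_gen_distinct_$exprId", exprId)
  }
}
```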


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572864514
 
 
   **[Test build #116445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116445/testReport)** for PR 27058 at commit [`dd09645`](https://github.com/apache/spark/commit/dd096453c640335fe7af9f8b89eac296ee54615c).


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573043256
 
 
   **[Test build #116491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116491/testReport)** for PR 27058 at commit [`45d6971`](https://github.com/apache/spark/commit/45d69714a7a1bb6dc8a05a11a2b885174f5d3fca).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633674
 
 
   Build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579220034
 
 
   **[Test build #117472 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117472/testReport)** for PR 27058 at commit [`8f9626b`](https://github.com/apache/spark/commit/8f9626be355d52f0ec6e0ed6750d438f9b0b0866).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] amanomer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

amanomer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r362714875
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,29 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val aggExpressions = aggregateExprs.flatMap { e =>
+              e.collect {
+                case ae: AggregateExpression => ae
+              }
+            }
+            val aggWithDistinctAndFilters = aggExpressions.filter(e => e.isDistinct)
 
 Review comment:
   Why is this named `aggWithDistinctAndFilters` when it only checks for distinct?
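
The mismatch raised in this comment can be seen in a small sketch. `AggExpr` and both helper functions below are hypothetical and only model the two properties involved: the predicate in the quoted hunk matches the first helper, while the variable name suggests the second.

```scala
object DistinctFilterCheckSketch {
  // Hypothetical stand-in for Catalyst's AggregateExpression.
  final case class AggExpr(fn: String, isDistinct: Boolean, filter: Option[String])

  // What the quoted hunk computes: every distinct aggregate, filtered or not.
  def distinctAggs(exprs: Seq[AggExpr]): Seq[AggExpr] =
    exprs.filter(_.isDistinct)

  // What the name `aggWithDistinctAndFilters` suggests: distinct aggregates
  // that also carry a FILTER clause.
  def distinctAggsWithFilter(exprs: Seq[AggExpr]): Seq[AggExpr] =
    exprs.filter(e => e.isDistinct && e.filter.isDefined)
}
```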


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575033417
 
 
   **[Test build #116827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116827/testReport)** for PR 27058 at commit [`a83efcf`](https://github.com/apache/spark/commit/a83efcf57021167bf9829f9f1ee2039ea9e86213).


[GitHub] [spark] maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572906391
 
 
   Ah, also, can you put a simple EXPLAIN example (showing how a plan with distinct aggregates is converted) in the PR description? It would be better to describe how this PR fixes the issue there, too.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508244
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581022473
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117714/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578699422
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577519423
 
 
   **[Test build #117281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117281/testReport)** for PR 27058 at commit [`4fa0dd8`](https://github.com/apache/spark/commit/4fa0dd89ed36eefe8c7f12e39b23eb666a351673).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578645866
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572387262
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570117654
 
 
   **[Test build #116012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116012/testReport)** for PR 27058 at commit [`885e2f6`](https://github.com/apache/spark/commit/885e2f620a0d889f142ad3b1bef3d48dbcf28369).


[GitHub] [spark] beliefer commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572915374
 
 
   > btw, we need a different approach for supporting multiple distinct groups (SPARK-30396)? Why did you split the distinct support into two?
   
   This PR will support
   `select a, sum(distinct b) filter (where ...) from t group by a;`
   There is only one DISTINCT aggregate expression, so there is only one column that a DISTINCT aggregate function acts on.
   `select a, sum(distinct b) filter (where ...), count(distinct b) filter (where ...) from t group by a;`
   There are two DISTINCT aggregate expressions, but both aggregate functions act on the same column.
   SPARK-30396 will support
   `select a, sum(distinct b) filter (where ...), count(distinct c) filter (where ...) from t group by a;`
   There are two DISTINCT aggregate expressions, and their aggregate functions act on different columns.
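
The split between the two tickets can be sketched with a toy model. This is only an illustration: `Agg`, `distinctGroups`, and `singleDistinctGroup` are hypothetical names rather than Catalyst classes, and they only loosely echo the `distinctAggGroups.size == 1` check that appears in the diff for this PR.

```scala
// Toy model, not Spark's Catalyst classes: every name here is hypothetical.
object DistinctGroups {
  final case class Agg(
      func: String,
      columns: Set[String],
      isDistinct: Boolean,
      filter: Option[String])

  // Group the DISTINCT aggregates by the set of columns they act on.
  def distinctGroups(aggs: Seq[Agg]): Map[Set[String], Seq[Agg]] =
    aggs.filter(_.isDistinct).groupBy(_.columns)

  // True when all DISTINCT aggregates act on one and the same column set.
  def singleDistinctGroup(aggs: Seq[Agg]): Boolean =
    distinctGroups(aggs).size == 1

  def main(args: Array[String]): Unit = {
    // sum(distinct b) filter (...), count(distinct b) filter (...)
    val sameColumn = Seq(
      Agg("sum", Set("b"), isDistinct = true, filter = Some("...")),
      Agg("count", Set("b"), isDistinct = true, filter = Some("...")))
    // sum(distinct b) filter (...), count(distinct c) filter (...)
    val differentColumns = Seq(
      Agg("sum", Set("b"), isDistinct = true, filter = Some("...")),
      Agg("count", Set("c"), isDistinct = true, filter = Some("...")))
    println(singleDistinctGroup(sameColumn))
    println(singleDistinctGroup(differentColumns))
  }
}
```

In this sketch the first two queries above fall into the single-group case, while the SPARK-30396 query produces two groups.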


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364545222
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
+            }
+            val exprAttrLookup = exprAttrs.toMap
+            val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = None)
+            // Expand projection
+            val projection = unfoldableChildren.map {
+              case e if filter.isDefined => If(filter.get, e, nullify(e))
+              case e => e
+            }
+            (projection, exprAttrs, (ae, aggExpr))
+        }
+        val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.map(_._2).flatten
+        val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2)
+        val allAggAttrs = (regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs)
 
 Review comment:
   OK


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576981621
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21972/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520076
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116907/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572874663
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116434/
   Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575587787
 
 
   **[Test build #116939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116939/testReport)** for PR 27058 at commit [`5462c0c`](https://github.com/apache/spark/commit/5462c0ce605cf95e608c945e4ca8bff358740f50).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578876431
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117442/
   Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368940140
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
 
 Review comment:
   Currently, this PR has only end-to-end tests. I think we also need fine-grained tests for the new logic in this rule.
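
As one small example of a fine-grained check, the `mayNeedtoRewrite` condition from the diff above can be exercised in isolation. The sketch below is a toy model under stated assumptions: `Agg`, `asWritten`, and `simplified` are hypothetical names, and nothing here uses Spark's test utilities. Within this model the second disjunct never changes the result, because `exists` on an empty sequence is false, so the condition behaves like a plain non-emptiness test.

```scala
// Toy model of the predicate only; these are not Catalyst classes.
object MayNeedToRewrite {
  final case class Agg(isDistinct: Boolean, filter: Option[String])

  // The condition as written in the diff, applied to the DISTINCT aggregates.
  def asWritten(aggs: Seq[Agg]): Boolean = {
    val distinctAggs = aggs.filter(_.isDistinct)
    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
  }

  // Hypothetical simplification: is there any DISTINCT aggregate at all?
  def simplified(aggs: Seq[Agg]): Boolean = aggs.exists(_.isDistinct)

  def main(args: Array[String]): Unit = {
    val samples = for {
      distinct <- Seq(true, false)
      filter <- Seq(Option("id > 1"), Option.empty[String])
    } yield Agg(distinct, filter)
    // Compare both forms on the empty input, on singletons, and on pairs.
    val inputs = Seq(Seq.empty[Agg]) ++ samples.map(Seq(_)) ++
      (for (x <- samples; y <- samples) yield Seq(x, y))
    assert(inputs.forall(in => asWritten(in) == simplified(in)))
    println(s"checked ${inputs.size} inputs")
  }
}
```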


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368940374
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions. Because we will construct a new
 
 Review comment:
   OK


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r365198561
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
         }.asInstanceOf[NamedExpression]
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
+    } else if (distinctAggGroups.size == 1) {
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            // Why do we need to construct the phantom id ?
+            // First, In order to reduce costs, it is better to handle the filter clause locally.
+            // e.g. COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate expression
+            // If(id > 1) 'a else null first, and use the result as output.
+            // Second, If more than one DISTINCT aggregate expression uses the same column,
+            // We need to construct the phantom attributes so as the output not lost.
+            // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+            // attribute 'phantom1-a and attribute 'phantom2-a instead of two 'a.
+            // Note: We just need to illusion the expression with filter clause.
+            // The illusionary mechanism may result in multiple distinct aggregations uses
+            // different column, so we still need to call `rewrite`.
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
 
 Review comment:
   Just an example. It's OK to drop `_group`.
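
The idea in the code comments of the diff above, evaluating the filter locally as `If(filter, e, null)` and then running the DISTINCT aggregate without a filter, can be sketched on plain Scala collections. This is a minimal illustration, not the rule itself: `Student` and both functions are hypothetical, `None` stands in for SQL NULL, and an empty input yields 0 here whereas SQL `SUM` would yield NULL.

```scala
// Plain-collections sketch of
// `sum(distinct id) filter (where sex = 'man')`; all names are hypothetical.
object FilterAsProjection {
  final case class Student(id: Int, sex: String)

  // Reference semantics: keep the rows passing the filter, then sum distinct ids.
  def sumDistinctFiltered(rows: Seq[Student]): Int =
    rows.filter(_.sex == "man").map(_.id).distinct.sum

  // Rewritten form: project every row to Some(id) when the filter holds and
  // to None (standing in for NULL) otherwise, then aggregate with no filter.
  // Dropping the None values mirrors how aggregates ignore NULL inputs.
  def sumDistinctViaProjection(rows: Seq[Student]): Int = {
    val projected: Seq[Option[Int]] =
      rows.map(r => if (r.sex == "man") Some(r.id) else None)
    projected.flatten.distinct.sum
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Student(1, "man"), Student(1, "man"), Student(2, "woman"), Student(3, "man"))
    println(sumDistinctFiltered(rows))
    println(sumDistinctViaProjection(rows))
  }
}
```

Both functions return the same value for any input in this model, which is the property the projection-based rewrite relies on.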


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573043757
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21279/
   Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367752087
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
 
 Review comment:
   How about `expandFiltersInDistinctAggregate`?


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367741302
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
 
 Review comment:
   `val regularAggChildren = regularAggExpressions.flatMap(_.references)`?
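
A minimal sketch of how the suggestion above differs from the code under review, using toy stand-ins for Catalyst expressions. `Expr`, `Attr`, `Lit`, `Add`, `Gt`, `Agg` and both helper functions are hypothetical and are not Spark's classes; the real `references` returns an `AttributeSet` rather than a `Seq`.

```scala
object ReferencesSketch {
  // Toy stand-ins for Catalyst expressions (hypothetical, not Spark's classes).
  sealed trait Expr {
    def children: Seq[Expr]
    // An expression is foldable here when all of its inputs are constants.
    def foldable: Boolean = children.nonEmpty && children.forall(_.foldable)
    // Every attribute referenced anywhere in this expression tree.
    def references: Seq[Attr] = children.flatMap(_.references).distinct
  }
  final case class Attr(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    override def foldable: Boolean = false
    override def references: Seq[Attr] = Seq(this)
  }
  final case class Lit(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    override def foldable: Boolean = true
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class Gt(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  // A non-distinct aggregate with one argument and an optional FILTER predicate.
  final case class Agg(name: String, child: Expr, filter: Option[Expr]) {
    def references: Seq[Attr] =
      (child.references ++ filter.toSeq.flatMap(_.references)).distinct
  }

  // Shape of the code under review: non-foldable arguments plus filter attributes.
  def childrenAndFilterAttrs(aggs: Seq[Agg]): Seq[Expr] = {
    val funChildren = aggs.map(_.child).filter(!_.foldable)
    val filterAttrs = aggs.flatMap(_.filter.toSeq.flatMap(_.references))
    (funChildren ++ filterAttrs).distinct
  }

  // Shape of the suggestion: every attribute the aggregates refer to.
  def allReferences(aggs: Seq[Agg]): Seq[Attr] =
    aggs.flatMap(_.references).distinct

  def main(args: Array[String]): Unit = {
    // Models: sum(a + 1) filter (where b > 0)
    val aggs = Seq(Agg("sum", Add(Attr("a"), Lit(1)), Some(Gt(Attr("b"), Lit(0)))))
    println(childrenAndFilterAttrs(aggs))
    println(allReferences(aggs))
  }
}
```

In this toy model the two only differ when an aggregate argument is a compound expression: the first helper keeps `a + 1` as a whole, while the second returns the bare attributes `a` and `b`. Any lookup keyed by the argument expression would have to account for that difference.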


[GitHub] [spark] beliefer removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572444344
 
 
   Retest this please


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r371613436
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
 ##########
 @@ -165,20 +165,35 @@ package object dsl {
     }
 
     def rand(e: Long): Expression = Rand(e)
-    def sum(e: Expression): Expression = Sum(e).toAggregateExpression()
-    def sumDistinct(e: Expression): Expression = Sum(e).toAggregateExpression(isDistinct = true)
-    def count(e: Expression): Expression = Count(e).toAggregateExpression()
+    def sum(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(false, filter)
+    def sumDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(true, filter)
+    def count(e: Expression, filter: Option[Expression] = None): Expression =
+      Count(e).toAggregateExpression(false, filter)
     def countDistinct(e: Expression*): Expression =
       Count(e).toAggregateExpression(isDistinct = true)
-    def approxCountDistinct(e: Expression, rsd: Double = 0.05): Expression =
-      HyperLogLogPlusPlus(e, rsd).toAggregateExpression()
-    def avg(e: Expression): Expression = Average(e).toAggregateExpression()
-    def first(e: Expression): Expression = new First(e).toAggregateExpression()
-    def last(e: Expression): Expression = new Last(e).toAggregateExpression()
-    def min(e: Expression): Expression = Min(e).toAggregateExpression()
-    def minDistinct(e: Expression): Expression = Min(e).toAggregateExpression(isDistinct = true)
-    def max(e: Expression): Expression = Max(e).toAggregateExpression()
-    def maxDistinct(e: Expression): Expression = Max(e).toAggregateExpression(isDistinct = true)
+    def countDistinct(filter: Option[Expression], e: Expression*): Expression =
+      Count(e).toAggregateExpression(true, filter)
+    def approxCountDistinct(
+        e: Expression,
+        rsd: Double = 0.05,
+        filter: Option[Expression] = None): Expression =
+      HyperLogLogPlusPlus(e, rsd).toAggregateExpression(true, filter)
+    def avg(e: Expression, filter: Option[Expression] = None): Expression =
+      Average(e).toAggregateExpression(false, filter)
+    def first(e: Expression, filter: Option[Expression] = None): Expression =
+      new First(e).toAggregateExpression(false, filter)
+    def last(e: Expression, filter: Option[Expression] = None): Expression =
+      new Last(e).toAggregateExpression(false, filter)
+    def min(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(false, filter)
+    def minDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(true, filter)
+    def max(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(false, filter)
+    def maxDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(true, filter)
     def upper(e: Expression): Expression = Upper(e)
 
 Review comment:
   Ok


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575031814
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116825/
   Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572440680
 
 
   **[Test build #116345 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116345/testReport)** for PR 27058 at commit [`bcca92d`](https://github.com/apache/spark/commit/bcca92d14d13531bd02ab9f9eb19afe001f8754e).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367768976
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
 
 Review comment:
   `regularOperatorMap` -> `regularAggMap`?


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-575028449
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21597/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574943784
 
 
   **[Test build #116803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116803/testReport)** for PR 27058 at commit [`c6392db`](https://github.com/apache/spark/commit/c6392db3fbedd7442d27d263b052e0fd940f6d32).


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581009195
 
 
   **[Test build #117714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117714/testReport)** for PR 27058 at commit [`a9f8812`](https://github.com/apache/spark/commit/a9f88120f60273237a8bfd170a27132b71e4d659).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572446217
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367785540
 
 

 ##########
 File path: sql/core/src/test/resources/sql-tests/inputs/group-by-filter.sql
 ##########
 @@ -58,13 +62,18 @@ select dept_id, count(distinct emp_name), count(distinct hiredate), sum(salary),
 select dept_id, count(distinct emp_name), count(distinct hiredate), sum(salary), sum(salary) filter (where id + dept_id > 500) from emp group by dept_id;
 select dept_id, count(distinct emp_name), count(distinct hiredate), sum(salary) filter (where salary < 400.00D), sum(salary) filter (where id > 200) from emp group by dept_id;
 select dept_id, count(distinct emp_name), count(distinct hiredate), sum(salary) filter (where salary < 400.00D), sum(salary) filter (where id + dept_id > 500) from emp group by dept_id;
--- [SPARK-30276] Support Filter expression allows simultaneous use of DISTINCT
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate), sum(salary) from emp group by dept_id;
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) from emp group by dept_id;
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) filter (where salary < 400.00D) from emp group by dept_id;
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) filter (where salary < 400.00D), sum(salary) filter (where id > 200) from emp group by dept_id;
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct emp_name), sum(salary) from emp group by dept_id;
--- select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct emp_name) filter (where hiredate > date "2003-01-01"), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id + dept_id > 500), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name), count(distinct emp_name) filter (where id > 200), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name), count(distinct emp_name) filter (where id + dept_id > 500), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name), count(distinct emp_name) filter (where id > 200), sum(salary), sum(salary) filter (where id > 200) from emp group by dept_id;
+select dept_id, count(distinct emp_name), count(distinct emp_name) filter (where id + dept_id > 500), sum(salary), sum(salary) filter (where id > 200) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) filter (where salary < 400.00D) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct hiredate) filter (where hiredate > date "2003-01-01"), sum(salary) filter (where salary < 400.00D), sum(salary) filter (where id > 200) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct emp_name), sum(salary) from emp group by dept_id;
+select dept_id, count(distinct emp_name) filter (where id > 200), count(distinct emp_name) filter (where hiredate > date "2003-01-01"), sum(salary) from emp group by dept_id;
 
 Review comment:
   OK
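
For readers unfamiliar with the clause, the first enabled query in the diff above can be modelled with plain Scala collections. This is only a sketch of the intended semantics: the `Emp` case class and the row values are made up, and only the column names follow the `emp` table used by the test file.

```scala
object FilteredDistinctSketch {
  // Hypothetical row type; column names mirror the emp table in the test file.
  final case class Emp(id: Int, empName: String, deptId: Int, salary: Double)

  // Models:
  //   select dept_id, count(distinct emp_name) filter (where id > 200), sum(salary)
  //   from emp group by dept_id
  def query(rows: Seq[Emp]): Map[Int, (Int, Double)] =
    rows.groupBy(_.deptId).map { case (dept, group) =>
      // The FILTER predicate restricts only the aggregate it is attached to,
      // and it is applied before duplicates are removed.
      val distinctNames = group.filter(_.id > 200).map(_.empName).distinct.size
      // sum(salary) has no FILTER, so it still sees every row of the group.
      val totalSalary = group.map(_.salary).sum
      (dept, (distinctNames, totalSalary))
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Emp(100, "a", 10, 100.0),
      Emp(300, "b", 10, 200.0),
      Emp(400, "b", 10, 50.0),
      Emp(500, "c", 20, 10.0)
    )
    query(rows).toSeq.sortBy(_._1).foreach(println)
  }
}
```

The sketch ignores NULL handling, which the real `count` aggregate has to respect.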


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576982894
 
 
   **[Test build #117212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117212/testReport)** for PR 27058 at commit [`c9e95c3`](https://github.com/apache/spark/commit/c9e95c33bc93e18316981704a776c07f8ffb12fb).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579088581
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367791466
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+          val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+          val filterOpt = filter.map(_.transform {
+            case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+          })
+          val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+          (ae, aggExpr)
+      }
+
+      // Setup expand for the distinct aggregate expressions.
+      val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val rewriteDistinctOperatorMap = distinctAggExprs.map {
 
 Review comment:
   `val (var1, var2, var3) = distinctAggExprs.map { ... }.unzip3`?
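
The `unzip3` idiom suggested above, shown on standard-library collections only. The names `split`, `upper`, `lengths` and `isCount` are placeholders; the three tuple components stand in for whatever the `map` body in the PR produces per distinct aggregate, which the excerpt does not show.

```scala
object Unzip3Sketch {
  // Map each element to a triple, then split the sequence of triples
  // into three parallel sequences in a single step.
  def split(names: Seq[String]): (Seq[String], Seq[Int], Seq[Boolean]) =
    names.map { n =>
      (n.toUpperCase, n.length, n.startsWith("count"))
    }.unzip3

  def main(args: Array[String]): Unit = {
    // Destructure the result directly instead of keeping a Seq of triples
    // and projecting _1, _2 and _3 out of it later.
    val (upper, lengths, isCount) = split(Seq("count", "sum"))
    println(upper)
    println(lengths)
    println(isCount)
  }
}
```

The benefit is mostly readability: each resulting sequence gets a descriptive name at the point where it is created.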


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575516885
 
 
   **[Test build #116915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116915/testReport)** for PR 27058 at commit [`5462c0c`](https://github.com/apache/spark/commit/5462c0ce605cf95e608c945e4ca8bff358740f50).


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367753319
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
 
 Review comment:
   I want to reuse `rewriteDistinctAggregate` and the inner `Expand`.
   Merging the two steps would make `rewriteDistinctAggregate` more complex,
   and I don't think that is worth it.
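
The trade-off argued above can be sketched as two composed passes over a toy model. Everything here is an assumption made for illustration: `Agg`, `expandFilters`, `distinctGroups` and `rewrite` are hypothetical names, and folding the filter into an `if(cond, arg, null)` argument is just one way to eliminate a filter (valid only for aggregates that ignore NULL). The excerpt does not show whether the PR does the same.

```scala
object TwoPassSketch {
  // Toy aggregate: function name, argument text, DISTINCT flag, optional FILTER text.
  final case class Agg(fn: String, arg: String, distinct: Boolean, filter: Option[String])

  // Pass 1 (new): fold the FILTER of each DISTINCT aggregate into its argument,
  // so that later passes never see a filtered DISTINCT aggregate.
  def expandFilters(aggs: Seq[Agg]): Seq[Agg] = aggs.map {
    case Agg(fn, arg, true, Some(cond)) =>
      Agg(fn, s"if($cond, $arg, null)", distinct = true, filter = None)
    case other => other
  }

  // Pass 2 (pre-existing, left untouched): group DISTINCT aggregates by argument.
  // It only understands DISTINCT aggregates that carry no filter.
  def distinctGroups(aggs: Seq[Agg]): Map[String, Seq[String]] = {
    require(aggs.forall(a => !a.distinct || a.filter.isEmpty),
      "filtered DISTINCT aggregates must be expanded first")
    aggs.filter(_.distinct).groupBy(_.arg).map { case (arg, as) => (arg, as.map(_.fn)) }
  }

  // The combined rule is plain composition; pass 2 needs no changes.
  val rewrite: Seq[Agg] => Map[String, Seq[String]] =
    aggs => distinctGroups(expandFilters(aggs))
}
```

Keeping the passes separate costs one extra traversal but lets the second pass keep its original precondition, which is the reuse argument made in the comment.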


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575501041
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581022473
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117714/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570188414
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116017/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575519997
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116915/
   Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r365040943
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
 
 Review comment:
   In the current approach, it seems the filter conditions are evaluated in the Expand rather than in the Aggregate. Do you mean that even in this case (the same aggregate group) we cannot use the aggregate filter logic in `AggregationIterator` that you implemented in your previous commit?
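
To make the question above concrete, here is a minimal sketch of the two evaluation strategies, using plain Scala collections in place of Spark rows. It is not Spark code: `Row`, `countDistinctFiltered`, `countDistinctProjected` and `sample` are hypothetical names chosen for illustration, and the query `COUNT(DISTINCT a) FILTER (WHERE id > 1)` is taken from the code comment in this PR.

```scala
// Illustrative sketch only: plain collections stand in for Spark rows.
// All names here are hypothetical and not part of Spark.
object FilterInExpandSketch {
  // `a` is nullable, so it is modelled as an Option.
  final case class Row(id: Int, a: Option[Int])

  // Filter evaluated by the aggregate itself:
  // COUNT(DISTINCT a) FILTER (WHERE id > 1)
  def countDistinctFiltered(rows: Seq[Row]): Int =
    rows.filter(_.id > 1).flatMap(_.a).distinct.size

  // Filter folded into a projection first (the Expand-style rewrite):
  // project `if (id > 1) a else null`, then run a plain COUNT(DISTINCT ...).
  def countDistinctProjected(rows: Seq[Row]): Int = {
    val projected: Seq[Option[Int]] = rows.map(r => if (r.id > 1) r.a else None)
    // COUNT ignores nulls, so the projected nulls drop out here.
    projected.flatten.distinct.size
  }

  val sample: Seq[Row] = Seq(
    Row(1, Some(10)), Row(2, Some(10)), Row(3, Some(20)), Row(4, None), Row(5, Some(20))
  )
}
```

Both functions should agree on any input, because an aggregate that ignores nulls cannot tell a filtered-out row from a row whose input was projected to null.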


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572946945
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575467747
 
 
   **[Test build #116895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116895/testReport)** for PR 27058 at commit [`0ed5be0`](https://github.com/apache/spark/commit/0ed5be0b4e7edab5fef67c08488af7e85b999c40).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508198
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367756787
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+          val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+          val filterOpt = filter.map(_.transform {
+            case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+          })
+          val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+          (ae, aggExpr)
+      }
+
+      // Setup expand for the distinct aggregate expressions.
+      val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val rewriteDistinctOperatorMap = distinctAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          // Why do we need to construct the phantom id?
+          // First, to reduce cost, it is better to handle the filter clause locally.
+          // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
+          // If(id > 1) 'a else null, and use the result as output.
+          // Second, if more than one DISTINCT aggregate expression uses the same column,
+          // we need to construct phantom attributes so that no output is lost.
+          // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+          // attribute '_gen_distinct-1 and attribute '_gen_distinct-2 instead of two 'a.
+          // Note: we only need to create this illusion for expressions with a filter clause.
+          // The illusion mechanism may result in multiple distinct aggregations that use
+          // different columns, so we still need to call `rewrite`.
 
 Review comment:
   OK
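
The phantom attributes described in the code comment of this hunk can be illustrated with a small sketch on plain Scala collections. It is not the PR's implementation: `Row`, `Projected`, `project` and `aggregate` are hypothetical names, and the two generated fields only mirror the `_gen_distinct-1` and `_gen_distinct-2` attributes named in the comment.

```scala
// Illustrative sketch only; all names are hypothetical and not part of Spark.
object PhantomColumnSketch {
  final case class Row(id: Int, a: Option[Int])

  // One generated column per DISTINCT aggregate, even though both read column `a`.
  final case class Projected(genDistinct1: Option[Int], genDistinct2: Option[Int])

  // genDistinct1 feeds SUM(DISTINCT a);
  // genDistinct2 feeds COUNT(DISTINCT a) FILTER (WHERE id > 1), with the
  // predicate already applied: if (id > 1) a else null.
  def project(rows: Seq[Row]): Seq[Projected] =
    rows.map(r => Projected(r.a, if (r.id > 1) r.a else None))

  // Returns (SUM(DISTINCT a), COUNT(DISTINCT a) FILTER (WHERE id > 1)).
  // Simplification: SQL SUM over no rows is NULL; this sketch returns 0 instead.
  def aggregate(rows: Seq[Row]): (Int, Int) = {
    val p = project(rows)
    val sumDistinct = p.flatMap(_.genDistinct1).distinct.sum
    val countDistinct = p.flatMap(_.genDistinct2).distinct.size
    (sumDistinct, countDistinct)
  }
}
```

After the projection the two aggregates read different columns, which is why the comment says the ordinary multi-distinct rewrite still has to run afterwards.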


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367318890
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,26 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val distinctAggregateExprs = aggregateExprs.flatMap(_.collect {
+              case ae: AggregateExpression if ae.isDistinct => ae
+            })
+            val distinctAggGroups = distinctAggregateExprs.groupBy { e =>
+              val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
+              if (unfoldableChildren.nonEmpty) {
+                unfoldableChildren
+              } else {
+                e.aggregateFunction.children.take(1).toSet
 
 Review comment:
   This PR has been changed to support multiple DISTINCT aggregates with a FILTER clause. I have updated the title and description of this PR accordingly.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367786715
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
 
 Review comment:
   OK


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620588
 
 
   **[Test build #116781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116781/testReport)** for PR 27058 at commit [`aa37870`](https://github.com/apache/spark/commit/aa37870ec36a8bf182c360a37c46d138d267f8a4).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579126684
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578618570
 
 
   **[Test build #117432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117432/testReport)** for PR 27058 at commit [`71ba1f4`](https://github.com/apache/spark/commit/71ba1f46229cb9443658818b1f94b2973fbc37ce).


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581009195
 
 
   **[Test build #117714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117714/testReport)** for PR 27058 at commit [`a9f8812`](https://github.com/apache/spark/commit/a9f88120f60273237a8bfd170a27132b71e4d659).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578625016
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22193/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633679
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117432/
   Test FAILed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364545988
 
 

 ##########
 File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ##########
 @@ -192,8 +192,10 @@ class AnalysisErrorSuite extends AnalysisTest {
 
   errorTest(
     "DISTINCT and FILTER cannot be used in aggregate functions at the same time",
-    CatalystSqlParser.parsePlan("SELECT count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
-    "DISTINCT and FILTER cannot be used in aggregate functions at the same time" :: Nil)
+    CatalystSqlParser.parsePlan("SELECT sum(DISTINCT b), " +
+      "count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
+    "When there are multiple DISTINCT aggregate expressions acting on different " +
+    "fields, any DISTINCT aggregate expression not allow use FILTER clause." :: Nil)
 
 Review comment:
   OK. I will add tests in group-by-filter.sql


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368940326
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exist at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because the aggregation strategy can handle a
+    // single distinct group.
+    // The other is to expand distinct aggregates that have a filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Set up expand for the 'regular' aggregate expressions. Because we will construct a new
+      // aggregate, the children of the distinct aggregates will be changed to the generated
+      // ones, so we need to create new references to avoid collisions between distinct and
+      // regular aggregate children.
+      val regularAggExprs = regularAggExpressions.filter(_.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularAggMap = regularAggExprs.map {
 
 Review comment:
   OK


[GitHub] [spark] maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572936002
 
 
   Yea, I know that. What I want to know is whether we need a lot of code changes to support the third case (SPARK-30396). ISTM we can easily extend this approach (expand w/ filters in the first aggregation phase) to support that. If so, I think it's OK to merge the two pieces of work into this single PR. Is there any technical barrier to doing so?
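
   For readers unfamiliar with the "expand w/ filters in the first aggregation phase" idea, the following is a rough plain-Scala model of it, not Catalyst code. It assumes a hypothetical query `COUNT(DISTINCT a) FILTER (WHERE id > 1), SUM(DISTINCT b) FILTER (WHERE id < 4)`; the `Row` and `Expanded` types, the column names and the two predicates are invented for illustration.

   ```scala
   object ExpandWithFiltersSketch {
     // Hypothetical input row; None models SQL NULL.
     final case class Row(id: Int, a: Option[Int], b: Option[Int])
     // One expanded row per (input row, distinct group); gid tags the group.
     final case class Expanded(gid: Int, a: Option[Int], b: Option[Int])

     // Expand step: each distinct group gets its own copy of the row, with the
     // group's filter already applied (value kept if the filter holds, else NULL).
     def expand(rows: Seq[Row]): Seq[Expanded] = rows.flatMap { r =>
       Seq(
         Expanded(1, if (r.id > 1) r.a else None, None),
         Expanded(2, None, if (r.id < 4) r.b else None)
       )
     }

     // First phase: de-duplicate on (gid, a, b).
     // Second phase: plain, filter-free aggregates per gid; NULLs are ignored.
     def aggregate(rows: Seq[Row]): (Int, Int) = {
       val firstPhase = expand(rows).distinct
       val countA = firstPhase.count(e => e.gid == 1 && e.a.isDefined)
       val sumB = firstPhase.collect { case Expanded(2, _, Some(b)) => b }.sum
       (countA, sumB)
     }
   }
   ```

   Because each filter is folded into the expanded value, the second phase no longer needs to know about the FILTER clauses at all, which is why the approach might extend to DISTINCT aggregates on different columns.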

[GitHub] [spark] beliefer commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570185163
 
 
   cc @cloud-fan @maropu 

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364544828
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
 
 Review comment:
               // Why do we need to construct the phantom id?
               // First, to reduce cost, it is better to handle the filter clause locally.
               // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate the expression
               // If(id > 1) 'a else null first, and use the result as output.
               // Second, if more than one DISTINCT aggregate expression uses the same column,
               // we need to construct phantom attributes so that the output is not lost.
               // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
               // attribute 'a and attribute 'phantom1-a instead of two 'a.
               // Note: we only need to create a phantom attribute for expressions with a filter clause.
               // The phantom mechanism may result in multiple distinct aggregations that use
               // different columns, so we still need to call `rewrite`.
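
   The equivalence this comment relies on can be checked with a small plain-Scala model. This is only a sketch under stated assumptions: `Row` and the helper names are hypothetical and are not part of Spark, and `None` stands in for SQL NULL.

   ```scala
   object PhantomFilterSketch {
     // Hypothetical input row with the columns used in the comment's example.
     final case class Row(id: Int, a: Option[Int])

     // COUNT(DISTINCT a) FILTER (WHERE id > 1), evaluated directly.
     def countDistinctWithFilter(rows: Seq[Row]): Int =
       rows.filter(_.id > 1).flatMap(_.a).distinct.size

     // The same aggregate after evaluating the filter locally: project
     // `if (id > 1) a else null` into a phantom column, then run a
     // filter-free COUNT(DISTINCT phantom). COUNT ignores NULLs.
     def countDistinctViaPhantom(rows: Seq[Row]): Int = {
       val phantom: Seq[Option[Int]] = rows.map(r => if (r.id > 1) r.a else None)
       phantom.flatten.distinct.size
     }

     // SUM(DISTINCT a) without a filter keeps reading the original column,
     // which is why the filtered aggregate needs its own phantom attribute.
     def sumDistinct(rows: Seq[Row]): Int = rows.flatMap(_.a).distinct.sum
   }
   ```

   The two count functions agree on any input, while `sumDistinct` shows why the unfiltered aggregate cannot share the projected column.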

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364536129
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,29 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val aggExpressions = aggregateExprs.flatMap { e =>
+              e.collect {
+                case ae: AggregateExpression => ae
+              }
+            }
+            val aggWithDistinctAndFilters = aggExpressions.filter(e => e.isDistinct)
+            val aggGroups = aggWithDistinctAndFilters.groupBy { e =>
 
 Review comment:
   OK

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364236076
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,29 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val aggExpressions = aggregateExprs.flatMap { e =>
+              e.collect {
+                case ae: AggregateExpression => ae
+              }
+            }
+            val aggWithDistinctAndFilters = aggExpressions.filter(e => e.isDistinct)
+            val aggGroups = aggWithDistinctAndFilters.groupBy { e =>
+              val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
+              if (unfoldableChildren.nonEmpty) {
+                unfoldableChildren
+              } else {
+                e.aggregateFunction.children.take(1).toSet
+              }
+            }
+            val useDistinctAndFilter = aggWithDistinctAndFilters.exists(_.filter.isDefined)
+            // TODO: SPARK-30396 When there are multiple DISTINCT aggregate expressions
+            // acting on different fields, any DISTINCT aggregate expression allows the use of
+            // the FILTER clause
+            if (aggGroups.size > 1 && useDistinctAndFilter) {
 
 Review comment:
   nit: how about inlining `useDistinctAndFilter`?
   ```
               if (aggGroups.size > 1 && aggWithDistinctAndFilters.exists(_.filter.isDefined)) {
   ```
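
   To make the condition under discussion easier to follow, here is a minimal plain-Scala model of the grouping in the quoted hunk. `Child` and `Agg` are hypothetical stand-ins for Catalyst expressions, and the branch body is not shown in this excerpt, so the function only reports whether the condition holds.

   ```scala
   object DistinctGroupSketch {
     // Hypothetical stand-ins for Catalyst expression types.
     final case class Child(name: String, foldable: Boolean)
     final case class Agg(
         fn: String,
         children: Seq[Child],
         isDistinct: Boolean,
         hasFilter: Boolean)

     // Group key: the non-foldable children, or the first child if all are foldable.
     def groupKey(e: Agg): Set[Child] = {
       val unfoldable = e.children.filter(!_.foldable).toSet
       if (unfoldable.nonEmpty) unfoldable else e.children.take(1).toSet
     }

     // True when the DISTINCT aggregates span more than one column group and
     // at least one of them carries a FILTER clause (the inlined form of the check).
     def spansMultipleGroupsWithFilter(aggs: Seq[Agg]): Boolean = {
       val distinctAggs = aggs.filter(_.isDistinct)
       distinctAggs.groupBy(groupKey).size > 1 && distinctAggs.exists(_.hasFilter)
     }
   }
   ```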

[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570140113
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116012/
   Test FAILed.

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364544828
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
 
 Review comment:
               // Why do we need to construct the phantom id?
               // First, to reduce cost, it is better to handle the filter clause locally.
               // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate the expression
               // If(id > 1) 'a else null first, and use the result as output.
               // Second, if more than one DISTINCT aggregate expression uses the same column,
               // we need to construct phantom attributes so that the output is not lost.
               // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
               // attribute 'a and attribute 'phantom1-a instead of two 'a.
               // Note: we only need to create a phantom attribute for expressions with a filter clause.

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573043256
 
 
   **[Test build #116491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116491/testReport)** for PR 27058 at commit [`45d6971`](https://github.com/apache/spark/commit/45d69714a7a1bb6dc8a05a11a2b885174f5d3fca).

[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574963615
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116803/
   Test FAILed.

[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508244
 
 
   Merged build finished. Test FAILed.

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367766975
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
 
 Review comment:
   Since `rewrite` has less code now, can you inline this in `apply`?

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367318890
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ##########
 @@ -207,6 +207,26 @@ trait CheckAnalysis extends PredicateHelper {
                 s"of type ${condition.dataType.catalogString} is not a boolean.")
 
           case Aggregate(groupingExprs, aggregateExprs, child) =>
+            val distinctAggregateExprs = aggregateExprs.flatMap(_.collect {
+              case ae: AggregateExpression if ae.isDistinct => ae
+            })
+            val distinctAggGroups = distinctAggregateExprs.groupBy { e =>
+              val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
+              if (unfoldableChildren.nonEmpty) {
+                unfoldableChildren
+              } else {
+                e.aggregateFunction.children.take(1).toSet
 
 Review comment:
   This PR has been changed to support multiple DISTINCT aggregates with a FILTER clause. I have updated the title and description of this PR accordingly.
   We no longer need this code.

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367757765
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -331,6 +457,17 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     }
   }
 
+  /**
+   * Collect all aggregate expressions.
 
 Review comment:
   OK

[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520172
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116895/
   Test FAILed.

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577669519
 
 
   **[Test build #117291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117291/testReport)** for PR 27058 at commit [`4fa0dd8`](https://github.com/apache/spark/commit/4fa0dd89ed36eefe8c7f12e39b23eb666a351673).
    * This patch **fails SparkR unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572946419
 
 
   **[Test build #116480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116480/testReport)** for PR 27058 at commit [`c65092b`](https://github.com/apache/spark/commit/c65092b9241d53a4b49909b2ac36c7c180db6aca).

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578633293
 
 
   **[Test build #117432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117432/testReport)** for PR 27058 at commit [`71ba1f4`](https://github.com/apache/spark/commit/71ba1f46229cb9443658818b1f94b2973fbc37ce).
    * This patch **fails due to an unknown error code, -9**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579087842
 
 
   **[Test build #117456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117456/testReport)** for PR 27058 at commit [`5c80418`](https://github.com/apache/spark/commit/5c804182b498c0e804cb0c3157d1dbd31d182e24).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579038043
 
 
   Merged build finished. Test PASSed.

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570139949
 
 
   **[Test build #116017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116017/testReport)** for PR 27058 at commit [`a4fd143`](https://github.com/apache/spark/commit/a4fd143fa92cb569764b669e5c53f6dfd1f1ae6a).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577568841
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r366139379
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
         }.asInstanceOf[NamedExpression]
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
+    } else if (distinctAggGroups.size == 1) {
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
 +            // Why do we need to construct the phantom id?
 +            // First, to reduce cost, it is better to handle the filter clause locally.
 +            // e.g. for COUNT(DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
 +            // If(id > 1) 'a else null, and use the result as the output.
 +            // Second, if more than one DISTINCT aggregate expression uses the same column,
 +            // we need to construct phantom attributes so that no output is lost.
 +            // e.g. SUM(DISTINCT a), COUNT(DISTINCT a) FILTER (WHERE id > 1) will output
 +            // attribute '_gen_distinct_1 and attribute '_gen_distinct_2 instead of two 'a.
 +            // Note: we only need to create this illusion for expressions with a filter clause.
 +            // The phantom mechanism may result in multiple distinct aggregations that use
 +            // different columns, so we still need to call `rewrite`.
+            val phantomId = NamedExpression.newExprId.id
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"_gen_distinct_$phantomId", e.dataType, nullable = true)())
+            }
+            val exprAttrLookup = exprAttrs.toMap
+            val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = None)
+            // Expand projection
+            val projection = unfoldableChildren.map {
+              case e if filter.isDefined => If(filter.get, e, nullify(e))
+              case e => e
+            }
+            (projection, exprAttrs, (ae, aggExpr))
+        }
+        val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.flatMap(_._2)
+        val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2)
+        val allAggAttrs = regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs
+        // Construct the aggregate input projection.
+        val rewriteDistinctProjections = rewriteDistinctOperatorMap.flatMap(_._1)
+        val rewriteAggProjections =
+          Seq((a.groupingExpressions ++ regularAggChildren ++ rewriteDistinctProjections))
+        val groupByMap = a.groupingExpressions.collect {
+          case ne: NamedExpression => ne -> ne.toAttribute
+          case e => e -> AttributeReference(e.sql, e.dataType, e.nullable)()
+        }
+        val groupByAttrs = groupByMap.map(_._2)
+        // Construct the expand operator.
+        val expand = Expand(rewriteAggProjections, groupByAttrs ++ allAggAttrs, a.child)
+        val rewriteAggExprLookup =
+          (rewriteDistinctOperatorMap.map(_._3) ++ regularOperatorMap).toMap
+        val patchedAggExpressions = a.aggregateExpressions.map { e =>
+          e.transformDown {
+            case ae: AggregateExpression => rewriteAggExprLookup.getOrElse(ae, ae)
+          }.asInstanceOf[NamedExpression]
+        }
+        val expandAggregate = Aggregate(groupByAttrs, patchedAggExpressions, expand)
+        if (distinctAggExpressions.size > 1) {
+          rewrite(expandAggregate)
 
 Review comment:
   Can you avoid this recursive call? Can't we move this pre-processing logic (pushing down aggregate filters into projections in `Expand`) to line 186? https://github.com/apache/spark/blob/1846b0261b84ce1bca079bc59fb4518bff910c18/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala#L186
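
For illustration, a minimal sketch of what "pushing down aggregate filters into projections" means, written against plain Scala collections rather than Catalyst. The `Student` type and both function names are hypothetical, loosely modelled on the `student` table in the PR description. `None` stands in for SQL NULL, and an empty input sums to 0 here, whereas SQL `SUM` would return NULL.

```scala
// Hypothetical row type; not part of Spark.
final case class Student(id: Int, sex: String, classId: Int)

object FilterPushdownSketch {
  // Reference semantics of SUM(DISTINCT id) FILTER (WHERE sex = 'man'):
  // filter the rows first, then de-duplicate and sum.
  def sumDistinctWithFilter(rows: Seq[Student]): Int =
    rows.filter(_.sex == "man").map(_.id).distinct.sum

  // Rewritten form: project If(sex = 'man', id, null) for every row, then run
  // SUM(DISTINCT ...) with no filter. Nulls (None) are ignored by the aggregate.
  def sumDistinctAfterProjection(rows: Seq[Student]): Int = {
    val projected: Seq[Option[Int]] =
      rows.map(r => if (r.sex == "man") Some(r.id) else None)
    projected.flatten.distinct.sum
  }
}
```

Both functions should agree on any input, which is the property that lets the filter move out of the aggregate expression and into the projection that feeds it.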


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620954
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573040878
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21277/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572543828
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367736938
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
 
 Review comment:
   nit: `regularFunChildren` -> `regularFunArguments`


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572446223
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21160/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574963520
 
 
   **[Test build #116803 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116803/testReport)** for PR 27058 at commit [`c6392db`](https://github.com/apache/spark/commit/c6392db3fbedd7442d27d263b052e0fd940f6d32).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367753200
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
 
 Review comment:
   yea, it looks ok.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367741733
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
 
 Review comment:
   Can you describe this in more detail here? Something like `Creates new references to avoid collisions between distinct and regular aggregate children...`
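
To make the "new references" idea concrete, here is a small sketch under stated assumptions. `Expr`, `Col`, `IfElseNull`, `DistinctAgg` and `PhantomAttributeSketch` are all hypothetical stand-ins, not Catalyst classes, and the fresh names are numbered by position, whereas the quoted diff derives them from `NamedExpression.newExprId.id`. Each distinct aggregate gets its own generated column, so two aggregates over the same input column no longer collide once one of them has its filter folded into the projection.

```scala
// Hypothetical, simplified expression model; not Catalyst.
sealed trait Expr
final case class Col(name: String) extends Expr
final case class IfElseNull(condition: String, value: Expr) extends Expr

final case class DistinctAgg(fn: String, child: Expr, filter: Option[String])

object PhantomAttributeSketch {
  // For each aggregate, returns the projection that feeds the generated column
  // and the rewritten aggregate, which reads that column and carries no filter.
  def rewrite(aggs: Seq[DistinctAgg]): Seq[(Expr, DistinctAgg)] =
    aggs.zipWithIndex.map { case (agg, i) =>
      val fresh = Col(s"_gen_distinct_${i + 1}")
      val projection: Expr = agg.filter match {
        case Some(cond) => IfElseNull(cond, agg.child) // filter folded into the projection
        case None       => agg.child                   // unfiltered: pass the child through
      }
      (projection, agg.copy(child = fresh, filter = None))
    }
}
```

For `SUM(DISTINCT a), COUNT(DISTINCT a) FILTER (WHERE id > 1)`, this yields two different generated columns instead of two references to `a`.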


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369335821
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,64 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
 
 Review comment:
   OK
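
The rewritten plan in the quoted Scaladoc can be checked on small data with a plain-collections simulation. This is only a sketch: `DataRow`, `Expanded` and `ExpandRewriteSketch` are hypothetical names, `None` models SQL NULL, and the `If(id > 1, cat1, null)` projection is folded into the same step as the `Expand`, which the PR may well do in a separate step.

```scala
// Hypothetical input row for the `data` table in the Scaladoc example.
final case class DataRow(key: String, id: Int, cat1: String, cat2: String, value: Long)
// One row produced by Expand; d1 and d2 play the role of the generated distinct columns.
final case class Expanded(key: String, d1: Option[String], d2: Option[String],
                          gid: Int, value: Option[Long])

object ExpandRewriteSketch {
  // Direct evaluation: COUNT(DISTINCT cat1) FILTER (WHERE id > 1),
  // COUNT(DISTINCT cat2), SUM(value), grouped by key.
  def direct(rows: Seq[DataRow]): Map[String, (Int, Int, Long)] =
    rows.groupBy(_.key).map { case (k, rs) =>
      val c1 = rs.filter(_.id > 1).map(_.cat1).distinct.size
      val c2 = rs.map(_.cat2).distinct.size
      (k, (c1, c2, rs.map(_.value).sum))
    }

  def rewritten(rows: Seq[DataRow]): Map[String, (Int, Int, Long)] = {
    // Expand: gid 0 carries the regular aggregate input, gid 1 and 2 the distinct groups.
    val expanded = rows.flatMap { r =>
      val d1 = if (r.id > 1) Some(r.cat1) else None
      Seq(
        Expanded(r.key, None, None, 0, Some(r.value)),
        Expanded(r.key, d1, None, 1, None),
        Expanded(r.key, None, Some(r.cat2), 2, None))
    }
    // First aggregate: group by (key, d1, d2, gid), which de-duplicates, and sum(value).
    val firstAgg = expanded.groupBy(e => (e.key, e.d1, e.d2, e.gid)).map {
      case ((k, d1, d2, gid), es) => (k, d1, d2, gid, es.flatMap(_.value).sum)
    }.toSeq
    // Second aggregate: group by key and count the non-null values of each gid.
    firstAgg.groupBy(_._1).map { case (k, gs) =>
      val c1 = gs.count { case (_, d1, _, gid, _) => gid == 1 && d1.isDefined }
      val c2 = gs.count { case (_, _, d2, gid, _) => gid == 2 && d2.isDefined }
      val total = gs.collectFirst { case (_, _, _, 0, t) => t }.getOrElse(0L)
      (k, (c1, c2, total))
    }
  }
}
```

The two functions should return the same map for any input, mirroring the claim that the rewritten plan is equivalent to the original aggregate.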


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-576645591
 
 
   **[Test build #117179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117179/testReport)** for PR 27058 at commit [`7a6b019`](https://github.com/apache/spark/commit/7a6b0197c1f32f6ad0d843b688b8abcc97344935).


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579037098
 
 
   **[Test build #117456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117456/testReport)** for PR 27058 at commit [`5c80418`](https://github.com/apache/spark/commit/5c804182b498c0e804cb0c3157d1dbd31d182e24).


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572386876
 
 
   **[Test build #116345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116345/testReport)** for PR 27058 at commit [`bcca92d`](https://github.com/apache/spark/commit/bcca92d14d13531bd02ab9f9eb19afe001f8754e).


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575507740
 
 
   **[Test build #116911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116911/testReport)** for PR 27058 at commit [`52953ec`](https://github.com/apache/spark/commit/52953ec6b3dec4dd6d48105df2f1ef889cd9e75d).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579038043
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575519274
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367717874
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -331,6 +457,17 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     }
   }
 
+  /**
+   * Collect all aggregate expressions.
 
 Review comment:
   We don't need this comment; it says exactly the same thing as the function name below.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367736708
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
 
 Review comment:
   nit: `filter(_.children.exists(!_.foldable))`
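
The nit relies on Scala's placeholder syntax, where each underscore binds to the innermost enclosing function argument. A minimal sketch of why the two spellings are interchangeable, using hypothetical stand-in types (`FoldableExpr`, `SimpleAgg`) rather than Catalyst's real `Expression` and `AggregateExpression`:

```scala
// Hypothetical stand-ins for Catalyst expressions; only the fields used here.
final case class FoldableExpr(name: String, foldable: Boolean)
final case class SimpleAgg(name: String, children: Seq[FoldableExpr])

object FilterNitDemo {
  val aggs: Seq[SimpleAgg] = Seq(
    // All children are foldable (constants), so this one is dropped.
    SimpleAgg("count(1)", Seq(FoldableExpr("1", foldable = true))),
    // Has a non-foldable child, so this one is kept.
    SimpleAgg("sum(value)", Seq(FoldableExpr("value", foldable = false)))
  )

  // Spelling used in the diff under review.
  val explicitForm: Seq[SimpleAgg] =
    aggs.filter(e => e.children.exists(!_.foldable))

  // Spelling suggested in the review comment.
  val placeholderForm: Seq[SimpleAgg] =
    aggs.filter(_.children.exists(!_.foldable))
}
```

Both values contain the same elements; the suggestion only removes the named lambda parameter.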


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-581008791
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575466319
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572874663
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116434/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575499022
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508203
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21680/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575508247
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116911/
   Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575519987
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578625013
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578625016
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22193/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578629266
 
 
   **[Test build #117434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/117434/testReport)** for PR 27058 at commit [`7a74aae`](https://github.com/apache/spark/commit/7a74aae09f8f696102c5b92b850d572d64fd9cb1).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572874661
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520172
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116895/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572821261
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21222/
   Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367721603
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
 
 Review comment:
   If `expandAggregate` already has an expand (added by `expandDistinctAggregateWithFilter`) in a child, can we merge them inside `rewriteDistinctAggregate` (line 424)?
   https://github.com/apache/spark/pull/27058/files#r366138753


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367722603
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
 
 Review comment:
   nit:
   ```
     private def mayNeedtoRewrite(agg: Aggregate): Boolean = {
       val distinctAggs = collectAggregateExprs(agg).filter(_.isDistinct)
      ...
   ```
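
For context, the shape suggested above (a shared `collectAggregateExprs` helper feeding `mayNeedtoRewrite`) could look roughly like the sketch below. This is only an illustration over a hypothetical, simplified expression tree (`Node`, `Col`, `Named`, `AggCall`, `AggPlan`), not Catalyst's real classes, and it mirrors the pre-PR condition `distinctAggs.size > 1` shown in the diff:

```scala
// Hypothetical, simplified expression tree; Catalyst's real classes are richer.
sealed trait Node { def children: Seq[Node] }
final case class Col(name: String) extends Node {
  val children: Seq[Node] = Nil
}
final case class Named(child: Node, name: String) extends Node {
  val children: Seq[Node] = Seq(child)
}
final case class AggCall(fn: String, arg: Node, isDistinct: Boolean) extends Node {
  val children: Seq[Node] = Seq(arg)
}
final case class AggPlan(aggregateExpressions: Seq[Node])

object RewriteSketch {
  // Pre-order traversal that keeps every AggCall found in the tree.
  private def collectAggs(n: Node): Seq[AggCall] = {
    val here = n match {
      case ac: AggCall => Seq(ac)
      case _ => Nil
    }
    here ++ n.children.flatMap(collectAggs)
  }

  // Single place that gathers aggregate expressions from a plan node.
  def collectAggregateExprs(a: AggPlan): Seq[AggCall] =
    a.aggregateExpressions.flatMap(collectAggs)

  // Takes the plan node directly, as in the review suggestion.
  def mayNeedtoRewrite(a: AggPlan): Boolean =
    collectAggregateExprs(a).count(_.isDistinct) > 1
}
```

Passing the plan node instead of its expression list lets both the check and the rewrite reuse one collection helper.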


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573077769
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r366149894
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,6 +118,52 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
 
 Review comment:
   Because I want to reuse `rewrite`. The first `Expand` is created by `rewrite`.
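
As a reference point for what the rewritten plan has to compute, the third Scaladoc example can be evaluated directly on plain Scala collections. The sketch below is not the optimizer rule itself; the `DataRow` type and the sample rows are assumptions made up for illustration, and NULL handling is ignored by using non-null values only:

```scala
// Hypothetical sample rows for the `data` table in the Scaladoc example.
final case class DataRow(key: String, cat1: String, id: Int, value: Long)

object DistinctFilterSemantics {
  val data: Seq[DataRow] = Seq(
    DataRow("k1", "a", 1, 10L),
    DataRow("k1", "a", 2, 20L),
    DataRow("k1", "b", 1, 30L),
    DataRow("k2", "c", 3, 40L),
    DataRow("k2", "c", 4, 50L)
  )

  // key -> (cat1_cnt1, cat1_cnt2, total)
  val result: Map[String, (Int, Int, Long)] =
    data.groupBy(_.key).map { case (key, rows) =>
      // COUNT(DISTINCT cat1) FILTER (WHERE id > 1): filter rows first, then deduplicate.
      val cat1Cnt1 = rows.filter(_.id > 1).map(_.cat1).distinct.size
      // COUNT(DISTINCT cat1): deduplicate over all rows of the group.
      val cat1Cnt2 = rows.map(_.cat1).distinct.size
      // SUM(value): regular aggregate, unaffected by DISTINCT or FILTER.
      val total = rows.map(_.value).sum
      (key, (cat1Cnt1, cat1Cnt2, total))
    }
}
```

The point to notice is the order of operations: the filter is applied to the input rows before deduplication, which is why the filter column has to be evaluated before the distinct values are grouped.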


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575500747
 
 
   **[Test build #116907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116907/testReport)** for PR 27058 at commit [`0f69488`](https://github.com/apache/spark/commit/0f69488539a4161c5605ce592b683b753aea0f1b).


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575587787
 
 
   **[Test build #116939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116939/testReport)** for PR 27058 at commit [`5462c0c`](https://github.com/apache/spark/commit/5462c0ce605cf95e608c945e4ca8bff358740f50).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579088588
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117456/
   Test PASSed.


[GitHub] [spark] beliefer commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577579811
 
 
   retest this please


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-569911830
 
 
   **[Test build #115987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115987/testReport)** for PR 27058 at commit [`0008bae`](https://github.com/apache/spark/commit/0008bae5ff27a6766b9ceb3bb067eac2e8e322dc).


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368940820
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
 
 Review comment:
   So should I add a filter API to the DSL?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570140108
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572446217
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-569912064
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575498651
 
 
   **[Test build #116906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116906/testReport)** for PR 27058 at commit [`bdda81d`](https://github.com/apache/spark/commit/bdda81dbaceb0ac7d80ebe0f03dbf79623fc6e42).


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368849887
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,66 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat1_cnt2, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat1, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule serves two purposes:
+ * 1. Expand distinct aggregates which exists filter clause.
 
 Review comment:
   OK.
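
The rewritten plan in the doc comment quoted above can be mimicked with plain Scala collections. This is only a rough sketch under stated assumptions: it does not use Catalyst, the names `ExpandRewriteSketch`, `In`, `Expanded`, `Out`, `direct` and `rewrite` are hypothetical, `Option` stands in for SQL NULL, and input `cat1` values are assumed to be non-null.

```scala
// Illustrative sketch only: plain Scala collections, not Spark's Catalyst API.
object ExpandRewriteSketch {
  final case class In(key: String, id: Int, cat1: String, value: Long)
  final case class Expanded(
      key: String,
      d1: Option[String], // stands in for '_gen_distinct_1
      d2: Option[String], // stands in for '_gen_distinct_2
      gid: Int,
      value: Option[Long])
  final case class Out(cat1Cnt1: Int, cat1Cnt2: Int, total: Long)

  // Reference semantics of the original query, computed directly per key.
  def direct(rows: Seq[In]): Map[String, Out] =
    rows.groupBy(_.key).map { case (k, rs) =>
      k -> Out(
        rs.filter(_.id > 1).map(_.cat1).distinct.size, // COUNT(DISTINCT cat1) FILTER (WHERE id > 1)
        rs.map(_.cat1).distinct.size,                  // COUNT(DISTINCT cat1)
        rs.map(_.value).sum)                           // SUM(value)
    }

  def rewrite(rows: Seq[In]): Map[String, Out] = {
    // Inner Expand: evaluate the filter per row; rows failing it yield None.
    val projected = rows.map { r =>
      val d1: Option[String] = if (r.id > 1) Some(r.cat1) else None
      val d2: Option[String] = Some(r.cat1)
      (r.key, d1, d2, r.value)
    }
    // Outer Expand: one copy per group id (0 = regular, 1 and 2 = distinct groups).
    val expanded = projected.flatMap { case (k, d1, d2, v) =>
      Seq(
        Expanded(k, None, None, 0, Some(v)),
        Expanded(k, d1, None, 1, None),
        Expanded(k, None, d2, 2, None))
    }
    // First aggregate: grouping by (key, d1, d2, gid) deduplicates the distinct
    // values; the regular aggregate (sum) is computed within each group.
    val first = expanded
      .groupBy(e => (e.key, e.d1, e.d2, e.gid))
      .map { case ((k, d1, d2, gid), es) =>
        (k, d1, d2, gid, es.map(_.value.getOrElse(0L)).sum)
      }
      .toSeq
    // Second aggregate: count non-null distinct values per gid, take the gid = 0 total.
    first.groupBy(_._1).map { case (k, gs) =>
      val cnt1 = gs.count { case (_, d1, _, gid, _) => gid == 1 && d1.isDefined }
      val cnt2 = gs.count { case (_, _, d2, gid, _) => gid == 2 && d2.isDefined }
      val total = gs.collectFirst { case (_, _, _, 0, t) => t }.getOrElse(0L)
      k -> Out(cnt1, cnt2, total)
    }
  }
}
```

Under these assumptions both functions should agree on any input, which is the property the rewrite has to preserve.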


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575519987
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368989206
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
 
 Review comment:
   Yeah, I think it's OK to do so.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575466319
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367723884
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularOperatorMap = regularAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+          val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+          val filterOpt = filter.map(_.transform {
+            case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+          })
+          val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+          (ae, aggExpr)
+      }
+
+      // Setup expand for the distinct aggregate expressions.
+      val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val rewriteDistinctOperatorMap = distinctAggExprs.map {
+        case ae @ AggregateExpression(af, _, _, filter, _) =>
+          // Why do we need to construct the phantom id ?
+          // First, In order to reduce costs, it is better to handle the filter clause locally.
+          // e.g. COUNT (DISTINCT a) FILTER (WHERE id > 1), evaluate expression
+          // If(id > 1) 'a else null first, and use the result as output.
+          // Second, If more than one DISTINCT aggregate expression uses the same column,
+          // We need to construct the phantom attributes so as the output not lost.
+          // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+          // attribute '_gen_distinct-1 and attribute '_gen_distinct-2 instead of two 'a.
+          // Note: We just need to illusion the expression with filter clause.
+          // The illusionary mechanism may result in multiple distinct aggregations uses
+          // different column, so we still need to call `rewrite`.
 
 Review comment:
   Since this comment is stale, can you update/simplify it?


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367755294
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
 
 Review comment:
   OK


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573140754
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574784598
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367741302
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
     case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
   }
 
   def rewrite(a: Aggregate): Aggregate = {
+    val expandAggregate = expandDistinctAggregateWithFilter(a)
+    rewriteDistinctAggregate(expandAggregate)
+  }
 
-    // Collect all aggregate expressions.
-    val aggExpressions = a.aggregateExpressions.flatMap { e =>
-      e.collect {
-        case ae: AggregateExpression => ae
+  private def expandDistinctAggregateWithFilter(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions.
+      val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
 
 Review comment:
   `val regularAggChildren = regularAggExpressions.flatMap(_.references)`?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578620873
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/22191/
   Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368860880
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
+    val aggExpressions = collectAggregateExprs(a)
+    val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+    if (distinctAggExpressions.exists(_.filter.isDefined)) {
+      // Setup expand for the 'regular' aggregate expressions. Because we will construct a new
+      // aggregate, the children of the distinct aggregates will be changed to the generate
+      // ones, so we need creates new references to avoid collisions between distinct and
+      // regular aggregate children.
+      val regularAggExprs = regularAggExpressions.filter(_.children.exists(!_.foldable))
+      val regularFunChildren = regularAggExprs
+        .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+      val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+      val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+      val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+      val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+      val regularAggMap = regularAggExprs.map {
 
 Review comment:
   OK


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r365092774
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
 
 Review comment:
   See the discussion at
   https://github.com/apache/spark/pull/26656#discussion_r353717031
   To reduce cost, it is better to evaluate the filter conditions of DISTINCT aggregate expressions in the first (local) aggregate.
   If we use `AggregationIterator`, those filter conditions would instead be evaluated in the second or third aggregate.
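
The idea in this comment can be illustrated without Spark. The sketch below is only an approximation on plain Scala collections: `Student`, `sumDistinctFiltered` and `sumDistinctProjected` are hypothetical names, and `Option` stands in for SQL NULL. It mirrors the `If(filter.get, e, nullify(e))` projection from the diff: rows that fail the filter are nullified up front, so the DISTINCT aggregate that runs afterwards needs no filter of its own.

```scala
object FilterBeforeDistinct {
  // Hypothetical row type, modelled on the `student` table in the PR description.
  final case class Student(id: Int, sex: String)

  // Reference semantics: apply the FILTER, then DISTINCT, then SUM.
  // None models the SQL NULL that SUM returns when no row qualifies.
  def sumDistinctFiltered(rows: Seq[Student]): Option[Long] = {
    val ids = rows.filter(_.sex == "man").map(_.id).distinct
    if (ids.isEmpty) None else Some(ids.map(_.toLong).sum)
  }

  // Rewritten form: project `if (filter) id else null` first, then run a
  // filter-free SUM(DISTINCT ...) that skips nulls.
  def sumDistinctProjected(rows: Seq[Student]): Option[Long] = {
    val projected: Seq[Option[Int]] =
      rows.map(r => if (r.sex == "man") Some(r.id) else None)
    val ids = projected.flatten.distinct
    if (ids.isEmpty) None else Some(ids.map(_.toLong).sum)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Student(1, "man"), Student(1, "man"), Student(2, "woman"), Student(3, "man"))
    println(sumDistinctFiltered(rows))
    println(sumDistinctProjected(rows))
  }
}
```

Both functions agree on every input, which is why the filter can be pushed into the projection that feeds the first aggregate.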

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364515132
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -317,7 +363,74 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
     } else {
-      a
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
+            }
+            val exprAttrLookup = exprAttrs.toMap
+            val newChildren = af.children.map(c => exprAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = None)
+            // Expand projection
+            val projection = unfoldableChildren.map {
+              case e if filter.isDefined => If(filter.get, e, nullify(e))
+              case e => e
+            }
+            (projection, exprAttrs, (ae, aggExpr))
+        }
+        val rewriteDistinctAttrMap = rewriteDistinctOperatorMap.map(_._2).flatten
+        val distinctAggChildAttrs = rewriteDistinctAttrMap.map(_._2)
+        val allAggAttrs = (regularAggChildAttrMap.map(_._2) ++ distinctAggChildAttrs)
 
 Review comment:
   nit: you don't need `(` and `)`


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574784613
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116781/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577581549
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367736428
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -151,21 +197,101 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     // We need at least two distinct aggregates for this rule because aggregation
     // strategy can handle a single distinct group.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1
 
 Review comment:
   Great


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578751948
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578876419
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-579220859
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117472/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572864758
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21233/
   Test PASSed.


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367758175
 
 

 ##########
 File path: sql/core/src/test/resources/sql-tests/results/postgreSQL/groupingsets.sql.out
 ##########
 @@ -444,10 +444,29 @@ NULL	1
 
 
 -- !query 34
-select count(*) from gstest4 group by rollup(unhashable_col,unsortable_col)
+select ten, sum(distinct four) filter (where string(four) like '123') from onek a
+group by rollup(ten)
 -- !query 34 schema
-struct<count(1):bigint>
+struct<ten:int,sum(DISTINCT four) FILTER (WHERE CAST(four AS STRING) LIKE 123):bigint>
 -- !query 34 output
 
 Review comment:
   The same


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574963615
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116803/
   Test FAILed.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573040278
 
 
   **[Test build #116490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116490/testReport)** for PR 27058 at commit [`b425eb4`](https://github.com/apache/spark/commit/b425eb4eefecff3135af9bdc7386d9dfe109b1b1).


[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575675292
 
 
   **[Test build #116939 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116939/testReport)** for PR 27058 at commit [`5462c0c`](https://github.com/apache/spark/commit/5462c0ce605cf95e608c945e4ca8bff358740f50).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-577670237
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/117291/
   Test FAILed.


[GitHub] [spark] beliefer closed pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer closed pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058
 
 
   


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369306575
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +205,102 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // We need at least two distinct aggregates or a single distinct aggregate with a filter for
+    // this rule because aggregation strategy can handle a single distinct group without a filter.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
 
 Review comment:
   Shouldn't this be `distinctAggs.size > 1` rather than `distinctAggs.size >= 1`?
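
A small sketch of why the difference matters, using a hypothetical `DistinctAgg` stand-in rather than Spark's `AggregateExpression`. With `>= 1`, the `exists(_.filter.isDefined)` disjunct can never change the result, and a single distinct aggregate without a filter would also be sent through the rewrite, which contradicts the updated code comment in the diff.

```scala
object MayNeedRewriteSketch {
  // Hypothetical stand-in for a distinct aggregate expression; it only
  // records whether a FILTER clause is attached.
  final case class DistinctAgg(hasFilter: Boolean)

  // Condition as written in the diff under review.
  def asWritten(aggs: Seq[DistinctAgg]): Boolean =
    aggs.size >= 1 || aggs.exists(_.hasFilter)

  // Condition suggested by the review comment.
  def suggested(aggs: Seq[DistinctAgg]): Boolean =
    aggs.size > 1 || aggs.exists(_.hasFilter)

  def main(args: Array[String]): Unit = {
    val cases = Seq(
      "no distinct aggregate" -> Seq.empty[DistinctAgg],
      "one distinct, no filter" -> Seq(DistinctAgg(hasFilter = false)),
      "one distinct, with filter" -> Seq(DistinctAgg(hasFilter = true)),
      "two distinct, no filter" ->
        Seq(DistinctAgg(hasFilter = false), DistinctAgg(hasFilter = false)))
    cases.foreach { case (name, aggs) =>
      println(s"$name: asWritten=${asWritten(aggs)}, suggested=${suggested(aggs)}")
    }
  }
}
```

The two conditions differ only for a single distinct aggregate without a filter, which is the case the aggregation strategy can already handle on its own.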


[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369341486
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,64 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat2, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule consists of the two phases as follows.
+ *
+ * In the first phase, expands distinct aggregates which exists filter clause:
+ * 1. Guaranteed to compute filter clauses in the first aggregate locally.
+ * 2. The attributes referenced by different distinct aggregate expressions are likely to overlap,
+ *    and if no additional processing is performed, data loss will occur. To prevent this, we
+ *    generate new attributes and replace the original ones.
+ * 3. If we apply the first phase to distinct aggregate expressions which exists filter
+ *    clause, the aggregate after expand may have at least two distinct aggregates, so we need to
+ *    apply the second phase too. Please refer to the second phase for more details.
+ *
+ * In the second phase, rewrite when aggregate exists at least two distinct aggregates:
  * 1. Expand the data. There are three aggregation groups in this query:
 
 Review comment:
   The second example is not related to `RewriteDistinctAggregates`.
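
For readers following the third example in the Scaladoc above, here is a rough simulation of the rewritten plan on plain Scala collections. It is a sketch under stated assumptions, not the rule's implementation: `In`, `Expanded`, `Out` and `rewrite` are hypothetical names, `Option` stands in for SQL NULL, and the two `Expand` nodes of the pseudo plan are fused into one step for brevity.

```scala
object ExpandRewriteSketch {
  // Hypothetical input row for the `data` table in the Scaladoc example.
  final case class In(key: String, id: Int, cat1: String, cat2: String, value: Long)
  // One row of the Expand output: distinct children, group id, regular child.
  final case class Expanded(
      key: String, d1: Option[String], d2: Option[String], gid: Int, value: Option[Long])
  final case class Out(key: String, cat1Cnt: Long, cat2Cnt: Long, total: Long)

  def rewrite(rows: Seq[In]): Map[String, Out] = {
    // Expand: apply the filter as `if (id > 1) cat1 else null`, then emit one
    // row per aggregation group (gid 0 = regular, 1 = cat1, 2 = cat2).
    val expanded = rows.flatMap { r =>
      val d1 = if (r.id > 1) Some(r.cat1) else None
      val d2 = Some(r.cat2)
      Seq(
        Expanded(r.key, None, None, 0, Some(r.value)),
        Expanded(r.key, d1, None, 1, None),
        Expanded(r.key, None, d2, 2, None))
    }
    // First aggregate: group by (key, d1, d2, gid). This de-duplicates the
    // distinct children and computes the partial sum for gid 0.
    val first = expanded
      .groupBy(e => (e.key, e.d1, e.d2, e.gid))
      .map { case ((key, d1, d2, gid), es) => (key, d1, d2, gid, es.flatMap(_.value).sum) }
      .toSeq
    // Second aggregate: group by key and pick each result from its own gid.
    first.groupBy(_._1).map { case (key, gs) =>
      val cat1Cnt = gs.count(g => g._4 == 1 && g._2.isDefined).toLong
      val cat2Cnt = gs.count(g => g._4 == 2 && g._3.isDefined).toLong
      val total = gs.collectFirst { case g if g._4 == 0 => g._5 }.getOrElse(0L)
      key -> Out(key, cat1Cnt, cat2Cnt, total)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      In("a", 1, "x", "p", 10L), In("a", 2, "y", "p", 20L),
      In("a", 3, "y", "q", 30L), In("b", 1, "z", "r", 5L))
    rewrite(data).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because the filter is evaluated inside the expansion, the counts in the second aggregate only need to skip nulls, which matches the `count(if (('gid = 1)) '_gen_distinct_1 else null)` shape in the pseudo plan.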


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r371574438
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
 ##########
 @@ -165,20 +165,35 @@ package object dsl {
     }
 
     def rand(e: Long): Expression = Rand(e)
-    def sum(e: Expression): Expression = Sum(e).toAggregateExpression()
-    def sumDistinct(e: Expression): Expression = Sum(e).toAggregateExpression(isDistinct = true)
-    def count(e: Expression): Expression = Count(e).toAggregateExpression()
+    def sum(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(false, filter)
+    def sumDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(true, filter)
+    def count(e: Expression, filter: Option[Expression] = None): Expression =
+      Count(e).toAggregateExpression(false, filter)
     def countDistinct(e: Expression*): Expression =
       Count(e).toAggregateExpression(isDistinct = true)
-    def approxCountDistinct(e: Expression, rsd: Double = 0.05): Expression =
-      HyperLogLogPlusPlus(e, rsd).toAggregateExpression()
-    def avg(e: Expression): Expression = Average(e).toAggregateExpression()
-    def first(e: Expression): Expression = new First(e).toAggregateExpression()
-    def last(e: Expression): Expression = new Last(e).toAggregateExpression()
-    def min(e: Expression): Expression = Min(e).toAggregateExpression()
-    def minDistinct(e: Expression): Expression = Min(e).toAggregateExpression(isDistinct = true)
-    def max(e: Expression): Expression = Max(e).toAggregateExpression()
-    def maxDistinct(e: Expression): Expression = Max(e).toAggregateExpression(isDistinct = true)
+    def countDistinct(filter: Option[Expression], e: Expression*): Expression =
+      Count(e).toAggregateExpression(true, filter)
+    def approxCountDistinct(
+        e: Expression,
+        rsd: Double = 0.05,
+        filter: Option[Expression] = None): Expression =
+      HyperLogLogPlusPlus(e, rsd).toAggregateExpression(true, filter)
+    def avg(e: Expression, filter: Option[Expression] = None): Expression =
+      Average(e).toAggregateExpression(false, filter)
+    def first(e: Expression, filter: Option[Expression] = None): Expression =
+      new First(e).toAggregateExpression(false, filter)
+    def last(e: Expression, filter: Option[Expression] = None): Expression =
+      new Last(e).toAggregateExpression(false, filter)
+    def min(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(false, filter)
+    def minDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(true, filter)
+    def max(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(false, filter)
+    def maxDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(true, filter)
     def upper(e: Expression): Expression = Upper(e)
 
 Review comment:
   Are the changes above related to `RewriteDistinctAggregatesSuite`? If not, we don't need to change this file.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578625013
 
 
   Merged build finished. Test PASSed.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r367758738
 
 

 ##########
 File path: sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out
 ##########
 @@ -95,41 +95,73 @@ struct<count(id) FILTER (WHERE (date_format(CAST(hiredate AS TIMESTAMP), yyyy-MM
 
 
 -- !query 9
-SELECT a, COUNT(b) FILTER (WHERE a >= 2) FROM testData GROUP BY a
+SELECT COUNT(DISTINCT id) FILTER (WHERE date_format(hiredate, "yyyy-MM-dd HH:mm:ss") = "2001-01-01 00:00:00") FROM emp
 -- !query 9 schema
-struct<a:int,count(b) FILTER (WHERE (a >= 2)):bigint>
+struct<count(DISTINCT id) FILTER (WHERE (date_format(CAST(hiredate AS TIMESTAMP), yyyy-MM-dd HH:mm:ss) = 2001-01-01 00:00:00)):bigint>
 -- !query 9 output
+2
 
 Review comment:
   I can't access this URL.

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364236488
 
 

 ##########
 File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ##########
 @@ -192,8 +192,10 @@ class AnalysisErrorSuite extends AnalysisTest {
 
   errorTest(
     "DISTINCT and FILTER cannot be used in aggregate functions at the same time",
-    CatalystSqlParser.parsePlan("SELECT count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
-    "DISTINCT and FILTER cannot be used in aggregate functions at the same time" :: Nil)
+    CatalystSqlParser.parsePlan("SELECT sum(DISTINCT b), " +
+      "count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
+    "When there are multiple DISTINCT aggregate expressions acting on different " +
+    "fields, any DISTINCT aggregate expression not allow use FILTER clause." :: Nil)
 
 Review comment:
   Can you add some tests for the foldable case?

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r366155166
 
 

 ##########
 File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ##########
 @@ -192,8 +192,10 @@ class AnalysisErrorSuite extends AnalysisTest {
 
   errorTest(
     "DISTINCT and FILTER cannot be used in aggregate functions at the same time",
-    CatalystSqlParser.parsePlan("SELECT count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
-    "DISTINCT and FILTER cannot be used in aggregate functions at the same time" :: Nil)
+    CatalystSqlParser.parsePlan("SELECT sum(DISTINCT b), " +
+      "count(DISTINCT a) FILTER (WHERE c > 1) FROM TaBlE2"),
+    "When there are multiple DISTINCT aggregate expressions acting on different " +
+    "fields, any DISTINCT aggregate expression not allow use FILTER clause." :: Nil)
 
 Review comment:
   Is `emp_name` foldable? I meant a query like `select key, sum(distinct value), avg(distinct 1) ...`
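
To make the distinction concrete: in Catalyst an expression is foldable when it can be evaluated to a constant without reading any input row, so a column reference is not foldable while the literal `1` in `avg(distinct 1)` is. The sketch below is a toy model in plain Scala, not Catalyst itself; `Expr`, `Col`, `Lit`, `Add` and `FoldableDemo` are hypothetical names used only for illustration, and the `1 + 2` child is an extra case that does not appear in the thread.

```scala
// Toy model of foldability (hypothetical types, not Catalyst's classes).
sealed trait Expr {
  // True when the expression can be evaluated without reading an input row.
  def foldable: Boolean
}
final case class Col(name: String) extends Expr {
  def foldable: Boolean = false
}
final case class Lit(value: Int) extends Expr {
  def foldable: Boolean = true
}
final case class Add(left: Expr, right: Expr) extends Expr {
  // A compound expression is foldable only if all of its inputs are.
  def foldable: Boolean = left.foldable && right.foldable
}

object FoldableDemo {
  // Children of sum(distinct value), avg(distinct 1) and an illustrative avg(distinct 1 + 2).
  val children: Seq[Expr] = Seq(Col("value"), Lit(1), Add(Lit(1), Lit(2)))

  // Keep only the children that actually depend on the input.
  val nonFoldable: Seq[Expr] = children.filterNot(_.foldable)
}
```

Only `Col("value")` survives the filter, which is the kind of case a foldable test would need to exercise.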

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364537516
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,6 +118,52 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat2_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) 'phantom1 else null),
 
 Review comment:
   You can refer to https://github.com/apache/spark/blob/a4fd143fa92cb569764b669e5c53f6dfd1f1ae6a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala#L336

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574620588
 
 
   **[Test build #116781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116781/testReport)** for PR 27058 at commit [`aa37870`](https://github.com/apache/spark/commit/aa37870ec36a8bf182c360a37c46d138d267f8a4).

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r371574438
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
 ##########
 @@ -165,20 +165,35 @@ package object dsl {
     }
 
     def rand(e: Long): Expression = Rand(e)
-    def sum(e: Expression): Expression = Sum(e).toAggregateExpression()
-    def sumDistinct(e: Expression): Expression = Sum(e).toAggregateExpression(isDistinct = true)
-    def count(e: Expression): Expression = Count(e).toAggregateExpression()
+    def sum(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(false, filter)
+    def sumDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Sum(e).toAggregateExpression(true, filter)
+    def count(e: Expression, filter: Option[Expression] = None): Expression =
+      Count(e).toAggregateExpression(false, filter)
     def countDistinct(e: Expression*): Expression =
       Count(e).toAggregateExpression(isDistinct = true)
-    def approxCountDistinct(e: Expression, rsd: Double = 0.05): Expression =
-      HyperLogLogPlusPlus(e, rsd).toAggregateExpression()
-    def avg(e: Expression): Expression = Average(e).toAggregateExpression()
-    def first(e: Expression): Expression = new First(e).toAggregateExpression()
-    def last(e: Expression): Expression = new Last(e).toAggregateExpression()
-    def min(e: Expression): Expression = Min(e).toAggregateExpression()
-    def minDistinct(e: Expression): Expression = Min(e).toAggregateExpression(isDistinct = true)
-    def max(e: Expression): Expression = Max(e).toAggregateExpression()
-    def maxDistinct(e: Expression): Expression = Max(e).toAggregateExpression(isDistinct = true)
+    def countDistinct(filter: Option[Expression], e: Expression*): Expression =
+      Count(e).toAggregateExpression(true, filter)
+    def approxCountDistinct(
+        e: Expression,
+        rsd: Double = 0.05,
+        filter: Option[Expression] = None): Expression =
+      HyperLogLogPlusPlus(e, rsd).toAggregateExpression(true, filter)
+    def avg(e: Expression, filter: Option[Expression] = None): Expression =
+      Average(e).toAggregateExpression(false, filter)
+    def first(e: Expression, filter: Option[Expression] = None): Expression =
+      new First(e).toAggregateExpression(false, filter)
+    def last(e: Expression, filter: Option[Expression] = None): Expression =
+      new Last(e).toAggregateExpression(false, filter)
+    def min(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(false, filter)
+    def minDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Min(e).toAggregateExpression(true, filter)
+    def max(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(false, filter)
+    def maxDistinct(e: Expression, filter: Option[Expression] = None): Expression =
+      Max(e).toAggregateExpression(true, filter)
     def upper(e: Expression): Expression = Upper(e)
 
 Review comment:
   Are the changes above related to `RewriteDistinctAggregatesSuite`? If not, we don't need to change this file in this PR.
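
For context, the pattern in this diff is to thread an optional filter through each dsl helper with a default of `None`, so that existing call sites compile unchanged. A self-contained sketch of that pattern follows; `AggExpr`, `MiniDsl` and the string-typed arguments are simplified stand-ins, not the real `AggregateExpression` or `Expression` types.

```scala
// Simplified stand-in for an aggregate expression (hypothetical; not Catalyst's class).
final case class AggExpr(
    function: String,
    child: String,
    isDistinct: Boolean,
    filter: Option[String])

object MiniDsl {
  // The filter defaults to None, so callers that never pass it keep working.
  def sum(e: String, filter: Option[String] = None): AggExpr =
    AggExpr("sum", e, isDistinct = false, filter)

  def sumDistinct(e: String, filter: Option[String] = None): AggExpr =
    AggExpr("sum", e, isDistinct = true, filter)
}

object MiniDslDemo {
  // Old-style call site: no filter argument.
  val plain: AggExpr = MiniDsl.sum("value")
  // New-style call site: DISTINCT together with a filter.
  val filtered: AggExpr = MiniDsl.sumDistinct("id", Some("sex = 'man'"))
}
```

Scala does not allow default arguments in a parameter section that contains a varargs parameter, which is presumably why the diff adds a separate `countDistinct(filter, e*)` overload instead of a defaulted parameter.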

[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573140773
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/116491/

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r369336741
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,64 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat2, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule consists of the two phases as follows.
+ *
+ * In the first phase, expands distinct aggregates which exists filter clause:
 
 Review comment:
   Sorry! I get it.
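
The rewritten plan in the Scaladoc above can be followed on a tiny data set. The sketch below mimics the second `Expand` and the two `Aggregate` nodes with plain Scala collections; it is an illustration under simplifying assumptions (a single key, `Option` standing in for SQL NULL), and `In`, `Expanded` and `ExpandDemo` are hypothetical names, not part of Spark.

```scala
// Input row and the row shape produced by Expand (hypothetical names; Option models SQL NULL).
final case class In(key: String, d1: Option[String], d2: Option[String], value: Long)
final case class Expanded(
    key: String, d1: Option[String], d2: Option[String], gid: Int, value: Option[Long])

object ExpandDemo {
  // d1 already carries the filter: it is None where `id > 1` was false.
  val input: Seq[In] = Seq(
    In("k", None, Some("p"), 10L),
    In("k", Some("x"), Some("p"), 20L),
    In("k", Some("x"), Some("q"), 30L))

  // Expand: one copy per group; gid 0 is the regular aggregate, 1 and 2 the distinct groups.
  val expanded: Seq[Expanded] = input.flatMap { r =>
    Seq(
      Expanded(r.key, None, None, 0, Some(r.value)),
      Expanded(r.key, r.d1, None, 1, None),
      Expanded(r.key, None, r.d2, 2, None))
  }

  // First aggregate: grouping by (key, d1, d2, gid) de-duplicates the distinct values,
  // and sum(value) is non-null only for the gid 0 rows.
  val firstAgg: Seq[Expanded] = expanded
    .groupBy(e => (e.key, e.d1, e.d2, e.gid))
    .map { case ((k, d1, d2, gid), rows) =>
      val vs = rows.flatMap(_.value)
      Expanded(k, d1, d2, gid, if (vs.isEmpty) None else Some(vs.sum))
    }
    .toSeq

  // Second aggregate (single key here): pick each result from the rows of its own gid.
  val cat1Cnt: Int = firstAgg.count(e => e.gid == 1 && e.d1.isDefined)
  val cat2Cnt: Int = firstAgg.count(e => e.gid == 2 && e.d2.isDefined)
  val total: Option[Long] = firstAgg.collectFirst { case e if e.gid == 0 => e.value }.flatten
}
```

The counts skip null values, as `count` does in the plan, and the total is read from the gid 0 group only, as `first(...) ignore nulls` does.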

[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-575520166
 
 
   Merged build finished. Test FAILed.

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368849887
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,7 +118,66 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat1_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                   count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat1_cnt2, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *        functions = [sum('value)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *       Expand(
+ *           projections = [('key, null, null, 0, 'value),
+ *                         ('key, '_gen_distinct_1, null, 1, null),
+ *                         ('key, null, '_gen_distinct_2, 2, null)]
+ *           output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *         Expand(
+ *            projections = [('key, if ('id > 1) 'cat1 else null, 'cat1, cast('value as bigint))]
+ *            output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *           LocalTableScan [...]
+ * }}}
+ *
+ * The rule serves two purposes:
+ * 1. Expand distinct aggregates which exists filter clause.
 
 Review comment:
   OK. Looks better.
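
The first purpose listed above, expanding distinct aggregates that carry a filter clause, relies on folding the filter into a projection: `if ('id > 1) 'cat1 else null` is projected first and the distinct count then ignores the nulls. The sketch below checks that equivalence on plain Scala collections; `Row`, `FilterProjectionDemo` and the sample data are hypothetical, and the equivalence is only shown for an aggregate such as `COUNT` that skips nulls.

```scala
// Plain Scala collections standing in for a table (hypothetical data).
final case class Row(key: String, id: Int, cat1: String)

object FilterProjectionDemo {
  val data: Seq[Row] = Seq(
    Row("a", 1, "x"),
    Row("a", 2, "x"),
    Row("a", 3, "y"),
    Row("a", 0, "z"))

  // COUNT(DISTINCT cat1) FILTER (WHERE id > 1): filter rows first, then count distinct values.
  val direct: Int = data.filter(_.id > 1).map(_.cat1).distinct.size

  // Rewritten form: project `if (id > 1) cat1 else null`, then count distinct non-null values.
  // Option models SQL NULL here; flatten drops the nulls, as COUNT does.
  val projected: Seq[Option[String]] =
    data.map(r => if (r.id > 1) Some(r.cat1) else None)
  val viaProjection: Int = projected.flatten.distinct.size
}
```

Both routes count the same two values (`x` and `y`), because the rows that fail the filter contribute only nulls to the projected column.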

[GitHub] [spark] SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-572864514
 
 
   **[Test build #116445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116445/testReport)** for PR 27058 at commit [`dd09645`](https://github.com/apache/spark/commit/dd096453c640335fe7af9f8b89eac296ee54615c).

[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r364537771
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -118,6 +118,52 @@ import org.apache.spark.sql.types.IntegerType
  *       LocalTableScan [...]
  * }}}
  *
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt1,
+ *     COUNT(DISTINCT cat1) as cat1_cnt2,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt1, 'cat2_cnt2, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) 'phantom1 else null),
 
 Review comment:
   Yea, I got it now. Thanks.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570140195
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/20809/

[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-574784598
 
 
   Merged build finished. Test FAILed.

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#discussion_r368859956
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -148,24 +207,106 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
     val distinctAggs = exprs.flatMap { _.collect {
       case ae: AggregateExpression if ae.isDistinct => ae
     }}
-    // We need at least two distinct aggregates for this rule because aggregation
-    // strategy can handle a single distinct group.
+    // This rule serves two purposes:
+    // One is to rewrite when there exists at least two distinct aggregates. We need at least
+    // two distinct aggregates for this rule because aggregation strategy can handle a single
+    // distinct group.
+    // Another is to expand distinct aggregates which exists filter clause so that we can
+    // evaluate the filter locally.
     // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)
   }
 
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) => rewrite(a)
+    case a: Aggregate if mayNeedtoRewrite(a.aggregateExpressions) =>
+      val expandAggregate = extractFiltersInDistinctAggregate(a)
+      rewriteDistinctAggregate(expandAggregate)
   }
 
-  def rewrite(a: Aggregate): Aggregate = {
+  private def extractFiltersInDistinctAggregate(a: Aggregate): Aggregate = {
 
 Review comment:
   For the first suggestion, do you mean I should add a new API to the dsl? The dsl doesn't support the filter clause yet.
   For the second suggestion, OK.
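
A side note on the condition in this hunk: `exists` can only return true for a non-empty collection, so `distinctAggs.size >= 1 || distinctAggs.exists(_.filter.isDefined)` reduces to `distinctAggs.nonEmpty`. The sketch below checks that on a toy model. `DistinctAgg` and `RewriteCheckDemo` are hypothetical stand-ins, and `perComment` is only one possible reading of the code comment (more than one distinct aggregate, or any filter), not something the thread confirms.

```scala
// Minimal stand-in for a distinct aggregate (hypothetical; only the filter matters here).
final case class DistinctAgg(filter: Option[String])

object RewriteCheckDemo {
  // Condition as written in the diff.
  def asWritten(aggs: Seq[DistinctAgg]): Boolean =
    aggs.size >= 1 || aggs.exists(_.filter.isDefined)

  // `exists` is never true on an empty Seq, so the second operand is implied by the first.
  def simplified(aggs: Seq[DistinctAgg]): Boolean = aggs.nonEmpty

  // One reading of the comment: at least two distinct aggregates, or any filter clause.
  def perComment(aggs: Seq[DistinctAgg]): Boolean =
    aggs.size > 1 || aggs.exists(_.filter.isDefined)

  val samples: Seq[Seq[DistinctAgg]] = Seq(
    Seq.empty,
    Seq(DistinctAgg(None)),
    Seq(DistinctAgg(Some("id > 1"))),
    Seq(DistinctAgg(None), DistinctAgg(None)))
}
```

The two readings differ only for a single distinct aggregate without a filter: `asWritten` selects it for rewriting, `perComment` does not.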

[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-573040878
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/21277/

[GitHub] [spark] beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

beliefer commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r365107970
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
         }.asInstanceOf[NamedExpression]
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
+    } else if (distinctAggGroups.size == 1) {
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            // Why do we need to construct the phantom id?
+            // First, to reduce cost, it is better to handle the filter clause locally.
+            // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
+            // If(id > 1) 'a else null, and use the result as the output.
+            // Second, if more than one DISTINCT aggregate expression uses the same column,
+            // we need to construct phantom attributes so that no output is lost.
+            // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+            // attribute 'phantom1-a and attribute 'phantom2-a instead of two 'a.
+            // Note: we only need to apply this illusion to expressions with a filter clause.
+            // The phantom mechanism may result in multiple distinct aggregations that use
+            // different columns, so we still need to call `rewrite`.
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
 
 Review comment:
   OK.


[GitHub] [spark] SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570139949
 
 
   **[Test build #116017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116017/testReport)** for PR 27058 at commit [`a4fd143`](https://github.com/apache/spark/commit/a4fd143fa92cb569764b669e5c53f6dfd1f1ae6a).


[GitHub] [spark] AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#issuecomment-570184651
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #27058: [SPARK-30395][SQL] When one or more DISTINCT aggregate expressions operate on the same field, the DISTINCT aggregate expression allows the use of the FILTER clause
URL: https://github.com/apache/spark/pull/27058#discussion_r365112646
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
 ##########
 @@ -316,6 +362,86 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] {
         }.asInstanceOf[NamedExpression]
       }
       Aggregate(groupByAttrs, patchedAggExpressions, firstAggregate)
+    } else if (distinctAggGroups.size == 1) {
+      val (distinctAggExpressions, regularAggExpressions) = aggExpressions.partition(_.isDistinct)
+      if (distinctAggExpressions.exists(_.filter.isDefined)) {
+        val regularAggExprs = regularAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val regularFunChildren = regularAggExprs
+          .flatMap(_.aggregateFunction.children.filter(!_.foldable))
+        val regularFilterAttrs = regularAggExprs.flatMap(_.filterAttributes)
+        val regularAggChildren = (regularFunChildren ++ regularFilterAttrs).distinct
+        val regularAggChildAttrMap = regularAggChildren.map(expressionAttributePair)
+        val regularAggChildAttrLookup = regularAggChildAttrMap.toMap
+        val regularOperatorMap = regularAggExprs.map {
+          case ae @ AggregateExpression(af, _, _, filter, _) =>
+            val newChildren = af.children.map(c => regularAggChildAttrLookup.getOrElse(c, c))
+            val raf = af.withNewChildren(newChildren).asInstanceOf[AggregateFunction]
+            val filterOpt = filter.map(_.transform {
+              case a: Attribute => regularAggChildAttrLookup.getOrElse(a, a)
+            })
+            val aggExpr = ae.copy(aggregateFunction = raf, filter = filterOpt)
+            (ae, aggExpr)
+        }
+        val distinctAggExprs = distinctAggExpressions.filter(e => e.children.exists(!_.foldable))
+        val rewriteDistinctOperatorMap = distinctAggExprs.zipWithIndex.map {
+          case (ae @ AggregateExpression(af, _, _, filter, _), i) =>
+            // Why do we need to construct the phantom id?
+            // First, to reduce cost, it is better to handle the filter clause locally.
+            // e.g. for COUNT (DISTINCT a) FILTER (WHERE id > 1), first evaluate the expression
+            // If(id > 1) 'a else null, and use the result as the output.
+            // Second, if more than one DISTINCT aggregate expression uses the same column,
+            // we need to construct phantom attributes so that no output is lost.
+            // e.g. SUM (DISTINCT a), COUNT (DISTINCT a) FILTER (WHERE id > 1) will output
+            // attribute 'phantom1-a and attribute 'phantom2-a instead of two 'a.
+            // Note: we only need to apply this illusion to expressions with a filter clause.
+            // The phantom mechanism may result in multiple distinct aggregations that use
+            // different columns, so we still need to call `rewrite`.
+            val phantomId = i + 1
+            val unfoldableChildren = af.children.filter(!_.foldable)
+            val exprAttrs = unfoldableChildren.map { e =>
+              (e, AttributeReference(s"phantom$phantomId-${e.sql}", e.dataType, nullable = true)())
 
 Review comment:
   Since this name is shown in the EXPLAIN output, I think a more descriptive name would be better. Why `phantom`? I think terms commonly used in Spark would be better.
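
   For readers following the quoted code comment, here is a minimal plain-Scala sketch of the idea it describes: evaluate the FILTER predicate per row first (producing the value or NULL), then apply DISTINCT and the aggregate to the projected column. This is not Spark's implementation and does not use Catalyst; `Student`, `DistinctFilterSketch`, `sumDistinct`, and `sumDistinctFiltered` are hypothetical names introduced only for illustration, and `Option` stands in for SQL NULL.

```scala
// Hypothetical stand-in for a row of the `student` table; not Spark's Row type.
final case class Student(id: Int, sex: String, classId: Int)

object DistinctFilterSketch {
  // SUM(DISTINCT id): de-duplicate the column, then sum. Empty input yields None (NULL).
  def sumDistinct(rows: Seq[Student]): Option[Int] = {
    val distinctIds = rows.map(_.id).distinct
    if (distinctIds.isEmpty) None else Some(distinctIds.sum)
  }

  // SUM(DISTINCT id) FILTER (WHERE pred), handled "locally" in two steps:
  // 1. project each row to Some(id) if the predicate holds, else None (like If(pred) id else null);
  // 2. drop the NULLs, de-duplicate, and sum what is left.
  def sumDistinctFiltered(rows: Seq[Student])(pred: Student => Boolean): Option[Int] = {
    val projected: Seq[Option[Int]] = rows.map(r => if (pred(r)) Some(r.id) else None)
    val distinctIds = projected.flatten.distinct
    if (distinctIds.isEmpty) None else Some(distinctIds.sum)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Student(1, "man", 1),
      Student(1, "man", 1), // duplicate id, counted once
      Student(2, "woman", 1),
      Student(3, "man", 2))

    // The two aggregates read the same source column but need separate projected
    // columns, which is the role the differently named attributes play in the diff.
    println(sumDistinct(rows))                          // Some(6)
    println(sumDistinctFiltered(rows)(_.sex == "man"))  // Some(4)
  }
}
```

   Because the unfiltered and filtered aggregates see different projected values for the same source column, each needs its own output attribute, whatever prefix is finally chosen for the name.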


[GitHub] [spark] AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27058: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
URL: https://github.com/apache/spark/pull/27058#issuecomment-578699422
 
 
   Merged build finished. Test FAILed.
