You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/20 07:14:57 UTC

[GitHub] [spark] viirya opened a new pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

viirya opened a new pull request #28876:
URL: https://github.com/apache/spark/pull/28876


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   This patch applies `NormalizeFloatingNumbers` to distinct aggregate to fix a regression of distinct aggregate on NaNs.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   We added `NormalizeFloatingNumbers` optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate so causes a regression. We need to apply this rule on distinct aggregate to fix it.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   Yes, fixing a regression of distinct aggregate on NaNs.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Added unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443172691



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.

Review comment:
       nit: maybe `groupingExpressions ` -> `distinctExpressions`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647072406






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following.
   ```scala
           val (functionsWithDistinct, functionsWithoutDistinct) =
             aggregateExpressions.partition(_.isDistinct)
           if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
             // This is a sanity check. We should not reach here when we have multiple distinct
             // column sets. Our `RewriteDistinctAggregates` should take care this case.
             sys.error("You hit a query analyzer bug. Please report your query to " +
                 "Spark user mailing list.")
           }
   ```
   
   And the very next line is the same logic block for `groupingExpression`.
   ```scala
           // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
           // `groupingExpressions` is not extracted during logical phase.
           val normalizedGroupingExpressions = groupingExpressions.map { e =>
             NormalizeFloatingNumbers.normalize(e) match {
               case n: NamedExpression => n
               case other => Alias(other, e.name)(exprId = e.exprId)
             }
           }
   ```
   
   Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And, the following comment is about the definition of `functionsWithDistinct` which is generated from the above. The comment is not a detail hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`.
   ```
   // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
     // DISTINCT aggregate function, all of those functions will have the same column expressions.
     // For example, it would be valid for functionsWithDistinct to be
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647013561


   cc @cloud-fan @dongjoon-hyun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] abellina commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
abellina commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443183186



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>

Review comment:
       Thanks for taking the time @viirya. I am not 100% sure when all the cases that need named expression, but that the physical node output expressions need to be named, makes sense to me. Seems like any downstream node that needs to refer to an output needs things like `ExprId` in order to distinguish fields.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443221086



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

Review comment:
       ok. :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298986



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate
 
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers

Review comment:
       yea. :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443297250



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443164908



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>

Review comment:
       If you are questioning about why we need to have named expressions here, I think it is because we need these distinct expressions to be in the result expressions in Aggregate physical operator. These result expressions are for the output attributes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298402



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       Moved to `SparkStrategies`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646984991






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443172768



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>

Review comment:
       @viirya, out of curiosity, can't we do this in the same place as https://github.com/apache/spark/blob/17586f9ed25cbb1dd7dbb221aecb8749623e2175/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L434-L441?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646956822






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443219073



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

Review comment:
       No, just copied from the original report on the jira. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443163113



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because

Review comment:
       +1 for @abellina 's question.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443266734



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       This works correctly, @viirya . However, as an architectural perspective, I'm wondering if we can collect these `NormalizeFloatingNumbers.normalize` application in `SparkStrategies` instead of putting `AggUtils`. Actually, in `SparkStragegies`, we have the same logic for `normalizedGroupingExpressions` and missed this conversion at `functionsWithDistinct`. IMO, it would be easier if we keep this converting in one place, `SparkStragegies`, in order to avoid missing corner cases like this.
   
   This is just a suggestion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443220017



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

Review comment:
       Let's rename it. It's actually a bit odds to have a constant-like name




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647330963


   **[Test build #124354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124354/testReport)** for PR 28876 at commit [`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443172995



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)
+
+      Seq(("mithunr", Float.NaN),
+        ("mithunr", POS_NAN_1),
+        ("mithunr", POS_NAN_2),
+        ("abellina", 1.0f),
+        ("abellina", 2.0f)).toDF("uid", "score").createOrReplaceTempView("view")
+
+      val df = spark.sql(" select uid, count(distinct score) from view group by 1 order by 1 asc ")

Review comment:
       Sorry tiny nit: the space in the query string.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646956822






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647131600






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443266734



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       This works correctly, @viirya . However, as an architectural perspective, I'm wondering if we can collect these `NormalizeFloatingNumbers.normalize` application in `SparkStrategies` instead of putting `AggUtils`. Actually, in `SparkStragegies`, we have the same logic for `normalizedGroupingExpressions` and missed this conversion at `functionsWithDistinct`. IMO, it would be easier if we keep this converting in one place, `SparkStragegies`, in order to avoid missing corner cases like this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646956695


   **[Test build #124316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124316/testReport)** for PR 28876 at commit [`d11001c`](https://github.com/apache/spark/commit/d11001cae9382a75bb42edc1e4925325704a5ba9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] abellina commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
abellina commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443159462



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>

Review comment:
       ```suggestion
       val normalizedNamedDistinctExpressions = namedDistinctExpressions.map { e =>
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because

Review comment:
       could the normalized distinct expressions be combined with the `namedDistinctExpressions` above (so there's a single `match` block)?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>

Review comment:
       I have a basic catalyst question and feel free to send me away. The question is what about being "named" is a requirement in this case. I bet it has to do with expression binding, but I am not entirely sure, and was wondering if you had that answer since you had to special case it here.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
+      NormalizeFloatingNumbers.normalize(e) match {
+        case n: NamedExpression => n
+        case other => Alias(other, e.name)(exprId = e.exprId)
+      }
+    }
+    val distinctAttributes = normalizednamedDistinctExpressions.map(_.toAttribute)

Review comment:
       ```suggestion
       val distinctAttributes = normalizedNamedDistinctExpressions.map(_.toAttribute)
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -159,7 +168,7 @@ object AggUtils {
       // DISTINCT column. For example, for AVG(DISTINCT value) GROUP BY key, the grouping
       // expressions will be [key, value].
       createAggregate(
-        groupingExpressions = groupingExpressions ++ namedDistinctExpressions,
+        groupingExpressions = groupingExpressions ++ normalizednamedDistinctExpressions,

Review comment:
       ```suggestion
           groupingExpressions = groupingExpressions ++ normalizedNamedDistinctExpressions,
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248244


   **[Test build #124345 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124345/testReport)** for PR 28876 at commit [`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647085024






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443266734



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       This works correctly, @viirya . As an architectural perspective, I'm wondering if we can collect these `NormalizeFloatingNumbers.normalize` application in `SparkStrategies` instead of putting `AggUtils`. Actually, in `SparkStragegies`, we have the same logic for `normalizedGroupingExpressions` and missed this conversion at `functionsWithDistinct`. IMO, it would be easier if we keep this converting in one place, `SparkStragegies`, in order to avoid missing corner cases like this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following.
   ```scala
           val (functionsWithDistinct, functionsWithoutDistinct) =
             aggregateExpressions.partition(_.isDistinct)
           if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
             // This is a sanity check. We should not reach here when we have multiple distinct
             // column sets. Our `RewriteDistinctAggregates` should take care this case.
             sys.error("You hit a query analyzer bug. Please report your query to " +
                 "Spark user mailing list.")
           }
   ```
   
   And the very next line is the same logic block for `groupingExpression`.
   ```scala
           // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
           // `groupingExpressions` is not extracted during logical phase.
           val normalizedGroupingExpressions = groupingExpressions.map { e =>
             NormalizeFloatingNumbers.normalize(e) match {
               case n: NamedExpression => n
               case other => Alias(other, e.name)(exprId = e.exprId)
             }
           }
   ```
   
   Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And, the comment is about the definition of `functionsWithDistinct` which is generated from the above. The comment is not a detail hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646956695


   **[Test build #124316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124316/testReport)** for PR 28876 at commit [`d11001c`](https://github.com/apache/spark/commit/d11001cae9382a75bb42edc1e4925325704a5ba9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647246334






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647245921






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647163464






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443172691



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.

Review comment:
       nit: maybe `groupingExpressions ` -> `distinctExpressions`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443290777



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.

Review comment:
       Maybe typo? `groupingExpressions` -> `distinctExpressions`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443172768



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>

Review comment:
       @viirya, out of curiosity, can't we do this in the same place as https://github.com/apache/spark/blob/17586f9ed25cbb1dd7dbb221aecb8749623e2175/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L327-L334?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647053554






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647967700


   late LGTM, thanks, @viirya 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647048947


   cc @gatorsmile too since this is a correctness issue at 3.0.0. I believe we need to include this in 3.0.1.
   cc @HyukjinKwon since he is interested in 3.0.1 release.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647072192


   **[Test build #124328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124328/testReport)** for PR 28876 at commit [`dd66f7e`](https://github.com/apache/spark/commit/dd66f7eabf6e80b1b08dec6fe6790f0b97943e67).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647131433


   **[Test build #124339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124339/testReport)** for PR 28876 at commit [`0026e14`](https://github.com/apache/spark/commit/0026e143f97296dade947bda01ec07d661575ecc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647328741






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443271061



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       We could have `NormalizeFloatingNumbers.normalize` in `AggUtils`, but we need to have it both in `AggUtils.planAggregateWithoutDistinct` and `AggUtils.planAggregateWithOneDistinct`.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647053489


   **[Test build #124328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124328/testReport)** for PR 28876 at commit [`dd66f7e`](https://github.com/apache/spark/commit/dd66f7eabf6e80b1b08dec6fe6790f0b97943e67).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647067619






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647329566


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647471220






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443269072



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       Generally I agree. However, we need to use `distinctExpressions` for the `normalizedNamedDistinctExpressions`. 
   
   If we want to put this `NormalizeFloatingNumbers` invocation in `SparkStrategies`, we need to move `distinctExpressions` from `AggUtils.planAggregateWithOneDistinct` to `SparkStrategies` too. 
   
   It would look like, in `SparkStrategies`:
   
   ```scala
   if (functionsWithDistinct.isEmpty) {
     AggUtils.planAggregateWithoutDistinct(
       normalizedGroupingExpressions,
       aggregateExpressions,
       resultExpressions,
       planLater(child))
   } else {
     // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
     // DISTINCT aggregate function, all of those functions will have the same column expressions.
     // For example, it would be valid for functionsWithDistinct to be
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
     val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
         // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
         // `groupingExpressions` is not extracted during logical phase.
         NormalizeFloatingNumbers.normalize(e) match {
           case ne: NamedExpression => ne
           case other => Alias(other, other.toString)()
         }
       }
     AggUtils.planAggregateWithOneDistinct(
       normalizedGroupingExpressions,
       functionsWithDistinct,
       functionsWithoutDistinct,
       distinctExpressions,
       normalizedNamedDistinctExpressions,
       resultExpressions,
       planLater(child))
   }
   ```
   
   This leaks more details from `AggUtil` in `SparkStrategies`. Looks not pretty good.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443273718



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       I left a comment for the same thing, and just resolved it by myself. I don't have a strong opinion on this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647048542


   Thank you for pinging me, @viirya .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647471220






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647053554






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647248573






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443165193



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -148,7 +149,15 @@ object AggUtils {
       case ne: NamedExpression => ne
       case other => Alias(other, other.toString)()
     }
-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because

Review comment:
       Looks okay to combine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647066600


   Looks right to me


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443173523



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)
+
+      Seq(("mithunr", Float.NaN),
+        ("mithunr", POS_NAN_1),
+        ("mithunr", POS_NAN_2),
+        ("abellina", 1.0f),
+        ("abellina", 2.0f)).toDF("uid", "score").createOrReplaceTempView("view")
+
+      val df = spark.sql(" select uid, count(distinct score) from view group by 1 order by 1 asc ")

Review comment:
       Fixed. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647084865


   **[Test build #124330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124330/testReport)** for PR 28876 at commit [`6764a36`](https://github.com/apache/spark/commit/6764a364126bdd018077dcb6130124d46c1f60ba).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443218025



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

Review comment:
       nit: any reason to use uppercases for the variables?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443173161



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>

Review comment:
       Ah, NVM. `distinctExpressions` alone should be normalized here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647329312






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647471531


   Thank you all. Merged to master/3.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647330963


   **[Test build #124354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124354/testReport)** for PR 28876 at commit [`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443266734



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       This works correctly, @viirya . However, in architectural perspectives, I'm wondering if we can collect these `NormalizeFloatingNumbers.normalize` application in `SparkStrategies` instead of putting `AggUtils`. Actually, in `SparkStragegies`, we have the same logic for `normalizedGroupingExpressions` and missed this conversion at `functionsWithDistinct`. IMO, it would be easier if we keep this converting in one place, `SparkStragegies`, in order to avoid missing corner cases like this.
   
   This is just a suggestion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647163160


   **[Test build #124339 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124339/testReport)** for PR 28876 at commit [`0026e14`](https://github.com/apache/spark/commit/0026e143f97296dade947bda01ec07d661575ecc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443272106



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       If moving into `AggUtils` sounds better for you all, I can move it. WDYT?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646984783


   **[Test build #124316 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124316/testReport)** for PR 28876 at commit [`d11001c`](https://github.com/apache/spark/commit/d11001cae9382a75bb42edc1e4925325704a5ba9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647331634






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647053313


   @abellina @dongjoon-hyun Thanks for comment.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647085024






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443297885



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       It looks weird if we do normalization in both `SparkStrategies` and `AggUtils`, better to put them in one place.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647072406






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647067566


   **[Test build #124330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124330/testReport)** for PR 28876 at commit [`6764a36`](https://github.com/apache/spark/commit/6764a364126bdd018077dcb6130124d46c1f60ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647163464






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647245921


   **[Test build #124344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124344/testReport)** for PR 28876 at commit [`e5211c3`](https://github.com/apache/spark/commit/e5211c3538584c2bcb6ebe0569278a01dc313857).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-646984991






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298784



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.aggregate
 
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers

Review comment:
       nit: I guess it can be removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] abellina commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
abellina commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647045875


   @viirya thanks for looking at this issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647053489


   **[Test build #124328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124328/testReport)** for PR 28876 at commit [`dd66f7e`](https://github.com/apache/spark/commit/dd66f7eabf6e80b1b08dec6fe6790f0b97943e67).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647329312






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647470012


   **[Test build #124354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124354/testReport)** for PR 28876 at commit [`bc159ab`](https://github.com/apache/spark/commit/bc159ab6d5191a9b7e01bafd7a65d39c03eb419c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647331634






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #28876:
URL: https://github.com/apache/spark/pull/28876


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647246339






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443270046



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       Rather, we cannot move the existing `NormalizeFloatingNumbers` logic (int `SparkStrategies`) into AggUtils?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443291919



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       If we broaden the scope, `SparkStrategies` already is looking at the detail of `functionsWithDistinct` like the following.
   ```scala
           val (functionsWithDistinct, functionsWithoutDistinct) =
             aggregateExpressions.partition(_.isDistinct)
           if (functionsWithDistinct.map(_.aggregateFunction.children.toSet).distinct.length > 1) {
             // This is a sanity check. We should not reach here when we have multiple distinct
             // column sets. Our `RewriteDistinctAggregates` should take care this case.
             sys.error("You hit a query analyzer bug. Please report your query to " +
                 "Spark user mailing list.")
           }
   ```
   
   And the very next line is the same logic block for `groupingExpression`.
   ```scala
           // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
           // `groupingExpressions` is not extracted during logical phase.
           val normalizedGroupingExpressions = groupingExpressions.map { e =>
             NormalizeFloatingNumbers.normalize(e) match {
               case n: NamedExpression => n
               case other => Alias(other, e.name)(exprId = e.exprId)
             }
           }
   ```
   
   Given the above, I guess what you concerned is only one line source code, `val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children`.
   
   And, the following comment is about the definition of `functionsWithDistinct` which is generated by `SparkStrategies`. So, it's not a leaked detail or hidden from `SparkStrategies`. For me, it seems to be an assumption given by `SparkStrategies`.
   ```
   // functionsWithDistinct is guaranteed to be non-empty. Even though it may contain more than one
     // DISTINCT aggregate function, all of those functions will have the same column expressions.
     // For example, it would be valid for functionsWithDistinct to be
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647131600






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443266734



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.
+      NormalizeFloatingNumbers.normalize(e) match {
+        case ne: NamedExpression => ne
+        case other => Alias(other, other.toString)()

Review comment:
       This works correctly, @viirya . However, in architectural perspectives, I'm wondering if we can collect these `NormalizeFloatingNumbers.normalize` invocation in `SparkStrategies` instead of putting `AggUtils`. Actually, in `SparkStragegies`, we have the same logic for `normalizedGroupingExpressions` and missed this conversion at `functionsWithDistinct`. IMO, it would be easier if we keep this converting in one place, `SparkStragegies`, in order to avoid missing corner cases like this.
   
   This is just a suggestion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647067566


   **[Test build #124330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124330/testReport)** for PR 28876 at commit [`6764a36`](https://github.com/apache/spark/commit/6764a364126bdd018077dcb6130124d46c1f60ba).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647067619






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28876:
URL: https://github.com/apache/spark/pull/28876#issuecomment-647131433


   **[Test build #124339 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124339/testReport)** for PR 28876 at commit [`0026e14`](https://github.com/apache/spark/commit/0026e143f97296dade947bda01ec07d661575ecc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443220786



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala
##########
@@ -1012,4 +1012,20 @@ class DataFrameAggregateSuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate") {
+    withTempView("view") {
+      val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
+      val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

Review comment:
       hahaha, I see. Yea, I think its better to rename it, too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #28876: [SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28876:
URL: https://github.com/apache/spark/pull/28876#discussion_r443298442



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -144,11 +145,16 @@ object AggUtils {
     // [COUNT(DISTINCT foo), MAX(DISTINCT foo)], but [COUNT(DISTINCT bar), COUNT(DISTINCT foo)] is
     // disallowed because those two distinct aggregates have different column expressions.
     val distinctExpressions = functionsWithDistinct.head.aggregateFunction.children
-    val namedDistinctExpressions = distinctExpressions.map {
-      case ne: NamedExpression => ne
-      case other => Alias(other, other.toString)()
+    val normalizedNamedDistinctExpressions = distinctExpressions.map { e =>
+      // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+      // `groupingExpressions` is not extracted during logical phase.

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org