You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/23 07:07:25 UTC

[GitHub] [spark] itholic opened a new pull request, #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

itholic opened a new pull request, #38769:
URL: https://github.com/apache/spark/pull/38769

   ### What changes were proposed in this pull request?
   
   This PR proposes to rename `COLUMN_NOT_IN_GROUP_BY_CLAUSE` to `MISSING_AGGREGATION`.
   
   Also, improve its error message.
   
   ### Why are the changes needed?
   
   The current error class name and its error message doesn't illustrate the error cause and resolution correctly.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   ```
   ./build/sbt “sql/testOnly org.apache.spark.sql.SQLQueryTestSuite*”
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and I checked the first <expression> one is also affected, so it looks a bit wired IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   1. Maybe should we separate the parameters into 2 parts something like <expression> and <expression_any_value> ??
   
   2. Or maybe we can simply recommend to use the `any_value()` function as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1030326286


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Yeah, it looks like the example below:
   
   ```
   [MISSING_AGGREGATION] The non-aggregating expression "c2" is based on columns which are not participating in the GROUP BY clause.
   Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value("c2")" if you do not care which of the values within a group is returned.;
   ```
   
   Can we fix it to ``` `any_value("...")` ``` or something ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on PR #38769:
URL: https://github.com/apache/spark/pull/38769#issuecomment-1333268350

   +1, LGTM. Merging to master.
   Thank you, @itholic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and I checked the first <expression> one is also affected, so it looks a bit wired IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   1. Maybe should we separate the parameters into 2 parts something like <expression> and <expression_any_value> ??
   
   ```scala
       new AnalysisException(
         errorClass = "MISSING_AGGREGATION",
         messageParameters = Map(
           "expression" -> toSQLExpr(expression)
           "expressionAnyValue" -> toSQLExpr(new AnyValue(expression)))
   ```
   
   2. Or maybe we can simply recommend to use the `any_value()` function as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #38769:
URL: https://github.com/apache/spark/pull/38769#issuecomment-1325994942

   Fixed the Python test first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1035556872


##########
python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py:
##########
@@ -475,7 +475,7 @@ def test_invalid_args(self):
         mean_udf = self.pandas_agg_mean_udf
 
         with QuietTest(self.sc):
-            with self.assertRaisesRegex(AnalysisException, "nor.*aggregate function"):
+            with self.assertRaisesRegex(AnalysisException, "[MISSING_AGGREGATION]"):

Review Comment:
   We should test this with similar logic with `checkError` in Scala side.
   
   Let me create the JIRA and update soon after getting some feedback from community.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk closed pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk closed pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
URL: https://github.com/apache/spark/pull/38769


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and I checked the first <expression> one is also affected, so it looks a bit wired IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   1. Maybe should we separate the parameters into 2 parts something like <expression> and <expression_any_value> ??
   
   ```scala
       new AnalysisException(
         errorClass = "MISSING_AGGREGATION",
         messageParameters = Map(
           "expression" -> toSQLExpr(expression)
           "expression" -> toSQLExpr(new AnyValue(expression)))
   ```
   
   2. Or maybe we can simply recommend to use the `any_value()` function as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1030326286


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Yeah, it looks like the example below:
   
   ```
   [MISSING_AGGREGATION] The non-aggregating expression "c2" is based on columns which are not participating in the GROUP BY clause.
   Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value("c2")" if you do not care which of the values within a group is returned.;
   ```
   
   Do we want to fix it to ``` `any_value("...")` ``` or something ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031072613


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Cool. Let me update soon.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on PR #38769:
URL: https://github.com/apache/spark/pull/38769#issuecomment-1328769206

   @itholic Please, rebase on the recent master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1030254794


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   `expression` is quoted already, correct? How this `\"any_value(<expression>)\"` looks like `"any_value("...")"`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031085119


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Or maybe we simply recommend to use the `any_value()` function as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?



##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   Or maybe we can simply recommend to use the `any_value()` function as below:
   
   ```
   ... aggregate the expression, or use `any_value()` for <expression> if you do not care ...
   ```
   
   something like that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1035556872


##########
python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py:
##########
@@ -475,7 +475,7 @@ def test_invalid_args(self):
         mean_udf = self.pandas_agg_mean_udf
 
         with QuietTest(self.sc):
-            with self.assertRaisesRegex(AnalysisException, "nor.*aggregate function"):
+            with self.assertRaisesRegex(AnalysisException, "[MISSING_AGGREGATION]"):

Review Comment:
   I think maybe we should test this with similar logic to `checkError` in Scala side.
   
   Let me create the JIRA and update soon after getting some feedback from community.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1033295996


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   This issue has been fixed by https://github.com/apache/spark/pull/38805



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on PR #38769:
URL: https://github.com/apache/spark/pull/38769#issuecomment-1331119124

   @itholic Could you fix the test failure. It seems it is related to your changes:
   ```
   python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py.test_invalid_args
   "COLUMN_NOT_IN_GROUP_BY_CLAUSE" does not match "[MISSING_AGGREGATION] The non-aggregating expression "v" is based on columns which are not participating in the GROUP BY clause.
   Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value("v")" if you do not care which of the values within a group is returned.;
   Aggregate [id#668L], [id#668L, plus_one(v#674)#687 AS plus_one(v)#688]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1035860841


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   @itholic Since the PR has been merged, I am waiting for when you update your PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and I checked the first <expression> one is also affected, so it looks a bit wired IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   Maybe should we separate the parameters into 2 parts like <expression> and <expression_any_value> ??



##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   I just tried your suggestion, and I checked the first <expression> one is also affected, so it looks a bit wired IMO.
   For example, the error message will be shown like:
   
   ```
   ... The non-aggregating expression "anyvalue(c2)" is based on columns ...
   ... aggregate the expression, or use "any_value(c2)" if you do not care ...
   ```
   
   Maybe should we separate the parameters into 2 parts something like <expression> and <expression_any_value> ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1031089197


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   > "anyvalue(c2)" .. "any_value(c2)"
   
   hmm, could you open a PR and override `prettyName` in `AnyValue`, and pass `prettyName` to `FirstLast.validateIgnoreNullExpr()`. Also, cc @vitaliili-db 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.

Posted by GitBox <gi...@apache.org>.

MaxGekk commented on code in PR #38769:
URL: https://github.com/apache/spark/pull/38769#discussion_r1030594602


##########
core/src/main/resources/error/error-classes.json:
##########
@@ -785,6 +779,13 @@
       "Malformed Protobuf messages are detected in message deserialization. Parse Mode: <failFastMode>. To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'."
     ]
   },
+  "MISSING_AGGREGATION" : {
+    "message" : [
+      "The non-aggregating expression <expression> is based on columns which are not participating in the GROUP BY clause.",
+      "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value(<expression>)\" if you do not care which of the values within a group is returned."

Review Comment:
   How about just leave `expression` in the error format:
   ```
   ... or use <expression> if you do not care which ...
   ```
   and pass the message parameter:
   ```scala
       new AnalysisException(
         errorClass = "MISSING_AGGREGATION",
         messageParameters = Map("expression" -> toSQLExpr(new AnyValue(expression)))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org