Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/04/25 00:47:00 UTC
[jira] [Commented] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;)

    [ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527264#comment-17527264 ] 

Hyukjin Kwon commented on SPARK-38983:
--------------------------------------

Is the issue about the error message, or are you saying that str and bool should work?

> Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38983
>                 URL: https://issues.apache.org/jira/browse/SPARK-38983
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2, 3.2.1
>         Environment: I have reproduced this error in two environments. I would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, which runs Spark version 3.1.2. I have limited access to cluster configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the PySpark shell from Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to environment information here. Running {{spark-submit --version}} produced the following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
>            Reporter: Chris Kimmel
>            Priority: Minor
>              Labels: cube, error_message_improvement, exception-handling, grouping, rollup
>
> h1. Code to reproduce
> {{print(spark.version) # My environment, Azure DataBricks, defines spark automatically.}}
> {{from pyspark.sql import functions as f}}
> {{from pyspark.sql import types as t}}
> {{l = [}}
> {{  ('a',),}}
> {{  ('b',),}}
> {{]}}
> {{s = t.StructType([}}
> {{  t.StructField('col1', t.StringType())}}
> {{])}}
> {{df = spark.createDataFrame(l, s)}}
> {{df.display()}}
> {{( # This expression raises an AnalysisException()}}
> {{  df}}
> {{  .cube(f.col('col1'))}}
> {{  .agg(f.grouping('col1') & f.lit(True))}}
> {{  .collect()}}
> {{)}}
> h1. Expected results
> The code should raise an {{AnalysisException()}} with an error message along the lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
> h1. Actual results
> The code raises an {{AnalysisException()}} with the error message
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {{---------------------------------------------------------------------------}}
> {{AnalysisException                         Traceback (most recent call last)}}
> {{<command-2283735107422632> in <module>}}
> {{     15 }}
> {{     16 ( # This expression raises an AnalysisException()}}
> {{---> 17   df}}
> {{     18   .cube(f.col('col1'))}}
> {{     19   .agg(f.grouping('col1') & f.lit(True))}}
> {{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
> {{    116             # Columns}}
> {{    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"}}
> {{--> 118             jdf = self._jgd.agg(exprs[0]._jc,}}
> {{    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))}}
> {{    120         return DataFrame(jdf, self.sql_ctx)}}
> {{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
> {{   1302 }}
> {{   1303         answer = self.gateway_client.send_command(command)}}
> {{-> 1304         return_value = get_return_value(}}
> {{   1305             answer, self.gateway_client, self.target_id, self.name)}}
> {{   1306 }}
> {{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{    121                 # Hide where the exception came from that shows a non-Pythonic}}
> {{    122                 # JVM exception message.}}
> {{--> 123                 raise converted from None}}
> {{    124             else:}}
> {{    125                 raise}}
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> {{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
> {{+- LogicalRDD [col1#548], false}}
> h1. Workaround
> Cast the result of {{.grouping()}} to boolean type. That is, know _ab ovo_ that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or False.
> {{(  # This expression does not raise an AnalysisException()}}
> {{  df}}
> {{  .cube(f.col('col1'))}}
> {{  .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
> {{  .collect()}}
> {{)}}
> h1. Additional notes
> The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to reproduce".
> The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in "Code to reproduce".
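The grouping_id() case fails for the same reason: its result is also an integer rather than a boolean. The Spark documentation describes it as (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn). The following is a pure-Python model of that formula, not Spark's implementation, and the function names are hypothetical.

```python
from itertools import product


def grouping_id(flags):
    """Combine per-column grouping() flags (each 0 or 1) into one integer.

    The first column ends up in the most significant bit, matching
    (grouping(c1) << (n-1)) + ... + grouping(cn).
    """
    gid = 0
    for flag in flags:
        gid = (gid << 1) | flag
    return gid


def cube_grouping_ids(n_cols):
    """List the grouping ids for every grouping set a cube over n_cols columns produces."""
    return [grouping_id(flags) for flags in product((0, 1), repeat=n_cols)]


print(cube_grouping_ids(1))
print(cube_grouping_ids(2))
```

Because the value is a bit vector and not a truth value, it has to be compared with a specific integer (or cast) before it can be combined with boolean operators.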
> h1. Related tickets
> https://issues.apache.org/jira/browse/SPARK-22748
> h1. Relevant documentation
>  * [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
>  * [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
>  * [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
>  * [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
>  * [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
>  * [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]
>  


