Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/04/25 00:47:00 UTC
[jira] [Commented] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;)
[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527264#comment-17527264 ]
Hyukjin Kwon commented on SPARK-38983:
--------------------------------------
Is the issue about the error message, or are you saying that str and bool should work?
> Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;)
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-38983
> URL: https://issues.apache.org/jira/browse/SPARK-38983
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.2, 3.2.1
> Environment: I have reproduced this error in two environments. I would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, which runs Spark version 3.1.2. I have limited access to cluster configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to environment information here. Running {{spark-submit --version}} produced the following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
> Reporter: Chris Kimmel
> Priority: Minor
> Labels: cube, error_message_improvement, exception-handling, grouping, rollup
>
> h1. Code to reproduce
> {{print(spark.version) # My environment, Azure DataBricks, defines spark automatically.}}
> {{from pyspark.sql import functions as f}}
> {{from pyspark.sql import types as t}}
> {{l = [}}
> {{ ('a',),}}
> {{ ('b',),}}
> {{]}}
> {{s = t.StructType([}}
> {{ t.StructField('col1', t.StringType())}}
> {{])}}
> {{df = spark.createDataFrame(l, s)}}
> {{df.display()}}
> {{( # This expression raises an AnalysisException()}}
> {{ df}}
> {{ .cube(f.col('col1'))}}
> {{ .agg(f.grouping('col1') & f.lit(True))}}
> {{ .collect()}}
> {{)}}
> h1. Expected results
> The code should raise an {{AnalysisException()}} with an error message along the lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
> h1. Actual results
> The code throws an {{AnalysisException()}} with the error message:
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {{---------------------------------------------------------------------------}}
> {{AnalysisException Traceback (most recent call last)}}
> {{<command-2283735107422632> in <module>}}
> {{ 15 }}
> {{ 16 ( # This expression raises an AnalysisException()}}
> {{---> 17 df}}
> {{ 18 .cube(f.col('col1'))}}
> {{ 19 .agg(f.grouping('col1') & f.lit(True))}}
> {{/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)}}
> {{ 116 # Columns}}
> {{ 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"}}
> {{--> 118 jdf = self._jgd.agg(exprs[0]._jc,}}
> {{ 119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))}}
> {{ 120 return DataFrame(jdf, self.sql_ctx)}}
> {{/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)}}
> {{ 1302 }}
> {{ 1303 answer = self.gateway_client.send_command(command)}}
> {{-> 1304 return_value = get_return_value(}}
> {{ 1305 answer, self.gateway_client, self.target_id, self.name)}}
> {{ 1306 }}
> {{/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)}}
> {{ 121 # Hide where the exception came from that shows a non-Pythonic}}
> {{ 122 # JVM exception message.}}
> {{--> 123 raise converted from None}}
> {{ 124 else:}}
> {{ 125 raise}}
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> {{'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]}}
> {{+- LogicalRDD [col1#548], false}}
> h1. Workaround
> Cast the result of {{.grouping()}} to boolean type. That is, know _ab ovo_ that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or False.
> {{( # This expression does not raise an AnalysisException()}}
> {{ df}}
> {{ .cube(f.col('col1'))}}
> {{ .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))}}
> {{ .collect()}}
> {{)}}
> h1. Additional notes
> The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to reproduce".
> The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in "Code to reproduce".
> h1. Related tickets
> https://issues.apache.org/jira/browse/SPARK-22748
> h1. Relevant documentation
> * [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
> * [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
> * [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
> * [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
> * [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
> * [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]
>
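The workaround quoted above depends on {{.grouping()}} returning an integer 0 or 1 rather than a boolean. The following is a minimal pure-Python sketch of that behaviour, not Spark's implementation: it enumerates the grouping sets a cube produces and computes a per-column grouping flag plus the packed grouping id. The helper names ({{cube_grouping_sets}}, {{grouping}}, {{grouping_id}}) and the second column {{col2}} are assumptions made for illustration and do not appear in the report.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every subset of `columns`, largest first, as a cube would group by them."""
    cols = list(columns)
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]


def grouping(column, grouping_set):
    """Return 1 if `column` is aggregated away in this grouping set, else 0 (an int, not a bool)."""
    return 0 if column in grouping_set else 1


def grouping_id(columns, grouping_set):
    """Pack the per-column grouping flags into one integer, first column most significant."""
    result = 0
    for column in columns:
        result = (result << 1) | grouping(column, grouping_set)
    return result


# Hypothetical two-column example; the report itself only uses col1.
columns = ["col1", "col2"]
rows = [
    (subset, [grouping(c, subset) for c in columns], grouping_id(columns, subset))
    for subset in cube_grouping_sets(columns)
]
for subset, flags, gid in rows:
    print(subset, flags, gid)
```

Because the flags in this model are integers, combining one with a boolean literal mixes int and boolean operands, which is the type mismatch the reporter expected the error message to describe.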