Posted to issues@spark.apache.org by "huldar chen (Jira)" <ji...@apache.org> on 2022/11/25 04:46:00 UTC

[jira] [Comment Edited] (SPARK-41236) The renamed field name cannot be recognized after group filtering

    [ https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638493#comment-17638493 ] 

huldar chen edited comment on SPARK-41236 at 11/25/22 4:45 AM:
---------------------------------------------------------------

If an aggregated column in the select list is aliased to the name of an existing column in the table, the reference to that alias in the HAVING clause is bound to the original column ID from the table rather than to the alias, resulting in an exception. The name should be resolved against the Aggregate's aggregateExpressions first.
For example:

case 1:
{code:java}
select collect_set(testdata2.a) as a
from testdata2
group by b
having size(a) > 0 {code}
The analyzed plan is:
{code:java}
'Filter (size(tempresolvedcolumn(a#3, a), true) > 0)
+- Aggregate [b#4], [collect_set(a#3, 0, 0) AS a#44]
   +- SubqueryAlias testdata2
      +- View (`testData2`, [a#3,b#4])
         +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
            +- ExternalRDD [obj#2] {code}
Here tempresolvedcolumn(a#3, a) should bind to a#44 instead.
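
A minimal, self-contained reproduction sketch (assuming a local Spark 3.x session; the column names mirror the testData2 view above, but the sample rows are made up):
{code:scala}
import org.apache.spark.sql.SparkSession

object Spark41236Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("SPARK-41236").getOrCreate()
    import spark.implicits._

    // Same shape as SQLTestData.TestData2: two int columns a and b (sample rows made up)
    Seq((1, 1), (1, 2), (2, 1), (3, 2)).toDF("a", "b").createOrReplaceTempView("testdata2")

    // Case 1: the alias `a` shadows the base column testdata2.a;
    // size(a) in HAVING should refer to the collect_set result (a#44 in the plan above)
    spark.sql(
      """select collect_set(testdata2.a) as a
        |from testdata2
        |group by b
        |having size(a) > 0""".stripMargin).show()

    spark.stop()
  }
}
{code}
In affected versions the HAVING reference is resolved against the base column a#3 instead of the alias, which is the mis-binding shown in the plan above.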

 

case 2:
{code:java}
select collect_set(testdata2.a) as b
from testdata2
group by b
having size(b) > 0 {code}
The analyzed plan is:
{code:java}
'Project [b#44]
+- 'Filter (size(tempresolvedcolumn(b#4, b), true) > 0)
   +- Aggregate [b#4], [collect_set(a#3, 0, 0) AS b#44, b#4]
      +- SubqueryAlias testdata2
         +- View (`testData2`, [a#3,b#4])
            +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4]
               +- ExternalRDD [obj#2]
 {code}
Here tempresolvedcolumn(b#4, b) should bind to b#44 instead.
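
Case 2 can be reproduced with the same session and view as the case 1 sketch above; here the alias collides with the grouping column rather than with the aggregated column:
{code:scala}
// Re-uses `spark` and the testdata2 temp view from the case 1 sketch above.
// Case 2: the alias `b` collides with the grouping column b, so size(b) in HAVING
// is resolved against b#4 from the base relation instead of the collect_set result b#44.
spark.sql(
  """select collect_set(testdata2.a) as b
    |from testdata2
    |group by b
    |having size(b) > 0""".stripMargin).show()
{code}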

The buggy code is resolveCol inside org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveAggregateFunctions#resolveExprsWithAggregate.
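
The idea of resolving against aggregateExpressions first can be illustrated with a small, purely hypothetical sketch (simplified types, not the actual Catalyst API):
{code:scala}
// Purely illustrative sketch of the proposed resolution order; Attr is a stand-in
// for an attribute reference (name + expression id), NOT the real Catalyst API.
case class Attr(name: String, exprId: Long)

def resolveHavingColumn(
    name: String,
    aggregateOutput: Seq[Attr],  // aliases produced by the Aggregate, e.g. a#44
    childOutput: Seq[Attr]       // columns of the underlying relation, e.g. a#3, b#4
): Option[Attr] = {
  // 1) prefer the columns the Aggregate itself produces (the renamed columns) ...
  aggregateOutput.find(_.name == name)
    // 2) ... and only fall back to the child's columns if no alias matches
    .orElse(childOutput.find(_.name == name))
}

// resolveHavingColumn("a", Seq(Attr("a", 44)), Seq(Attr("a", 3), Attr("b", 4)))
//   returns Some(Attr("a", 44)): size(a) in HAVING binds to a#44 rather than a#3.
{code}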

 


> The renamed field name cannot be recognized after group filtering
> -----------------------------------------------------------------
>
>                 Key: SPARK-41236
>                 URL: https://issues.apache.org/jira/browse/SPARK-41236
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: jingxiong zhong
>            Priority: Major
>
> {code:java}
> select collect_set(age) as age
> from db_table.table1
> group by name
> having size(age) > 1 
> {code}
> A simple SQL query: it works well in Spark 2.4, but doesn't work in Spark 3.2.0.
> Is this a bug or a new standard?
> h3. *For example:*
> {code:sql}
> create table db1.table1(age int, name string);
> insert into db1.table1 values(1, 'a');
> insert into db1.table1 values(2, 'b');
> insert into db1.table1 values(3, 'c');
> -- then run a query like this
> select collect_set(age) as age from db1.table1 group by name having size(age) > 1 ;
> {code}
> h3. Stack Information
> org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input columns: [age]; line 4 pos 12;
> 'Filter (size('age, true) > 1)
> +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0]
>    +- SubqueryAlias spark_catalog.db1.table1
>       +- HiveTableRelation [`db1`.`table1`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], Partition Cols: []]
> 	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
> 	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128)
> 	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127)
> 	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532)
> 	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154)
> 	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153)
> 	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181)
> 	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196)
> 	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192)
> 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88)
> 	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
> 	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
> 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> 	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
> 	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88)
> 	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86)
> 	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78)
> 	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98)
> 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> 	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
> 	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
> 	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> 	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org