Posted to issues@spark.apache.org by "Wenchen Fan (JIRA)" <ji...@apache.org> on 2018/11/13 17:59:26 UTC
[jira] [Resolved] (SPARK-25942) Aggregate expressions shouldn't be resolved on AppendColumns

     [ https://issues.apache.org/jira/browse/SPARK-25942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-25942.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 22944
[https://github.com/apache/spark/pull/22944]

> Aggregate expressions shouldn't be resolved on AppendColumns
> ------------------------------------------------------------
>
>                 Key: SPARK-25942
>                 URL: https://issues.apache.org/jira/browse/SPARK-25942
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Dataset.groupByKey brings in new attributes from the key serializer. If the key type is the same as the original Dataset's object type, both serializers produce the same output, so the attribute names conflict.
> In most cases this is not a problem, as long as we don't refer to the conflicting attributes:
> {code:java}
> val ds: Dataset[(ClassData, Long)] = Seq(ClassData("one", 1), ClassData("two", 2)).toDS()
>   .map(c => ClassData(c.a, c.b + 1))
>   .groupByKey(p => p).count()
>  {code}
> But if we refer to the conflicting attributes, `Analyzer` complains about ambiguous references:
> {code}
> val ds = Seq(1, 2, 3).toDS()
> val agg = ds.groupByKey(_ >= 2).agg(sum("value").as[Long], sum($"value" + 1).as[Long])
> {code}
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: Reference 'value' is ambiguous, could be: value, value.;
> [info]   at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:247)
> [info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
> [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$38.apply(Analyzer.scala:889)
> [info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$38.apply(Analyzer.scala:891)
> ...
> {code}
> Based on the API documentation and the implementation details of KeyValueGroupedDataset, aggregate expressions on a KeyValueGroupedDataset should not be allowed to access the key attributes.
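
The ambiguity described above can be illustrated with a toy model of name-based resolution. This is a minimal sketch in plain Scala with no Spark dependency, not Catalyst code: `Attr`, `resolve`, and `AmbiguityDemo` are hypothetical names introduced only for illustration. It assumes that an AppendColumns-like node exposes its child's columns followed by the key serializer's columns, and that both a `Dataset[Int]` and a Boolean key serialize to a single column named `value`.

```scala
// Toy model of name-based attribute resolution. Hypothetical names, not Spark APIs.
final case class Attr(name: String, id: Int)

object AmbiguityDemo {
  // Resolve a column name against the attributes a plan node exposes.
  def resolve(name: String, output: Seq[Attr]): Either[String, Attr] =
    output.filter(_.name == name) match {
      case Seq(single) => Right(single)
      case Seq()       => Left(s"cannot resolve '$name'")
      case many =>
        Left(s"Reference '$name' is ambiguous, could be: ${many.map(_.name).mkString(", ")}.")
    }

  // Assumption: the original Dataset[Int] exposes one column named "value".
  val childOutput: Seq[Attr] = Seq(Attr("value", 1))

  // Assumption: the key serializer adds a distinct attribute with the same name.
  val keyOutput: Seq[Attr] = Seq(Attr("value", 2))

  // AppendColumns-like node: child columns followed by the new key columns.
  val appendColumnsOutput: Seq[Attr] = childOutput ++ keyOutput

  def main(args: Array[String]): Unit = {
    // Resolving against the combined output finds two candidates named "value".
    println(resolve("value", appendColumnsOutput))
    // Resolving against the child output alone finds exactly one.
    println(resolve("value", childOutput))
  }
}
```

In this model, resolving `value` against the combined output fails with two candidates, while resolving against the child's output alone succeeds. That mirrors the idea in the issue title of not resolving aggregate expressions on AppendColumns; how pull request 22944 actually implements this is not shown here.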


