Posted to issues@spark.apache.org by "Irina Truong (JIRA)" <ji...@apache.org> on 2017/03/21 22:00:43 UTC
[jira] [Comment Edited] (SPARK-4296) Throw "Expression not in GROUP
BY" when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935409#comment-15935409 ]
Irina Truong edited comment on SPARK-4296 at 3/21/17 9:59 PM:
--------------------------------------------------------------
I have the same exception with pyspark when my expression uses a compiled and registered Scala UDF. This is how it's registered:
{noformat}
sqlContext.registerJavaFunction("round_date", 'my.package.RoundDate')
{noformat}
And this is how it's called:
{noformat}
ipdb> sqlContext.sql("SELECT round_date(t.ts, '1day') from (select timestamp('2017-02-02T10:11:12') as ts union select timestamp('2017-02-02T10:19:00') as ts) as t group by round_date(t.ts, '1day')").show()
*** AnalysisException: u"expression 't.`ts`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;\nAggregate [UDF(ts#80, 1day)], [UDF(ts#80, 1day) AS UDF(ts, 1day)#82]\n+- SubqueryAlias t\n +- Distinct\n +- Union\n :- Project [cast(2017-02-02T10:11:12 as timestamp) AS ts#80]\n : +- OneRowRelation$\n +- Project [cast(2017-02-02T10:19:00 as timestamp) AS ts#81]\n +- OneRowRelation$\n"
{noformat}
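A common workaround for this class of error (besides the `first()` wrapping suggested in the message) is to compute the expression once in a subquery, alias it, and GROUP BY the alias rather than repeating the expression. The sketch below demonstrates the pattern with Python's stdlib sqlite3 instead of Spark, so the table and `date(ts)` (standing in for `round_date(t.ts, '1day')`) are illustrative, not from the original report:

```python
# Hedged sketch: evaluate the grouping expression once in an inner query,
# then group the outer query by the alias. Uses sqlite3 (stdlib), not Spark;
# date(ts) is a stand-in for the round_date UDF in the report.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2017-02-02T10:11:12",), ("2017-02-02T10:19:00",)],
)

# The inner query computes date(ts) exactly once and names it `day`;
# the outer query groups by that alias, so the expression never has to
# be matched against itself.
rows = conn.execute(
    "SELECT day, COUNT(*) FROM "
    "(SELECT date(ts) AS day FROM events) "
    "GROUP BY day"
).fetchall()
print(rows)  # [('2017-02-02', 2)]
```

Because the alias refers to an already-computed column, the analyzer never needs to prove that the SELECT-list expression equals the GROUP BY expression.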
> Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Reporter: Shixiong Zhu
> Assignee: Cheng Lian
> Priority: Blocker
> Fix For: 1.2.1, 1.3.0
>
>
> When the input data has a complex structure, using the same expression in the GROUP BY clause and the SELECT clause throws "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
> Subquery people
> LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
> {code}
> The bug is in the equality test between `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`: the extra Alias node makes the two expressions compare as unequal even though they denote the same computation.
> Maybe Spark SQL needs a mechanism to compare an Alias expression against a non-Alias expression.
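The alias-sensitivity described above can be illustrated with a toy expression tree. This is a sketch of the general idea only, not Catalyst's actual code: plain structural equality fails once one side carries an Alias wrapper, while a comparison that strips aliases first succeeds.

```python
# Toy illustration (not Spark/Catalyst source): why an aliased expression
# fails a structural equality check, and how alias-stripping fixes it.
from dataclasses import dataclass

@dataclass(frozen=True)
class Attr:            # a column reference, e.g. birthday.date
    name: str

@dataclass(frozen=True)
class Alias:           # wraps an expression with a new name, e.g. `AS date#9`
    child: object
    name: str

@dataclass(frozen=True)
class Upper:           # a function applied to a child expression
    child: object

def strip_aliases(expr):
    """Recursively drop Alias wrappers so only the computation remains."""
    if isinstance(expr, Alias):
        return strip_aliases(expr.child)
    if isinstance(expr, Upper):
        return Upper(strip_aliases(expr.child))
    return expr

group_expr  = Upper(Attr("birthday.date"))                 # GROUP BY side
select_expr = Upper(Alias(Attr("birthday.date"), "date"))  # SELECT side

print(group_expr == select_expr)                                 # False
print(strip_aliases(group_expr) == strip_aliases(select_expr))   # True
```

In this toy model the two trees differ only by the Alias node, which is exactly the shape of the mismatch in the error message above.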
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org