Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/07/28 19:04:04 UTC

[jira] [Updated] (SPARK-9338) Aliases from SELECT not available in GROUP BY

     [ https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-9338:
------------------------------------
    Issue Type: New Feature  (was: Bug)

> Aliases from SELECT not available in GROUP BY
> ---------------------------------------------
>
>                 Key: SPARK-9338
>                 URL: https://issues.apache.org/jira/browse/SPARK-9338
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 1.4.1
>         Environment: Observed on Mac OS X and Ubuntu 14.04
>            Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to find any mailing list or JIRA tickets for exactly this. There are a few closed/resolved tickets about specific types of exceptions, but I couldn't find this exact problem, so apologies if this is a dupe!
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the GROUP BY part of the query. This is confusing our analysts, as it works in most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema =
>   StructType(
>     StructField("x", IntegerType, nullable=false) ::
>     StructField("y",
>       StructType(StructField("a", DoubleType, nullable=false) :: Nil),
>       nullable=false) :: Nil)
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) :: Row(4, Row(2.5)) :: Nil)
> val df = sqlContext.createDataFrame(rdd, schema)
> // DataFrame content looks like this:
> // x   y
> // 1   {a: 1.0}
> // 2   {a: 1.34}
> // 3   {a: 2.3}
> // 4   {a: 2.5}
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y
> sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given input columns x, y;
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, it's particularly inconvenient when using UDFs on nested fields, as it means repeating some potentially complex expressions. It's very common for us to want to make a date type conversion (from epoch milliseconds or something) on some nested field, then reference it in multiple places in the query. With this issue, that makes for quite verbose queries.
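> One workaround we've considered (an untested sketch on our side) is to push the aliased expression into a subquery in the FROM clause, so the outer GROUP BY sees it as a plain column:
> {code}
> // hypothetical rewrite of the failing query; the inner SELECT materialises the alias
> sqlContext.sql("""
>   SELECT grp, SUM(x) AS s
>   FROM (SELECT roundToInt(y.a) AS grp, x FROM test_data) tmp
>   GROUP BY grp
> """).show()
> {code}
> That avoids repeating the expression, but it's still more ceremony than just referencing the alias directly.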
> Might it also mean that we're mapping these functions over the data twice? I can't quite tell from the explain output whether that's been optimised out or not, but here it is for somebody who understands it better :-)
> {code}
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS grp#116,CombineSum(PartialSum#125L) AS s#117L]
> // Exchange (HashPartitioning 200)
> // Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> // PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at <console>:31
> {code}
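> For what it's worth, a rough way to check the double-evaluation question empirically (a sketch only; accumulator counts are approximate if tasks get retried) would be to count UDF invocations with a hypothetical counting variant of the UDF:
> {code}
> // hypothetical check: compare the final count with the number of input rows (4 here)
> val calls = sc.accumulator(0)
> sqlContext.udf.register("roundToIntCounted", (x: Double) => { calls += 1; x.toInt })
> sqlContext.sql("SELECT roundToIntCounted(y.a) as grp, SUM(x) as s FROM test_data GROUP BY roundToIntCounted(y.a)").show()
> println(calls.value)  // ~4 if evaluated once per row, ~8 if evaluated twice
> {code}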


