Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/12/19 22:04:13 UTC

[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

    [ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254016#comment-14254016 ] 

Michael Armbrust commented on SPARK-4564:
-----------------------------------------

It is, however, consistent with SQL, where GROUP BY expressions are only included in the output if they also appear in the SELECT clause.  Since the goal here is to provide programmatic SQL, I'm inclined to stick with the current semantics.  Changing this would also be a fairly major breaking change to the API for anyone who depends on the position of columns in the result.
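
In the meantime, repeating the grouping expression in the aggregate list puts it back in the output schema. A sketch against the 1.1 SchemaRDD API, reusing the {{recs}} setup from the description below (untested here):

{code}
// Listing 'name alongside the aggregate keeps it in the result,
// just as SELECT name, COUNT(n) FROM records GROUP BY name would in SQL.
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
grouped.printSchema
// expected:
// root
//  |-- name: string (nullable = true)
//  |-- count: long (nullable = false)
{code}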

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4564
>                 URL: https://issues.apache.org/jira/browse/SPARK-4564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>         Environment: Mac OS X, local mode, but should hold true for all environments
>            Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two fields, the String name and the Long count, but it only contains the Long count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three",   1),
>   Record("three",   2),
>   Record("two",     3),
>   Record("three",   4),
>   Record("two",     5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}


