Posted to issues@spark.apache.org by "Dean Wampler (JIRA)" <ji...@apache.org> on 2014/11/23 17:52:12 UTC

[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema

    [ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222409#comment-14222409 ] 

Dean Wampler commented on SPARK-4564:
-------------------------------------

As soon as I reported this, I thought of a workaround: project out the grouping field explicitly by listing it among the aggregate expressions, {{('name, Count('n) as 'count)}}, making the whole expression:

{code}
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
{code}

However, the behavior is still inconsistent with similar methods in RDD and PairRDDFunctions (for example, {{groupBy}} and {{reduceByKey}}), where the grouping key is always part of the output.
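
For comparison, here is a minimal sketch of the RDD-side behavior, assuming a Spark shell session where {{sc}} is already defined and reusing the {{Record}} data from the description. The names {{plain}}, {{byName}}, and {{counts}} are only illustrative.

{code}
// Assumes val sc = new SparkContext(...), e.g., in Spark Shell
import org.apache.spark.SparkContext._ // implicit conversion to PairRDDFunctions

case class Record(name: String, n: Int)
val plain = sc.parallelize(List(
  Record("three",   1),
  Record("three",   2),
  Record("two",     3),
  Record("three",   4),
  Record("two",     5)))

// RDD.groupBy keeps the key: RDD[(String, Iterable[Record])]
val byName = plain.groupBy(_.name)

// PairRDDFunctions.reduceByKey keeps the key too: RDD[(String, Long)]
val counts = plain.map(r => (r.name, 1L)).reduceByKey(_ + _)
counts.collect().sorted.foreach(println)
// (three,3)
// (two,2)
{code}

In both cases the key travels with the grouped or aggregated values without being requested a second time, which is what I expected from {{SchemaRDD.groupBy}} as well.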

> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4564
>                 URL: https://issues.apache.org/jira/browse/SPARK-4564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0
>         Environment: Mac OSX, local mode, but should hold true for all environments
>            Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two fields, the String name and the Long count, but it only contains the Long count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three",   1),
>   Record("three",   2),
>   Record("two",     3),
>   Record("three",   4),
>   Record("two",     5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> //  |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}


