Posted to issues@spark.apache.org by "Dean Wampler (JIRA)" <ji...@apache.org> on 2014/11/23 17:52:12 UTC
[jira] [Commented] (SPARK-4564) SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
[ https://issues.apache.org/jira/browse/SPARK-4564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222409#comment-14222409 ]
Dean Wampler commented on SPARK-4564:
-------------------------------------
As soon as I reported this, I thought of a workaround: project the grouping field explicitly by including it in the aggregate expressions, {{('name, Count('n) as 'count)}}, making the whole expression:
{code}
val grouped = recs.select('name, 'n).groupBy('name)('name, Count('n) as 'count)
{code}
However, the behavior is still inconsistent with similar methods in RDD and PairRDDFunctions (for example, {{groupBy}} and {{groupByKey}}), where the grouping key is part of each output element.
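For comparison, here is a minimal sketch of the RDD-side behavior referred to above. It assumes the Spark Shell's {{sc}} and reuses the {{Record}} data from the issue description; the names {{rdd}}, {{byName}}, and {{counts}} are only illustrative. In both calls the grouping key comes back as the first element of each output tuple:
{code}
// Assumes val sc = new SparkContext(...), e.g., in Spark Shell
// Brings the implicit conversion to PairRDDFunctions into scope (needed in Spark 1.x).
import org.apache.spark.SparkContext._

case class Record(name: String, n: Int)
val records = List(
  Record("three", 1),
  Record("three", 2),
  Record("two", 3),
  Record("three", 4),
  Record("two", 5))
val rdd = sc.parallelize(records)

// RDD.groupBy keeps the key: each element is (name, Iterable[Record]).
val byName = rdd.groupBy(_.name)

// PairRDDFunctions.reduceByKey keeps the key too: each element is (name, count).
val counts = rdd.map(r => (r.name, 1L)).reduceByKey(_ + _)

// Sort on the driver so the printed order is deterministic.
counts.collect().sortBy(_._1).foreach(println)
// (three,3)
// (two,2)
{code}
With the workaround above, the SchemaRDD result carries the same information: a name column followed by a count column.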
> SchemaRDD.groupBy(groupingExprs)(aggregateExprs) doesn't return the groupingExprs as part of the output schema
> --------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-4564
> URL: https://issues.apache.org/jira/browse/SPARK-4564
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Environment: Mac OSX, local mode, but should hold true for all environments
> Reporter: Dean Wampler
>
> In the following example, I would expect the "grouped" schema to contain two fields, the String name and the Long count, but it only contains the Long count.
> {code}
> // Assumes val sc = new SparkContext(...), e.g., in Spark Shell
> import org.apache.spark.sql.{SQLContext, SchemaRDD}
> import org.apache.spark.sql.catalyst.expressions._
> val sqlc = new SQLContext(sc)
> import sqlc._
> case class Record(name: String, n: Int)
> val records = List(
>   Record("three", 1),
>   Record("three", 2),
>   Record("two", 3),
>   Record("three", 4),
>   Record("two", 5))
> val recs = sc.parallelize(records)
> recs.registerTempTable("records")
> val grouped = recs.select('name, 'n).groupBy('name)(Count('n) as 'count)
> grouped.printSchema
> // root
> // |-- count: long (nullable = false)
> grouped foreach println
> // [2]
> // [3]
> {code}