You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Guilherme Beltramini (JIRA)" <ji...@apache.org> on 2019/01/28 13:58:00 UTC
[jira] [Created] (SPARK-26752) Multiple aggregate methods in the
same column in DataFrame
Guilherme Beltramini created SPARK-26752:
--------------------------------------------
Summary: Multiple aggregate methods in the same column in DataFrame
Key: SPARK-26752
URL: https://issues.apache.org/jira/browse/SPARK-26752
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Guilherme Beltramini
The agg function in [org.apache.spark.sql.RelationalGroupedDataset|https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset] accepts as input:
* Column*
* Map[String, String]
* (String, String)*
I'm proposing to add Map[String, Seq[String]], where the keys are the columns to aggregate, and the values are the functions to apply the aggregation.
In the example below (running in spark-shell, with Spark 2.4.0), I'm showing a workaround. What I'm proposing is that agg should accept aggMap as input:
{code:java}
scala> val df = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 10), ("b", 20), ("c", 100)).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
scala> df.show
+----+----+
|col1|col2|
+----+----+
| a| 1|
| a| 2|
| a| 3|
| a| 4|
| b| 10|
| b| 20|
| c| 100|
+----+----+
scala> val aggMap = Map("col1" -> Seq("count"), "col2" -> Seq("min", "max", "mean"))
aggMap: scala.collection.immutable.Map[String,Seq[String]] = Map(col1 -> List(count), col2 -> List(min, max, mean))
scala> val aggSeq = aggMap.toSeq.flatMap{ case (c: String, fns: Seq[String]) => Seq(c).zipAll(fns, c, "") }
aggSeq: Seq[(String, String)] = ArrayBuffer((col1,count), (col2,min), (col2,max), (col2,mean))
scala> val dfAgg = df.groupBy("col1").agg(aggSeq.head, aggSeq.tail: _*)
dfAgg: org.apache.spark.sql.DataFrame = [col1: string, count(col1): bigint ... 3 more fields]
scala> dfAgg.orderBy("col1").show
+----+-----------+---------+---------+---------+
|col1|count(col1)|min(col2)|max(col2)|avg(col2)|
+----+-----------+---------+---------+---------+
| a| 4| 1| 4| 2.5|
| b| 2| 10| 20| 15.0|
| c| 1| 100| 100| 100.0|
+----+-----------+---------+---------+---------+
{code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org