You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Edgar H <ka...@gmail.com> on 2022/02/28 12:50:35 UTC
[Spark SQL] Null when trying to use corr() with a Window
Morning all, been struggling with this for a while and can't really seem to
understand what I'm doing wrong...
Having the following DataFrame I want to apply the corr function over the
following DF;
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
("group1", "id1", 1, 1, 6),
("group1", "id2", 2, 2, 5),
("group1", "id3", 3, 3, 4),
("group2", "id4", 4, 4, 3),
("group2", "id5", 5, 5, 2),
("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
.createDataFrame(sampleSet)
.toDF(sampleColumns: _*)
----- .show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
.partitionBy("group")
.orderBy("orderCountSum")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val groupedSet = initialSet
.groupBy(
"group"
).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
.withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
----- .show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When trying to apply the corr function, some of the resulting values in cf
are null for some reason:
The question is, *how can I apply corr to each of the rows within their
subgroup (Window)?* Would like to obtain the corr value per Row and
subgroup (group1 and group2), and even if more nested IDs were added (group
+ id) it should still work.