Posted to user@spark.apache.org by the3rdNotch <st...@notch.bz> on 2015/08/03 16:30:35 UTC
How to calculate standard deviation of grouped data in a DataFrame?
I have user logs that I took from a CSV file and converted into a DataFrame
in order to leverage the SparkSQL querying features. A single user creates
numerous entries per hour, and I would like to gather some basic
statistics for each user: really just the count of user instances, plus
the average and the standard deviation of several columns. I was able to
quickly get the mean and count by using groupBy($"user") and the
aggregator with the SparkSQL functions count and avg:
    val meanData = selectedData.groupBy($"user").agg(count($"logOn"),
      avg($"transaction"), avg($"submit"), avg($"submitsPerHour"),
      avg($"replies"), avg($"repliesPerHour"), avg($"duration"))
However, I cannot seem to find an equally elegant way to calculate the
standard deviation. So far I can only compute it by mapping to (String,
Double) pairs and using the StatCounter().stdev utility:
    val stdevduration = duration.groupByKey().mapValues(value =>
      org.apache.spark.util.StatCounter(value).stdev)
This returns an RDD, however, and I would like to keep everything in a
DataFrame so that further queries remain possible on the returned data.
Is there a similarly simple method for calculating the standard deviation,
like there is for the mean and count?
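For what it's worth, one workaround I have been considering (assuming a Spark version without a built-in stddev aggregate) is to expand the definition of the population standard deviation into plain averages, since sqrt(E[x^2] - E[x]^2) only needs avg. A minimal sketch in plain Scala (no Spark required) checking that the two forms agree:

```scala
object StdevIdentity {
  // Population standard deviation, direct definition:
  // square root of the mean of squared deviations from the mean.
  def stdevDirect(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.length)
  }

  // Same quantity via the identity sqrt(E[x^2] - E[x]^2):
  // only averages of x and x*x are needed, which is what makes it
  // expressible with the avg aggregate alone.
  def stdevViaAvg(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.length
    val meanOfSquares = xs.map(x => x * x).sum / xs.length
    math.sqrt(meanOfSquares - mean * mean)
  }
}
```

In DataFrame terms that identity would correspond to an agg expression along the lines of sqrt(avg($"duration" * $"duration") - avg($"duration") * avg($"duration")) (the column name and exact expression are illustrative, not something I have verified against a cluster). Note this gives the population stddev; the sample version needs a count-based correction factor.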
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-standard-deviation-of-grouped-data-in-a-DataFrame-tp24114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org