Posted to user@spark.apache.org by Peter Figliozzi <pe...@gmail.com> on 2016/09/27 00:52:50 UTC
median of groups
I'm trying to figure out a nice way to get the median of a DataFrame
column *once it is grouped*.
It's easy enough now to get the min, max, mean, and other things that are
part of spark.sql.functions:
df.groupBy("foo", "bar").agg(mean($"column1"))
And it's easy enough to get the median of a column before grouping, using
approxQuantile.
However, approxQuantile is part of DataFrame.stat, i.e. it lives on
DataFrameStatFunctions rather than among the aggregate functions.
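For reference, the ungrouped case Peter mentions looks roughly like this (a sketch; df and column1 are the names from the example above, and the last argument is the allowed relative error):

```scala
// Sketch: median of an *ungrouped* column via DataFrameStatFunctions.
// approxQuantile(col, probabilities, relativeError) returns one value
// per requested probability; 0.5 is the median.
val Array(median) = df.stat.approxQuantile("column1", Array(0.5), 0.001)
```

This returns a plain Array[Double] on the driver, which is exactly why it cannot be dropped into .agg directly.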
Is there a way to use it inside the .agg?
Or do we need a user defined aggregation function?
Or some other way?
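One possible "other way" (a sketch, not a recommendation): when groups are small enough to collect, an exact per-group median can be computed with collect_list plus a UDF. The median logic itself is plain Scala; the Spark wiring shown in comments is hypothetical and assumes an active SparkSession:

```scala
// Exact median of a sequence; assumes xs is non-empty.
def exactMedian(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2)                      // odd count: middle value
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0     // even count: mean of the two middle values
}

// Hypothetical wiring into the grouped query:
//   import org.apache.spark.sql.functions.{udf, collect_list}
//   val medianUdf = udf(exactMedian _)
//   df.groupBy("foo", "bar")
//     .agg(medianUdf(collect_list($"column1".cast("double"))).as("median"))
```

Note that collect_list pulls every value of a group into one row, so this only makes sense for modest group sizes.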
The Stack Overflow version of the question is here:
<http://stackoverflow.com/questions/39693730/median-of-groups-in-a-dataframe-spark-2-0>
Thanks,
Pete
Re: median of groups
Posted by ayan guha <gu...@gmail.com>.
I have used Hive's percentile_approx function (with 0.5) via sqlContext
SQL commands.
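The query ayan describes looks roughly like this (a sketch; "t" is a hypothetical temp-view name for the df above, and percentile_approx requires the session to be built with Hive support):

```scala
// Sketch of the SQL route: percentile_approx is a Hive UDAF,
// callable from Spark SQL, and it *can* be used per group.
df.createOrReplaceTempView("t")
val medians = spark.sql(
  """SELECT foo, bar, percentile_approx(column1, 0.5) AS median
    |FROM t
    |GROUP BY foo, bar""".stripMargin)
```

On Spark 2.0, sqlContext.sql(...) works the same way as spark.sql(...) here.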
--
Best Regards,
Ayan Guha