Posted to user@spark.apache.org by Peter Figliozzi <pe...@gmail.com> on 2016/09/27 00:52:50 UTC

median of groups

I'm trying to figure out a nice way to get the median of a DataFrame
column *once it is grouped*.

It's easy enough now to get the min, max, mean, and other aggregates that
are part of org.apache.spark.sql.functions:

df.groupBy("foo", "bar").agg(mean($"column1"))

And it's easy enough to get the median of a column before grouping, using
approxQuantile.
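
For reference, the ungrouped case might look like the sketch below. The
object name, the sample rows and the relativeError value are assumptions
made up for illustration; only approxQuantile and the column names come
from this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WholeColumnMedian {
  // Median of one numeric column over the whole DataFrame (no grouping).
  // approxQuantile returns a plain Array[Double], one entry per requested
  // probability, so we take the first (and only) entry.
  def median(df: DataFrame, col: String, relativeError: Double): Double =
    df.stat.approxQuantile(col, Array(0.5), relativeError).head

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("whole-column-median")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using the column names from the question.
    val df = Seq(
      ("a", "x", 1.0), ("a", "x", 2.0), ("a", "x", 3.0),
      ("b", "y", 10.0), ("b", "y", 20.0)
    ).toDF("foo", "bar", "column1")

    // relativeError = 0.0 asks for an exact quantile, which costs more
    // on large data; a small positive value trades accuracy for speed.
    println(s"median of column1 = ${median(df, "column1", 0.0)}")

    spark.stop()
  }
}
```

The result is a value on the driver rather than a Column, which is why
this call cannot simply be dropped into .agg.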

However, approxQuantile is a method on DataFrame.stat, i.e. on
DataFrameStatFunctions, and it returns plain values rather than a Column.

Is there a way to use it inside the .agg?

Or do we need a user defined aggregation function?

Or some other way?
Stack Overflow version of the question here:
<http://stackoverflow.com/questions/39693730/median-of-groups-in-a-dataframe-spark-2-0>

Thanks,

Pete

Re: median of groups

Posted by ayan guha <gu...@gmail.com>.
I have used the percentile_approx function (with 0.5) from Hive, called
through sqlContext.sql commands.
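
A rough sketch of what that could look like, with some assumptions: the
object name, the temp view name "t" and the sample rows are made up for
illustration, and the sketch uses SparkSession (the Spark 2.x entry
point) in place of sqlContext. The expr variant is not from the post
above; it is one possible way to call the same SQL function from inside
.agg. On Spark 2.0, percentile_approx is, as far as I know, only
available as a Hive UDAF, so the session would need enableHiveSupport();
later versions ship a native implementation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object GroupedMedian {
  // Approximate median per (foo, bar) group through a SQL query.
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("t") // "t" is a hypothetical view name
    spark.sql(
      """SELECT foo, bar, percentile_approx(column1, 0.5) AS median
        |FROM t
        |GROUP BY foo, bar""".stripMargin)
  }

  // Same aggregate, but written inside .agg by parsing a SQL expression.
  def viaAgg(df: DataFrame): DataFrame =
    df.groupBy("foo", "bar")
      .agg(expr("percentile_approx(column1, 0.5)").as("median"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-median")
      // .enableHiveSupport() // assumption: likely required on Spark 2.0
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data using the column names from the question.
    val df = Seq(
      ("a", "x", 1.0), ("a", "x", 2.0), ("a", "x", 3.0),
      ("b", "y", 10.0), ("b", "y", 20.0), ("b", "y", 30.0)
    ).toDF("foo", "bar", "column1")

    viaSql(spark, df).show()
    viaAgg(df).show()

    spark.stop()
  }
}
```

Both variants yield one row per (foo, bar) pair, and the result is an
approximation whose accuracy can be tuned with an optional third
argument to percentile_approx.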

-- 
Best Regards,
Ayan Guha