Posted to user@spark.apache.org by Viktor ARDELEAN <vi...@gmail.com> on 2016/02/10 08:44:15 UTC

Pyspark - how to use UDFs with dataframe groupby

Hello,

I am using the following transformations on an RDD:

rddAgg = df.map(lambda l: (Row(a=l.a, b=l.b, c=l.c), l))\
           .aggregateByKey([],
                           lambda accumulatorList, value: accumulatorList + [value],
                           lambda list1, list2: list1 + list2)

I want to use the DataFrame groupBy + agg transformation instead of
map + aggregateByKey because, as far as I know, DataFrame
transformations are faster than RDD transformations.

I just can't figure out how to use custom aggregate functions with agg.

*First step is clear:*

groupedData = df.groupBy("a","b","c")

*Second step is not very clear to me:*

dfAgg = groupedData.agg(<should I call a UDF here that turns each group's
rows into a list and merges them?>)

The agg documentation says the following:
agg(*exprs)
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.GroupedData.agg>

Compute aggregates and returns the result as a DataFrame
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.DataFrame>.

The available aggregate functions are avg, max, min, sum, count.

If exprs is a single dict mapping from string to string, then the key is
the column to perform aggregation on, and the value is the aggregate
function.

Alternatively, exprs can also be a list of aggregate Column
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.Column>
expressions.

Parameters: exprs – a dict mapping from column name (string) to aggregate
functions (string), or a list of Column
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html?highlight=min#pyspark.sql.Column>.
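
For reference, a minimal sketch of the two documented calling conventions
with the built-in aggregates ("d" and "e" below are placeholder value
columns for illustration, not columns from my data):

from pyspark.sql import functions as F

# Dict form: maps a column name to the name of a built-in aggregate function.
# "d" and "e" are hypothetical value columns used only for illustration.
dfAggDict = df.groupBy("a", "b", "c").agg({"d": "sum", "e": "max"})

# Column-expression form: a list of aggregate Column expressions.
dfAggExpr = df.groupBy("a", "b", "c").agg(
    F.sum("d").alias("sum_d"),
    F.max("e").alias("max_e"))

Neither form seems to let me plug in my own aggregation logic, which is
exactly what I am missing.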

Thanks for the help!
-- 
Viktor


Re: Pyspark - how to use UDFs with dataframe groupby

Posted by Davies Liu <da...@databricks.com>.
Short answer: PySpark does not support UDAFs (user-defined aggregate
functions) for now.
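
One common workaround (assuming Spark 1.6+, where collect_list is exposed
in pyspark.sql.functions; in the 1.x line it is backed by Hive, so it
generally needs a HiveContext) is to gather the grouped values with the
built-in collect_list instead of a custom UDAF. A rough sketch, with "d"
standing in for whatever value column you want to collect:

from pyspark.sql import functions as F

# collect_list gathers all values of a column within each group into a list.
# "d" is a hypothetical value column used only for illustration; in Spark 1.x
# this aggregate is Hive-backed, so the DataFrame should come from a HiveContext.
dfAgg = df.groupBy("a", "b", "c").agg(F.collect_list("d").alias("d_values"))

Otherwise, the map + aggregateByKey version from the original mail remains
the usual fallback for building per-key lists.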
