Posted to reviews@spark.apache.org by icexelloss <gi...@git.apache.org> on 2017/10/02 03:16:27 UTC
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@rxin, `transform` takes a function `pd.Series -> pd.Series` and applies it to each column:
```
df.show()
 id   v1   v2   v3
  a  1.0  4.0  0.0
  a  2.0  5.0  1.0
  a  3.0  6.0  1.0

df.groupby('id').transform(pandas_udf(lambda v: v - v.mean(), DoubleType())).show()
 id    v1    v2         v3
  a  -1.0  -1.0  -0.666667
  a   0.0   0.0   0.333333
  a   1.0   1.0   0.333333
```
This mimics `pd.DataFrame.groupby().transform`.
`apply` takes a function `pd.DataFrame -> pd.DataFrame` and is similar to `flatMapGroups`.
The name `apply` originates from the R paper "The Split-Apply-Combine Strategy for Data Analysis" and is used in both pandas and R to describe this operation, so the name `apply` should be fairly intuitive to pandas/R users.
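For comparison, here is a small sketch of the two semantics in plain pandas, using the same data as the example above (the column names and the demeaning function are just illustrations, not part of the proposed Spark API):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "a"],
    "v1": [1.0, 2.0, 3.0],
    "v2": [4.0, 5.0, 6.0],
    "v3": [0.0, 1.0, 1.0],
})

# transform: a Series -> Series function, applied to each non-key column
demeaned = df.groupby("id").transform(lambda v: v - v.mean())

# apply: a DataFrame -> DataFrame function, applied to each group as a whole,
# so the function can see and combine all columns of the group at once
def subtract_group_mean(pdf):
    out = pdf.copy()
    for col in ["v1", "v2", "v3"]:
        out[col] = pdf[col] - pdf[col].mean()
    return out

result = df.groupby("id")[["v1", "v2", "v3"]].apply(subtract_group_mean)
```

Both produce the same demeaned values here; the difference is that `transform` is constrained to column-at-a-time functions, while `apply` hands the whole group DataFrame to the user function.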