Posted to reviews@spark.apache.org by icexelloss <gi...@git.apache.org> on 2017/10/02 03:16:27 UTC

[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/18732
  
    @rxin, `transform` takes a function: pd.Series -> pd.Series and applies the function to all columns:
    
    ```
    df.show()
    
     id   v1   v2   v3
      a  1.0  4.0  0.0
      a  2.0  5.0  1.0
      a  3.0  6.0  1.0
    
    df.groupby('id').transform(pandas_udf(lambda v: v - v.mean(), DoubleType())).show()
    
     id    v1    v2         v3
      a  -1.0  -1.0  -0.666667
      a   0.0   0.0   0.333333
      a   1.0   1.0   0.333333
    ```
    
    This mimics `pd.DataFrame.groupby().transform`.
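
    For comparison, here is a minimal pandas-only sketch of the `groupby().transform` behavior being mimicked (the data below just mirrors the example above):

    ```
    import pandas as pd

    pdf = pd.DataFrame({'id': ['a', 'a', 'a'],
                        'v1': [1.0, 2.0, 3.0],
                        'v2': [4.0, 5.0, 6.0],
                        'v3': [0.0, 1.0, 1.0]})

    # transform applies the Series -> Series function to each column of each
    # group and returns a result with the same shape as the input
    pdf.groupby('id').transform(lambda v: v - v.mean())
    ```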
    
    `apply` takes a function: pd.DataFrame -> pd.DataFrame and is similar to `flatMapGroups`.
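
    As a rough pandas-only analogy (not the API proposed in this PR itself), a pd.DataFrame -> pd.DataFrame function under `groupby().apply` could look like this; `subtract_mean` is just an illustrative name:

    ```
    import pandas as pd

    pdf = pd.DataFrame({'id': ['a', 'a', 'a'],
                        'v1': [1.0, 2.0, 3.0],
                        'v2': [4.0, 5.0, 6.0]})

    # apply receives the whole pd.DataFrame for each group and may return a
    # pd.DataFrame of a different shape, similar in spirit to flatMapGroups
    def subtract_mean(group):
        return group.assign(v1=group.v1 - group.v1.mean(),
                            v2=group.v2 - group.v2.mean())

    pdf.groupby('id').apply(subtract_mean)
    ```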
    
    The name `apply` originates from the R paper "The Split-Apply-Combine Strategy for Data Analysis" and is used in both pandas and R to describe this kind of function, so the name `apply` should be pretty straightforward to pandas/Python users.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org