Posted to commits@spark.apache.org by gu...@apache.org on 2022/11/08 11:17:43 UTC

[spark] branch master updated: [SPARK-41014][PYTHON][DOC] Improve documentation and typing of groupby and cogroup applyInPandas

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0eaa8e1e76a [SPARK-41014][PYTHON][DOC] Improve documentation and typing of groupby and cogroup applyInPandas
0eaa8e1e76a is described below

commit 0eaa8e1e76ab6ecdd3b51d751857e50530ccdeb6
Author: Enrico Minack <gi...@enrico.minack.dev>
AuthorDate: Tue Nov 8 20:17:31 2022 +0900

    [SPARK-41014][PYTHON][DOC] Improve documentation and typing of groupby and cogroup applyInPandas
    
    ### What changes were proposed in this pull request?
    The documentation of the `applyInPandas` method for groupby and cogroup should mention in the main description that there are two allowed signatures for the provided function. Currently, only the Examples section shows that this is possible, and only the Parameters section of cogroup states it.
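    
    As a minimal, self-contained illustration (function names, sample data and schemas below are illustrative, not part of this commit), the two signatures accepted by groupby `applyInPandas` look roughly like this:
    
        from pyspark.sql import SparkSession
        import pandas as pd
        
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))
        
        def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
            # One-argument form: receives each group's rows as a pandas.DataFrame.
            return pdf.assign(v=pdf.v - pdf.v.mean())
        
        def subtract_mean_with_key(key, pdf: pd.DataFrame) -> pd.DataFrame:
            # Two-argument form: additionally receives the grouping key(s) as a tuple.
            return pdf.assign(v=pdf.v - pdf.v.mean())
        
        df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
        df.groupby("id").applyInPandas(subtract_mean_with_key, schema="id long, v double").show()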
    
    The type information for `PandasCogroupedMapFunction` should also mention the three-argument callable alternative.
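    
    Likewise, the two callable shapes covered by `PandasCogroupedMapFunction` look roughly like this (a sketch reusing the `spark` session from the example above; names and schemas are illustrative):
    
        df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ("id", "v1"))
        df2 = spark.createDataFrame([(1, 3.0), (2, 4.0)], ("id", "v2"))
        
        def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
            # Two-argument form: one pandas.DataFrame per side of the cogroup.
            return pd.merge(left, right, on="id")
        
        def merge_with_key(key, left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
            # Three-argument form: the grouping key(s) as a tuple, then both pandas.DataFrames.
            return pd.merge(left, right, on="id")
        
        cg = df1.groupby("id").cogroup(df2.groupby("id"))
        cg.applyInPandas(merge, schema="id long, v1 double, v2 double").show()
        cg.applyInPandas(merge_with_key, schema="id long, v1 double, v2 double").show()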
    
    ### Why are the changes needed?
    Better documentation of these alternatives makes this piece of information quicker to find.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Untested
    
    Closes #38509 from EnricoMi/branch-pyspark-group-and-cogroup-doc-and-typing.
    
    Authored-by: Enrico Minack <gi...@enrico.minack.dev>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/sql/pandas/_typing/__init__.pyi |  5 ++++-
 python/pyspark/sql/pandas/group_ops.py         | 15 ++++++++++-----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/pandas/_typing/__init__.pyi b/python/pyspark/sql/pandas/_typing/__init__.pyi
index acca8c00f2a..69279727ca9 100644
--- a/python/pyspark/sql/pandas/_typing/__init__.pyi
+++ b/python/pyspark/sql/pandas/_typing/__init__.pyi
@@ -336,6 +336,9 @@ PandasMapIterFunction = Callable[[Iterable[DataFrameLike]], Iterable[DataFrameLi
 
 ArrowMapIterFunction = Callable[[Iterable[pyarrow.RecordBatch]], Iterable[pyarrow.RecordBatch]]
 
-PandasCogroupedMapFunction = Callable[[DataFrameLike, DataFrameLike], DataFrameLike]
+PandasCogroupedMapFunction = Union[
+    Callable[[DataFrameLike, DataFrameLike], DataFrameLike],
+    Callable[[Any, DataFrameLike, DataFrameLike], DataFrameLike],
+]
 
 GroupedMapPandasUserDefinedFunction = NewType("GroupedMapPandasUserDefinedFunction", FunctionType)
diff --git a/python/pyspark/sql/pandas/group_ops.py b/python/pyspark/sql/pandas/group_ops.py
index c34a285144f..bca96eaf205 100644
--- a/python/pyspark/sql/pandas/group_ops.py
+++ b/python/pyspark/sql/pandas/group_ops.py
@@ -114,7 +114,9 @@ class PandasGroupedOpsMixin:
         as a `DataFrame`.
 
         The function should take a `pandas.DataFrame` and return another
-        `pandas.DataFrame`. For each group, all columns are passed together as a `pandas.DataFrame`
+        `pandas.DataFrame`. Alternatively, the user can pass a function that takes
+        a tuple of the grouping key(s) and a `pandas.DataFrame`.
+        For each group, all columns are passed together as a `pandas.DataFrame`
         to the user-function and the returned `pandas.DataFrame` are combined as a
         :class:`DataFrame`.
 
@@ -129,8 +131,9 @@ class PandasGroupedOpsMixin:
         Parameters
         ----------
         func : function
-            a Python native function that takes a `pandas.DataFrame`, and outputs a
-            `pandas.DataFrame`.
+            a Python native function that takes a `pandas.DataFrame` and outputs a
+            `pandas.DataFrame`, or that takes one tuple (grouping keys) and a
+            `pandas.DataFrame` and outputs a `pandas.DataFrame`.
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
@@ -377,7 +380,9 @@ class PandasCogroupedOps:
         as a `DataFrame`.
 
         The function should take two `pandas.DataFrame`\\s and return another
-        `pandas.DataFrame`.  For each side of the cogroup, all columns are passed together as a
+        `pandas.DataFrame`. Alternatively, the user can pass a function that takes
+        a tuple of the grouping key(s) and the two `pandas.DataFrame`\\s.
+        For each side of the cogroup, all columns are passed together as a
         `pandas.DataFrame` to the user-function and the returned `pandas.DataFrame` are combined as
         a :class:`DataFrame`.
 
@@ -394,7 +399,7 @@ class PandasCogroupedOps:
         func : function
             a Python native function that takes two `pandas.DataFrame`\\s, and
             outputs a `pandas.DataFrame`, or that takes one tuple (grouping keys) and two
-            pandas ``DataFrame``\\s, and outputs a pandas ``DataFrame``.
+            ``pandas.DataFrame``\\s, and outputs a ``pandas.DataFrame``.
         schema : :class:`pyspark.sql.types.DataType` or str
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org