Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/30 23:52:42 UTC

[GitHub] [spark] ueshin opened a new pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

ueshin opened a new pull request #34160:
URL: https://github.com/apache/spark/pull/34160


   ### What changes were proposed in this pull request?
   
   Fix `DataFrameGroupBy.apply` without shortcut.
   
   ### Why are the changes needed?
   
   `DataFrameGroupBy.apply` without the shortcut path could raise an exception when the applied function returns a `Series`.
   
   ```py
   >>> ps.options.compute.shortcut_limit = 3
   >>> psdf = ps.DataFrame(
   ...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
   ...     columns=["a", "b", "c"],
   ... )
   >>> psdf.groupby("b").apply(lambda x: x["a"])
   org.apache.spark.api.python.PythonException: Traceback (most recent call last):
   ...
   ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   The error above will be gone:
   
   ```py
   >>> psdf.groupby("b").apply(lambda x: x["a"])
   b
   1  0    1
      1    2
   2  2    3
   3  3    4
   5  4    5
   8  5    6
   Name: a, dtype: int64
   ```
   
   ### How was this patch tested?
   
   Added tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932534562


   **[Test build #143791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143791/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931814109


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48287/
   




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931798893


   **[Test build #143776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143776/testReport)** for PR 34160 at commit [`c599b19`](https://github.com/apache/spark/commit/c599b19d46dfddcb190073fc0c1f4133ecf2c302).




[GitHub] [spark] HyukjinKwon closed pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34160:
URL: https://github.com/apache/spark/pull/34160


   




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932606395


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48304/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932631982


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48304/
   




[GitHub] [spark] HyukjinKwon commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932855671


   Merged to master and branch-3.2.




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931828791


   **[Test build #143776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143776/testReport)** for PR 34160 at commit [`c599b19`](https://github.com/apache/spark/commit/c599b19d46dfddcb190073fc0c1f4133ecf2c302).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932586093


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143791/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932586093


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143791/
   




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932598499


   **[Test build #143792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143792/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931829440


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48287/
   




[GitHub] [spark] ueshin commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
ueshin commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932555518


   Jenkins, retest this please.




[GitHub] [spark] AmplabJenkins commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932612451








[GitHub] [spark] ueshin commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r720524733



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1207,17 +1208,24 @@ def pandas_apply(pdf: pd.DataFrame, *a: Any, **k: Any) -> Any:
                 pdf[groupkey_name].rename(psser.name)
                 for groupkey_name, psser in zip(groupkey_names, self._groupkeys)
             ]
+            grouped = pdf.groupby(groupkeys)
             if is_series_groupby:
-                pser_or_pdf = pdf.groupby(groupkeys)[name].apply(pandas_apply, *args, **kwargs)
+                pser_or_pdf = grouped[name].apply(pandas_apply, *args, **kwargs)
             else:
-                pser_or_pdf = pdf.groupby(groupkeys).apply(pandas_apply, *args, **kwargs)
+                pser_or_pdf = grouped.apply(pandas_apply, *args, **kwargs)
             psser_or_psdf = ps.from_pandas(pser_or_pdf)
 
             if len(pdf) <= limit:
                 if isinstance(psser_or_psdf, ps.Series) and is_series_groupby:
                     psser_or_psdf = psser_or_psdf.rename(cast(SeriesGroupBy, self)._psser.name)
                 return cast(Union[Series, DataFrame], psser_or_psdf)
 
+            if len(grouped) <= 1:

Review comment:
       Thanks for the suggestion, but the shortcut path, which returns at line 1221, is taken only when all the data in the DataFrame has been collected and handled properly by pandas, so we don't need to show any warnings there.
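
To make the shortcut path concrete, here is a minimal sketch in plain pandas. It assumes the caller samples at most `limit + 1` rows up front, which is a simplification of what the diff above does; `apply_with_shortcut` is a hypothetical helper and not part of pyspark.pandas.

```python
import pandas as pd


def apply_with_shortcut(pdf, key, func, limit):
    """Hypothetical helper: decide between the shortcut and non-shortcut paths."""
    # Assumption: only the first `limit + 1` rows are inspected locally.
    sample = pdf.head(limit + 1)
    if len(sample) <= limit:
        # All rows fit under the limit, so pandas has seen the complete data
        # and its result can be returned as-is.
        return "shortcut", sample.groupby(key).apply(func)
    # Otherwise the sample is incomplete and the real work must be distributed.
    return "non-shortcut", None


pdf = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8]})

path_small, result_small = apply_with_shortcut(pdf, "b", lambda g: g["a"].sum(), limit=10)
path_large, _ = apply_with_shortcut(pdf, "b", lambda g: g["a"].sum(), limit=3)
```

With six rows, `limit=10` takes the shortcut and `limit=3` does not, which mirrors the `ps.options.compute.shortcut_limit = 3` setting used in the PR description to force the non-shortcut path.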






[GitHub] [spark] ueshin commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
ueshin commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931823463


   cc @HyukjinKwon @xinrong-databricks @itholic 




[GitHub] [spark] SparkQA removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932586674


   **[Test build #143792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143792/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932620817


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48304/
   




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932580651


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48303/
   




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932534562


   **[Test build #143791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143791/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932586674


   **[Test build #143792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143792/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).




[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932607673


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48303/
   




[GitHub] [spark] ueshin commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r720522774



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1295,6 +1303,8 @@ def wrapped_func(
                 pdf_or_ser = pdf.groupby(groupkey_names)[name].apply(wrapped_func, *args, **kwargs)
             else:
                 pdf_or_ser = pdf.groupby(groupkey_names).apply(wrapped_func, *args, **kwargs)
+                if should_return_series and isinstance(pdf_or_ser, pd.DataFrame):

Review comment:
       No, I don't think it's related.
   
   pandas' `DataFrameGroupBy.apply` behaves inconsistently when the UDF returns a `Series`: the shape of the result depends on whether there is only one group or more than one. For example:
   
   ```py
   >>> pdf = pd.DataFrame(
   ...      {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
   ...      columns=["a", "b", "c"],
   ... )
   
   >>> pdf.groupby('b').apply(lambda x: x['a'])
   b
   1  0    1
      1    2
   2  2    3
   3  3    4
   5  4    5
   8  5    6
   Name: a, dtype: int64
   >>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
   a  0  1
   b
   1  1  2
   ```
   
   As you can see, if there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.
   
   In our non-shortcut path, there is always exactly one group per call because the function is run through `groupby-applyInPandas`, so we always get a `DataFrame` back and have to convert it to a `Series` ourselves.
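
The conversion described above can be sketched in plain pandas. This is a minimal illustration, not the code in the patch: the wide frame is built by hand so the sketch does not depend on how a given pandas version shapes the single-group result, and `stack()` is assumed here as one reasonable way to fold it back into a `Series`.

```python
import pandas as pd

# The "wide" frame from the single-group example: one row per group key `b`,
# one column per original row label, and the column axis named after `a`.
wide = pd.DataFrame(
    [[1, 2]],
    index=pd.Index([1], name="b"),
    columns=pd.Index([0, 1], name="a"),
)

# stack() moves the column labels into an inner index level, giving a Series
# indexed by (group key, original row label).
series = wide.stack()

# Restore the naming used in the multi-group result: the inner level is
# unnamed and the Series itself carries the column name.
series.index.names = ["b", None]
series.name = "a"
```

The resulting `series` holds the values 1 and 2 under the index entries `(1, 0)` and `(1, 1)`, which matches the first two rows of the multi-group output shown earlier.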






[GitHub] [spark] SparkQA commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932544663


   **[Test build #143791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143791/testReport)** for PR 34160 at commit [`cbd5c1c`](https://github.com/apache/spark/commit/cbd5c1cd3cbd667e5aa8344bde1c4959065d6f1e).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] ueshin commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r720525599



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1295,6 +1303,8 @@ def wrapped_func(
                 pdf_or_ser = pdf.groupby(groupkey_names)[name].apply(wrapped_func, *args, **kwargs)
             else:
                 pdf_or_ser = pdf.groupby(groupkey_names).apply(wrapped_func, *args, **kwargs)
+                if should_return_series and isinstance(pdf_or_ser, pd.DataFrame):

Review comment:
       Also updated the PR description.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932612451








[GitHub] [spark] AmplabJenkins commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-932631982


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48304/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r720618110



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1295,6 +1303,8 @@ def wrapped_func(
                 pdf_or_ser = pdf.groupby(groupkey_names)[name].apply(wrapped_func, *args, **kwargs)
             else:
                 pdf_or_ser = pdf.groupby(groupkey_names).apply(wrapped_func, *args, **kwargs)
+                if should_return_series and isinstance(pdf_or_ser, pd.DataFrame):

Review comment:
       Okay, LGTM!






[GitHub] [spark] srowen commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r720283715



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1207,17 +1208,24 @@ def pandas_apply(pdf: pd.DataFrame, *a: Any, **k: Any) -> Any:
                 pdf[groupkey_name].rename(psser.name)
                 for groupkey_name, psser in zip(groupkey_names, self._groupkeys)
             ]
+            grouped = pdf.groupby(groupkeys)
             if is_series_groupby:
-                pser_or_pdf = pdf.groupby(groupkeys)[name].apply(pandas_apply, *args, **kwargs)
+                pser_or_pdf = grouped[name].apply(pandas_apply, *args, **kwargs)
             else:
-                pser_or_pdf = pdf.groupby(groupkeys).apply(pandas_apply, *args, **kwargs)
+                pser_or_pdf = grouped.apply(pandas_apply, *args, **kwargs)
             psser_or_psdf = ps.from_pandas(pser_or_pdf)
 
             if len(pdf) <= limit:
                 if isinstance(psser_or_psdf, ps.Series) and is_series_groupby:
                     psser_or_psdf = psser_or_psdf.rename(cast(SeriesGroupBy, self)._psser.name)
                 return cast(Union[Series, DataFrame], psser_or_psdf)
 
+            if len(grouped) <= 1:

Review comment:
       I don't know enough about this to judge, but do you want to check this right after line 1211 above? Or maybe it doesn't matter.






[GitHub] [spark] AmplabJenkins commented on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931844060








[GitHub] [spark] AmplabJenkins removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931844059








[GitHub] [spark] SparkQA removed a comment on pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34160:
URL: https://github.com/apache/spark/pull/34160#issuecomment-931798893


   **[Test build #143776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143776/testReport)** for PR 34160 at commit [`c599b19`](https://github.com/apache/spark/commit/c599b19d46dfddcb190073fc0c1f4133ecf2c302).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #34160: [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34160:
URL: https://github.com/apache/spark/pull/34160#discussion_r719961791



##########
File path: python/pyspark/pandas/groupby.py
##########
@@ -1295,6 +1303,8 @@ def wrapped_func(
                 pdf_or_ser = pdf.groupby(groupkey_names)[name].apply(wrapped_func, *args, **kwargs)
             else:
                 pdf_or_ser = pdf.groupby(groupkey_names).apply(wrapped_func, *args, **kwargs)
+                if should_return_series and isinstance(pdf_or_ser, pd.DataFrame):

Review comment:
       Looks all good; one question: does the function return a DataFrame (instead of a Series) because we drop the grouping keys above (https://github.com/apache/spark/pull/34160/files#diff-87acf2f7c70b4a2eeac5da33e34d1ae85951d400ace08bff6c09126a0a6431d9R1199)?
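
For context on the question above, here is a minimal plain-pandas sketch (not the PR's code) of one situation in which `groupby(...).apply(...)` can hand back a `DataFrame` even though the applied function returns a `Series`: when there is only a single group, pandas stacks the returned `Series` as a row instead of concatenating it. The frame `pdf` and the helper `pick_a` are hypothetical names for illustration, and the `stack()` call at the end is only one possible way to normalize the result, assumed here rather than taken from the patch. Whether this is the cause in the code under review is not confirmed by this thread.

```python
import warnings

import pandas as pd

# Hypothetical small frame; "b" is the grouping key.
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [1, 1, 2]})


def pick_a(group):
    # The applied function always returns a Series.
    return group["a"]


with warnings.catch_warnings():
    # Some pandas versions warn about grouping columns inside apply.
    warnings.simplefilter("ignore")
    # Two groups whose indexes differ: results are concatenated into a Series.
    multi = pdf.groupby("b").apply(pick_a)
    # A single group: the returned Series is laid out as one row of a DataFrame.
    single = pdf[pdf["b"] == 1].groupby("b").apply(pick_a)

print(type(multi).__name__)
print(type(single).__name__)

# One possible normalization (an assumption, not necessarily the PR's approach):
# stack the columns back into the index to obtain a Series again.
normalized = single.stack()
print(type(normalized).__name__)
```

The type of the result therefore depends on how many groups a given batch of rows happens to contain, which is relevant when each batch is processed independently.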



