Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/09/04 03:09:04 UTC

[GitHub] [spark] itholic opened a new pull request, #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

itholic opened a new pull request, #42788:
URL: https://github.com/apache/spark/pull/42788

   ### What changes were proposed in this pull request?
   
   This PR adds warning messages throughout the Pandas API on Spark wherever the `numeric_only` parameter is used with a different default value.
   
   This is to inform users of the different default behavior in Pandas API on Spark compared to traditional Pandas.
   
   ### Why are the changes needed?
   
   The default behavior of the `numeric_only` parameter in Pandas API on Spark is different from traditional Pandas due to the inability to support mixed data types in a single column.
   
   This could potentially lead to unexpected results for users unfamiliar with this distinction. By providing a clear warning message, we aim to prevent confusion and potential mistakes.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   Yes. Users will now see a warning message when they use functions or methods that involve the `numeric_only` parameter without explicitly setting it. This message informs them of the different default behaviors between Pandas API on Spark and traditional Pandas.
   
   ```python
   >>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": ['a', 'b', 'c']})
   >>> psdf.cov()
   /Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/pandas/frame.py:9051: UserWarning: In Pandas API on Spark, the 'numeric_only' parameter defaults to 'True' if not set, due to the inability to mix different data types in a single column. This behavior might differ from the traditional Pandas behavior where 'numeric_only' defaults to 'False'. Please ensure to set this parameter explicitly to avoid unexpected results.
     warnings.warn(
        A
   A  1.0
   ```
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Manually tested. The existing CI should pass.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on code in PR #42788:
URL: https://github.com/apache/spark/pull/42788#discussion_r1321310948


##########
python/pyspark/pandas/frame.py:
##########
@@ -9029,6 +9047,14 @@ def cov(self, min_periods: Optional[int] = None, ddof: int = 1) -> "DataFrame":
         """
         if not isinstance(ddof, int):
             raise TypeError("ddof must be integer")
+        if numeric_only is None:

Review Comment:
   Yeah, IMHO we should always recommend using the `numeric_only` parameter explicitly to prevent future confusion, because the default value of the `numeric_only` parameter is still different from Pandas even if the result is the same in some cases.
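
A minimal sketch of the kind of guard being discussed (the helper name `resolve_numeric_only` is hypothetical; the actual PR wires the warning directly into each method): warn whenever the caller left `numeric_only` unset, regardless of the column dtypes, then fall back to the pandas-on-Spark default.

```python
import warnings

def resolve_numeric_only(numeric_only):
    """Hypothetical helper: warn if `numeric_only` was not set explicitly,
    even when every column is numeric, then apply the pandas-on-Spark
    default of True."""
    if numeric_only is None:
        warnings.warn(
            "In Pandas API on Spark, the 'numeric_only' parameter defaults "
            "to 'True' if not set, unlike traditional Pandas where it "
            "defaults to 'False'. Please set this parameter explicitly to "
            "avoid unexpected results.",
            UserWarning,
        )
        return True  # fall back to the pandas-on-Spark default
    return numeric_only
```

An explicit `resolve_numeric_only(False)` or `resolve_numeric_only(True)` passes through silently; only the unset (`None`) case triggers the warning.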



##########
python/pyspark/pandas/groupby.py:
##########
@@ -608,18 +608,18 @@ def max(self, numeric_only: Optional[bool] = False, min_count: int = -1) -> Fram
             min_count=min_count,
         )
 
-    def mean(self, numeric_only: Optional[bool] = True) -> FrameLike:
+    def mean(self, numeric_only: Optional[bool] = None) -> FrameLike:
         """
         Compute mean of groups, excluding missing values.
 
         Parameters
         ----------
-        numeric_only : bool, default True
+        numeric_only : bool, default None
             Include only float, int, boolean columns. If None, will attempt to use
             everything, then use only numeric data. False is not supported.
             This parameter is mainly for pandas compatibility.
 
-            .. versionadded:: 3.4.0
+            .. versionchanged:: 4.0.0

Review Comment:
   Makes sense to me. Updated!




[GitHub] [spark] zhengruifeng commented on pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42788:
URL: https://github.com/apache/spark/pull/42788#issuecomment-1714885800

   @itholic I think I'm missing something; I can't understand `the inability to support mixed data types in a single column`:
   why don't we follow the default value (`False`)?
   
   It looks like we can check the schema and filter out non-numeric columns if needed.
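
A pure-Python sketch of the suggestion above (no Spark dependency): inspect the schema and keep only the numeric columns before computing the statistic. Here the schema is modeled as `(name, Spark SQL type name)` pairs; the real implementation would read `DataFrame.schema` and check `isinstance(field.dataType, NumericType)` instead.

```python
# Spark SQL numeric type names (BooleanType is deliberately excluded,
# matching Spark's NumericType hierarchy).
NUMERIC_TYPES = {"byte", "short", "int", "long", "float", "double", "decimal"}

def numeric_columns(schema):
    """Return the names of columns whose type is numeric."""
    return [name for name, dtype in schema if dtype in NUMERIC_TYPES]

schema = [("A", "long"), ("B", "string")]
print(numeric_columns(schema))  # prints ['A']
```

With this filtering in place, a statistic could default to `numeric_only=False` and silently operate on the numeric subset, rather than diverging from the Pandas default.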






[GitHub] [spark] zhengruifeng commented on a diff in pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42788:
URL: https://github.com/apache/spark/pull/42788#discussion_r1320988487


##########
python/pyspark/pandas/frame.py:
##########
@@ -9029,6 +9047,14 @@ def cov(self, min_periods: Optional[int] = None, ddof: int = 1) -> "DataFrame":
         """
         if not isinstance(ddof, int):
             raise TypeError("ddof must be integer")
+        if numeric_only is None:

Review Comment:
   do we need to warn users if all columns are numeric?



##########
python/pyspark/pandas/groupby.py:
##########
@@ -608,18 +608,18 @@ def max(self, numeric_only: Optional[bool] = False, min_count: int = -1) -> Fram
             min_count=min_count,
         )
 
-    def mean(self, numeric_only: Optional[bool] = True) -> FrameLike:
+    def mean(self, numeric_only: Optional[bool] = None) -> FrameLike:
         """
         Compute mean of groups, excluding missing values.
 
         Parameters
         ----------
-        numeric_only : bool, default True
+        numeric_only : bool, default None
             Include only float, int, boolean columns. If None, will attempt to use
             everything, then use only numeric data. False is not supported.
             This parameter is mainly for pandas compatibility.
 
-            .. versionadded:: 3.4.0
+            .. versionchanged:: 4.0.0

Review Comment:
   shall we keep the old `versionadded` and add this new one?
   both directives should document what was changed.
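
A sketch of what the reviewer suggests (the wording of the changelog note is hypothetical): keep the original `versionadded` directive and add a `versionchanged` directive that states what changed.

```
            .. versionadded:: 3.4.0

            .. versionchanged:: 4.0.0
                The default value of ``numeric_only`` changed from
                ``True`` to ``None``.
```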





[GitHub] [spark] itholic commented on pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42788:
URL: https://github.com/apache/spark/pull/42788#issuecomment-1714912254

   So far, we don't follow the Pandas behavior since we couldn't support the object dtype for stat functions in some cases, as below:
   ```python
   # DataFrame
   >>> pdf
      A  B
   0  1  a
   1  2  b
   2  3  c
   
   # Pandas works
   >>> pdf.min(numeric_only=False)
   A    1
   B    a
   dtype: object
   
   # Pandas API on Spark doesn't work
   >>> ps.from_pandas(pdf).min(numeric_only=False)
   ...
   pyarrow.lib.ArrowInvalid: Could not convert 'a' with type str: tried to convert to int64
   ```
   
   But on second thought, this is a bug in the Pandas API on Spark code itself, so we can support `numeric_only=False` as the default by fixing the existing bug.
   
   Let me just close this ticket and change the default value instead.
   
   Thanks for pointing out, @zhengruifeng !




[GitHub] [spark] itholic closed pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic closed pull request #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`
URL: https://github.com/apache/spark/pull/42788

