Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/05 00:32:24 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request, #36452: Adjust `numeric_only` of `GroupBy.mean/median` to match pandas 1.4

xinrong-databricks opened a new pull request, #36452:
URL: https://github.com/apache/spark/pull/36452

   ### What changes were proposed in this pull request?
   This PR proposes to
   - Implement `numeric_only` of `GroupBy.mean`
   - Adjust `numeric_only` of `GroupBy.median`
   
   
   ### Why are the changes needed?
   Improve API compatibility with pandas.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The `numeric_only` parameter behaves the same as in pandas 1.4.
   Take `GroupBy.mean` as an example:
   ```py
   >>> psdf = ps.DataFrame({"A": [1, 2, 1, 2],"B": [3.1, 4.1, 4.1, 3.1],"C": ["a", "b", "b", "a"],"D": [True, False, False, True]})
   >>> psdf
      A    B  C      D
   0  1  3.1  a   True
   1  2  4.1  b  False
   2  1  4.1  b  False
   3  2  3.1  a   True
   
   >>> psdf.groupby('A').mean(numeric_only=False)
   ...: FutureWarning: Dropping invalid columns in GroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
     warnings.warn(
        B    D
   A          
   1  3.6  0.5
   2  3.6  0.5
   
   >>> psdf.groupby('A').mean(numeric_only=None)
   ...: FutureWarning: Dropping invalid columns in GroupBy.mean is deprecated. In a future version, a TypeError will be raised. Before calling .mean, select only columns which should be valid for the function.
     warnings.warn(
        B    D
   A          
   1  3.6  0.5
   2  3.6  0.5
   
   >>> psdf.groupby('A').mean(numeric_only=True)
        B    D
   A          
   1  3.6  0.5
   2  3.6  0.5
   ```
   
   ### How was this patch tested?
   Unit tests.
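   For illustration, a hypothetical regression test along these lines (the test name, data, and assertions are assumptions for this sketch, not the actual tests added in this PR):
   ```py
   import pandas as pd
   import pyspark.pandas as ps
   from pandas.testing import assert_frame_equal


   def test_groupby_mean_numeric_only_matches_pandas():
       pdf = pd.DataFrame(
           {"A": [1, 2, 1, 2], "B": [3.1, 4.1, 4.1, 3.1], "C": ["a", "b", "b", "a"]}
       )
       psdf = ps.from_pandas(pdf)
       # With numeric_only=True the non-numeric column "C" is dropped on both sides.
       expected = pdf.groupby("A").mean(numeric_only=True)
       actual = psdf.groupby("A").mean(numeric_only=True).to_pandas().sort_index()
       assert_frame_equal(expected, actual)
   ```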
   


[GitHub] [spark] Yikun commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r866587031


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   Looks like it is right.
   
   But the difference is that pandas uses [`None` as the default value](https://github.com/pandas-dev/pandas/blob/4bfe3d07b4858144c219b9346329027024102ab6/pandas/core/groupby/groupby.py#L1966) and calls [_resolve_numeric_only](https://github.com/pandas-dev/pandas/blob/4bfe3d07b4858144c219b9346329027024102ab6/pandas/core/groupby/groupby.py#L1207) to resolve it into a concrete value (`True` or `False`).
   
   @xinrong-databricks kept the behavior described in the pandas docs, but could you also take a look at whether we should add similar logic here?
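   For illustration only, a minimal sketch of what that kind of resolution could look like on our side (the helper name and warning text are assumptions modeled on pandas 1.4, not the actual pandas or pandas-on-Spark implementation):
   ```py
   import warnings
   from typing import Optional


   def _resolve_numeric_only(numeric_only: Optional[bool]) -> bool:
       """Resolve a tri-state numeric_only flag into a concrete bool, pandas-1.4 style."""
       if numeric_only is True:
           return True
       # pandas 1.4 warns for False/None on DataFrameGroupBy and then behaves as True.
       warnings.warn(
           "Dropping invalid columns is deprecated. In a future version, "
           "a TypeError will be raised.",
           FutureWarning,
       )
       return True
   ```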



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r868580132


##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -35,6 +35,19 @@
 
 
 class GroupByTest(PandasOnSparkTestCase, TestUtils):
+    def pdf(self):

Review Comment:
   That really makes sense. Thanks for explaining that!



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r867029180


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   pandas uses `None` as the default value and applies further conversions, as below:
   
   - DataFrameGroupBy
   `numeric_only=True/False/None` is converted to `numeric_only=True`
   
   In pandas, when `numeric_only` is `False`/`None`, a FutureWarning is raised and then `numeric_only` is converted to `True`.
   
   **So I think setting `True` as the default value of `numeric_only` in pandas-on-Spark makes sense in this case**.
   
   - SeriesGroupBy
   `numeric_only=None` is converted to `numeric_only=False`
   
   However, an error will be raised if the aggregation column is non-numeric, regardless of the value of `numeric_only`.
   
   **So setting `True` as the default value of `numeric_only` in pandas-on-Spark seems fine**.
   
   What do you think? @HyukjinKwon @Yikun 
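   As a rough illustration of the two cases above (a hypothetical snippet assuming a working `pyspark.pandas` session, not code from this PR):
   ```py
   import pandas as pd
   import pyspark.pandas as ps

   pdf = pd.DataFrame(
       {"A": [1, 2, 1, 2], "B": [3.1, 4.1, 4.1, 3.1], "C": ["a", "b", "b", "a"]}
   )
   psdf = ps.from_pandas(pdf)

   # DataFrameGroupBy: False/None emit a FutureWarning and then behave like
   # numeric_only=True, so the non-numeric column "C" is dropped from the result.
   psdf.groupby("A").mean(numeric_only=False)

   # SeriesGroupBy: numeric_only is effectively False; aggregating a non-numeric
   # column (e.g. "C") would raise regardless of the flag.
   psdf.groupby("A")["B"].median()
   ```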



[GitHub] [spark] Yikun commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r867292067


##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -35,6 +35,19 @@
 
 
 class GroupByTest(PandasOnSparkTestCase, TestUtils):
+    def pdf(self):
+        return pd.DataFrame(
+            {
+                "A": [1, 2, 1, 2],
+                "B": [3.1, 4.1, 4.1, 3.1],
+                "C": ["a", "b", "b", "a"],
+                "D": [True, False, False, True],
+            }
+        )
+
+    def psdf(self):

Review Comment:
   same



##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -35,6 +35,19 @@
 
 
 class GroupByTest(PandasOnSparkTestCase, TestUtils):
+    def pdf(self):

Review Comment:
   nit: let's add `@property` for `pdf` and `psdf` (a sketch is below) to:
   1. Simplify pdf init, e.g. `x = self.pdf` instead of `x = self.pdf()`
   2. Keep the pdf/psdf property style consistent with other tests
   3. Make pdf/psdf read-only, so accidentally trying to set pdf raises `AttributeError: can't set attribute`
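   A minimal sketch of that property style (assuming the test module's existing imports for `PandasOnSparkTestCase` and `TestUtils`):
   ```py
   import pandas as pd
   import pyspark.pandas as ps


   class GroupByTest(PandasOnSparkTestCase, TestUtils):
       @property
       def pdf(self):
           return pd.DataFrame(
               {
                   "A": [1, 2, 1, 2],
                   "B": [3.1, 4.1, 4.1, 3.1],
                   "C": ["a", "b", "b", "a"],
                   "D": [True, False, False, True],
               }
           )

       @property
       def psdf(self):
           # Derive the pandas-on-Spark frame from the pandas one so the two stay in sync.
           return ps.from_pandas(self.pdf)
   ```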



##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   I think it's OK for me, and since you have added a regression test for `SeriesGroupBy`, we can easily catch this change if pandas changes this logic.



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r866431698


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   Optional means that it can be `None` IIRC. Is that correct?
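   For reference, a quick typing check (plain Python, nothing specific to this PR):
   ```py
   from typing import Optional, Union

   # Optional[bool] is shorthand for Union[bool, None]; it only widens the accepted
   # type and says nothing about the default value.
   assert Optional[bool] == Union[bool, None]
   ```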



[GitHub] [spark] HyukjinKwon closed pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4
URL: https://github.com/apache/spark/pull/36452


[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r868581909


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   Sorry, I edited the explanation for SeriesGroupBy. Let me know if you have any other concerns :).



[GitHub] [spark] HyukjinKwon commented on pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36452:
URL: https://github.com/apache/spark/pull/36452#issuecomment-1123028910

   Merged to master.


[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r867029180


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   pandas uses `None` as the default value and applies further conversions, as below:
   
   - DataFrameGroupBy
   `numeric_only=True/False/None` is converted to `numeric_only=True`
   
   In pandas, when `numeric_only` is `False`/`None`, a FutureWarning is raised and then `numeric_only` is converted to `True`.
   
   **So I think setting `True` as the default value of `numeric_only` in pandas-on-Spark makes sense in this case**.
   
   - SeriesGroupBy
   `numeric_only=None` is converted to `numeric_only=False`
   
   **So setting `False` as the default value of `numeric_only` in pandas-on-Spark seems fine**.
   
   What do you think? @HyukjinKwon @Yikun 




[GitHub] [spark] xinrong-databricks commented on pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36452:
URL: https://github.com/apache/spark/pull/36452#issuecomment-1119173582

   @ueshin @HyukjinKwon @itholic @Yikun May I get a review? Thank you!


[GitHub] [spark] Yikun commented on a diff in pull request #36452: [SPARK-39109][PYTHON] Adjust `GroupBy.mean/median` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #36452:
URL: https://github.com/apache/spark/pull/36452#discussion_r867292558


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2673,7 +2682,7 @@ def get_group(self, name: Union[Name, List[Name]]) -> FrameLike:
 
         return self._cleanup_and_return(DataFrame(internal))
 
-    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> FrameLike:
+    def median(self, numeric_only: Optional[bool] = True, accuracy: int = 10000) -> FrameLike:

Review Comment:
   It's OK for me, and since we have already added a regression test for `SeriesGroupBy`, we can easily catch such logic changes in the future. Thanks for the explanation!


