
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/03 22:05:43 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request, #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

xinrong-databricks opened a new pull request, #36444:
URL: https://github.com/apache/spark/pull/36444

   ### What changes were proposed in this pull request?
   Adjust `GroupBy.std` to match pandas 1.4.
   
   ### Why are the changes needed?
   Improve API compatibility with pandas.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   ```py
   >>> psdf = ps.DataFrame(
   ...             {
   ...                 "A": [1, 2, 1, 2],
   ...                 "B": [3.1, 4.1, 4.1, 3.1],
   ...                 "C": ["a", "b", "b", "a"],
   ...                 "D": [True, False, False, True],
   ...             }
   ...         )
   >>> psdf
      A    B  C      D
   0  1  3.1  a   True
   1  2  4.1  b  False
   2  1  4.1  b  False
   3  2  3.1  a   True
   
   ### Before
   >>> psdf.groupby('A')[['C']].std()
   Empty DataFrame
   Columns: []
   Index: [1, 2]
   
   ### After
   >>> psdf.groupby('A')[['C']].std()
   ...
   TypeError: Unaccepted data types of aggregation columns; numeric or bool expected.
   ```
   
   ### How was this patch tested?
   Unit tests.
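
The "After" behavior can be modelled as a guard that rejects the call when none of the selected columns is numeric or boolean. The following is a minimal sketch in plain Python, not the code in `python/pyspark/pandas/groupby.py`: the classes are hypothetical stand-ins for Spark's data types so that the example runs without pyspark, and `check_agg_columns` is a made-up helper name. Only the error message is taken from the output above.

```python
# Hypothetical stand-ins for Spark data type classes (assumption: used only
# so that this sketch runs without pyspark installed).
class DataType:
    pass


class NumericType(DataType):
    pass


class BooleanType(DataType):
    pass


class StringType(DataType):
    pass


class DoubleType(NumericType):
    pass


def check_agg_columns(data_types):
    """Raise TypeError when no aggregation column is numeric or boolean.

    `check_agg_columns` is an illustrative name, not a pyspark API.
    """
    accepted = (NumericType, BooleanType)
    if not any(isinstance(dt, accepted) for dt in data_types):
        raise TypeError(
            "Unaccepted data types of aggregation columns; numeric or bool expected."
        )


# At least one accepted column: the guard passes silently.
check_agg_columns([DoubleType(), StringType(), BooleanType()])

# Only a string column, as in groupby('A')[['C']]: the guard raises.
try:
    check_agg_columns([StringType()])
except TypeError as exc:
    print(exc)
```

The guard only fires when every selected column is unaccepted, so a selection that mixes string and numeric columns passes this check.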


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4
URL: https://github.com/apache/spark/pull/36444



[GitHub] [spark] xinrong-databricks commented on pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

xinrong-databricks commented on PR #36444:
URL: https://github.com/apache/spark/pull/36444#issuecomment-1116711302

   @ueshin @HyukjinKwon @itholic Thanks!



[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

xinrong-databricks commented on code in PR #36444:
URL: https://github.com/apache/spark/pull/36444#discussion_r864298316


##########
python/pyspark/pandas/groupby.py:
##########
@@ -640,6 +640,17 @@ def std(self, ddof: int = 1) -> FrameLike:
         """
         assert ddof in (0, 1)
 
+        # Raise the TypeError when all aggregation columns are of unaccepted data types
+        all_unaccepted = True
+        for _agg_col in self._agg_columns:
+            if isinstance(_agg_col.spark.data_type, (NumericType, BooleanType)):
+                all_unaccepted = False
+                break
+        if all_unaccepted:
+            raise TypeError(

Review Comment:
   pandas 1.4 behaves as below:
   ```
   >>> pdf = pd.DataFrame(
   ...             {
   ...                 "A": [1, 2, 1, 2],
   ...                 "B": [3.1, 4.1, 4.1, 3.1],
   ...                 "C": ["a", "b", "b", "a"],
   ...                 "D": [True, False, False, True],
   ...             }
   ...         )
   >>> pdf.groupby('A')[['C']].std()
   Traceback (most recent call last):
   ...
   ValueError: could not convert string to float: 'a'
   >>> pdf.groupby('A').std()
             B         D
   A                    
   1  0.707107  0.707107
   2  0.707107  0.707107
   ```
   
   I think `TypeError` is more appropriate than the `ValueError` raised by pandas.
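
The value 0.707107 in the pandas output above can be checked by hand. Every group holds exactly two rows, and the sample standard deviation (`ddof=1`, the default in the `std` signature shown in the diff) of two values that differ by 1 is sqrt(0.5). The sketch below uses only Python's standard library; converting booleans to 0.0 and 1.0 is an assumption about how they enter the calculation, chosen because it is consistent with the column D output.

```python
import statistics

# Rows of the example DataFrame, grouped by column A.
groups = {
    1: {"B": [3.1, 4.1], "D": [True, False]},
    2: {"B": [4.1, 3.1], "D": [False, True]},
}


def sample_std(values):
    # statistics.stdev divides by n - 1, which corresponds to ddof=1.
    return statistics.stdev([float(v) for v in values])


result = {
    key: {col: round(sample_std(vals), 6) for col, vals in cols.items()}
    for key, cols in groups.items()
}
print(result)
```

Each entry rounds to 0.707107, matching both columns of the pandas result.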




[GitHub] [spark] Yikun commented on pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

Yikun commented on PR #36444:
URL: https://github.com/apache/spark/pull/36444#issuecomment-1119136038

   @xinrong-databricks Thanks for clarifying. I think the current PR title is fine.



[GitHub] [spark] HyukjinKwon commented on pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36444:
URL: https://github.com/apache/spark/pull/36444#issuecomment-1116786842

   cc @Yikun since he already worked on related items.



[GitHub] [spark] xinrong-databricks commented on pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

xinrong-databricks commented on PR #36444:
URL: https://github.com/apache/spark/pull/36444#issuecomment-1118872415

   Good point @Yikun! I named the PR this way because Spark 3.4 is claimed to match pandas 1.4. I can change the PR title if you feel it is inappropriate.



[GitHub] [spark] HyukjinKwon commented on pull request #36444: [SPARK-39095][PYTHON] Adjust `GroupBy.std` to match pandas 1.4

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36444:
URL: https://github.com/apache/spark/pull/36444#issuecomment-1119187555

   Merged to master.

