You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/13 23:15:30 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request, #36547: Implement `skipna` parameter of `Groupby.all`

xinrong-databricks opened a new pull request, #36547:
URL: https://github.com/apache/spark/pull/36547

   ### What changes were proposed in this pull request?
   Implement `skipna` parameter of `Groupby.all`.
   
   
   ### Why are the changes needed?
   To increase pandas API coverage.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. `skipna parameter of `Groupby.all` is supported.
   
   
   ### How was this patch tested?
   Unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
ueshin commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128165133

   I'm afraid I'm not sure the point here.
   
   I don't think methods affect the builtin functions:
   
   ```py
   >>> class A:
   ...   def all(self):
   ...     print("all")
   ...
   >>>
   >>> A().all()
   all
   >>> all
   <built-in function all>
   >>> all()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: all() takes exactly one argument (0 given)
   >>> all([True])
   True
   >>> all([False])
   False
   ```
   
   Also I'm not sure how `@final` helps to avoid shading issue.
   
   Btw, 
   
   > The PEP targets Python 3.8 so it seems fine to introduce it to the master branch.
   
   We are still supporting Python 3.7. 
   
   https://github.com/apache/spark/blob/47d237c74ccb1836e3de82dc583499ffd3f25755/python/setup.py#L276


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1130377446

   Rebased for conflicting files `python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
zero323 commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1129228724

   > Would you give an example in which case we may diverge from pandas? I
   
   Sure thing @xinrong-databricks. Sorry for being enigmatic before. So,  very simple case would be something like this:
   
   ```python
   >>> import pandas as pd
   >>> import pyspark.pandas as ps
   >>> ps.Series(["foo", ""]).all()
   True
   >>> pd.Series(["foo", ""]).all()
   False
   ```
   
   In Pandas we follow standard Python truthness  convention so anything non empty (not empty string, not empty list) or not zero is True.
   
   In Spark, in case of strings, we expect `true` / `false` otherwise we evaluate cast to `NULL`.
   
   Furthermore, we have cases where `boolean` cast is not allowed. For example this is valid in Pandas
   
   ```python
   >>> pd.Series([["foo"], ["bar"]]).all()
   True
   ```
   but invalid in Pandas on Spark:
   
   ```python
   >>> ps.Series([["foo"], ["bar"]]).all()
   Traceback (most recent call last):
   ...
   AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type mismatch: cannot cast array<string> to boolean;
   'Aggregate [unresolvedalias(min(coalesce(cast(0#46 as boolean), true)), Some(org.apache.spark.sql.Column$$Lambda$1717/0x0000000100e64040@16600c7b))]
   +- Project [0#46]
      +- Project [__index_level_0__#45L, 0#46, monotonically_increasing_id() AS __natural_order__#49L]
         +- LogicalRDD [__index_level_0__#45L, 0#46], false
   ```
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`
URL: https://github.com/apache/spark/pull/36547


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1129166446

   Rebased master to retrigger irrelevant failed test. No new changes after review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128007756

   Yes, def `all` is a built in function in python. 
   Spark python codebase has no more than 56 of this now, which can be a problem.
     
   Sometimes shading python built-in functions is the right way.  And in this case, it's right. 
   
   But if you look at the code that pandas are using, they [decorate the function](https://github.com/pandas-dev/pandas/blob/v1.4.2/pandas/core/groupby/groupby.py#L1810) with `@final` 
   This is [PEP 591 – Adding a final qualifier to typing](https://peps.python.org/pep-0591/)   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1130389206

   Yes, we can add `@final` to python built in functions when we drop Python 3.7 
   
   Thanks everyone :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
zero323 commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128565616

   By itself LGTM. To concur with others, I also don't see shading issue and if it there was one, we're not introducing a new method here and changing a name at this point would be a breaking change., even if we ignore Pandas parity.
   
   Speaking about parity ‒ did we intentionally diverge from Pandas contract in the original implementation?  Boolean cast not only is not going to cover all types, but also yield different results in some cases (not something to be fixed here, if at all, just want to clarify things to decide if we need a new JIRA ticket for that).
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128134042

   The comment on PySpark's shading python built-in functions really makes sense. The PEP targets Python 3.8 so it seems fine to introduce it to the master branch. May I get some opinions on that? 
   
   FYI @ueshin @HyukjinKwon @itholic @zero323 @Yikun @zhengruifeng 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: Implement `skipna` parameter of `Groupby.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1127916884

   @bjornjorgensen We may want to reach parity with pandas https://github.com/pandas-dev/pandas/blob/v1.4.2/pandas/core/groupby/groupby.py#L1810-L1828. I may not see how `all_to_skip` saves us from defining `all`. Would you elaborate? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1130383732

   Thanks for the ideas from all of you on the built-in function shading!
   
   To summarize, built-in function shading doesn't seem to be a critical problem; meanwhile, since we still support Python 3.7, `@final` may not be introduced immediately. We will certainly consider adding `@final` when we deprecate support for Python 3.7.
   
   WDYT @bjornjorgensen?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1140596294

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #36547: Implement `skipna` parameter of `Groupby.all`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1126676101

   [all](https://docs.python.org/3/library/functions.html#all) is a built in function in python. Can we rename this to `def all_to_skip()`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36547: Implement `skipna` parameter of `Groupby.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36547:
URL: https://github.com/apache/spark/pull/36547#discussion_r873968124


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2862,6 +2879,9 @@ def _reduce_for_stat_function(
         )
         psdf = DataFrame(internal)
 
+        return self._prepare_return(psdf)
+
+    def _prepare_return(self, psdf: DataFrame) -> DataType:

Review Comment:
   Code of `_prepare_return` is extracted from `_reduce_for_stat_function`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1138764710

   May I get a review @HyukjinKwon thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a diff in pull request #36547: Implement `skipna` parameter of `Groupby.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on code in PR #36547:
URL: https://github.com/apache/spark/pull/36547#discussion_r873967446


##########
python/pyspark/pandas/groupby.py:
##########
@@ -2879,7 +2898,43 @@ def _reduce_for_stat_function(
                 psdf = psdf.reset_index(level=should_drop_index, drop=True)
             if len(should_drop_index) < len(self._groupkeys):
                 psdf = psdf.reset_index()
-        return self._cleanup_and_return(psdf)
+        psdf = self._cleanup_and_return(psdf)
+        return psdf
+
+    def _prepare_reduce(

Review Comment:
   Code of `_prepare_reduce` is extracted from `_reduce_for_stat_function`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Yikun commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
Yikun commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128324461

   > But if you look at the code that pandas are using, they [decorate the function](https://github.com/pandas-dev/pandas/blob/v1.4.2/pandas/core/groupby/groupby.py#L1810) with @final
   
   Actually, the `final` is not for the decoration of built-in function in Pandas, it `to indicate to the reader that a the method is not overridden` according https://github.com/pandas-dev/pandas/commit/7073ee1612019f4e0e9d6a1b205257e8eab02382 merge history.
   
   As @ueshin mentioned, there also are no issue when using build-in same name func under the namespace, this is also the recommand way to [pep-20](https://peps.python.org/pep-0020/#the-zen-of-python): `Namespaces are one honking great idea -- let's do more of those!`.
   
   > Spark python codebase has no more than 56 shading python built-in functions now, which can be a problem.
   
   The only potential issue I can imagine, is for Spark dev (not users), we have to use `builtins.set` when we want to call the built-in under the same namespace, such as:
   
   ```python
   import builtins
   from typing import final
   
   class A():
       def __init__(self, value):
           self.set(value)
       # class method with same name
       def set(self, value):
           self.value=value
       # TypeError: set() missing 1 required positional argument: 'value'
       # test_val = set([1,2,3])
       # This is ok, becasue we are using set under builtins namespace
       test_val = builtins.set([1,2,3])
   ```
   
   As above demo, even we add `final`, we still can't avoid this case, we should using `builtins.set` to resolve the namespace problems.
    
   BTW, If we need to add `final` (if we think `to indicate to the reader that a the method is not overridden` is very important), we could add this later in separate patch. I'm OK but doesn't have very strong feel with this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1129178343

   > Boolean cast not only is not going to cover all types, but also yield different results in some cases
   
   Would you give an example in which case we may diverge from pandas? I don't think the original implementation intentionally diverges from pandas. Thanks @zero323 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #36547: [SPARK-39197][PYTHON] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1128336669

   +1 for what @ueshin @Yikun commented.
   
   > The only potential issue I can imagine, we have to use builtins.set when we want to call the built-in func under the same namespace
   
   we may simply follow this https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L353-L354


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #36547: [SPARK-39197][PYTHON][PS] Implement `skipna` parameter of `GroupBy.all`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1130376172

   Thank you @zero323 for the clear explanation! Good point.
   
   I created https://issues.apache.org/jira/browse/SPARK-39227 to resolve the behavior difference on boolean casts mentioned.
   
   Meanwhile, we may also want to include Series/Frame that contains empty strings/ lists as test cases, in order to expose such behavior differences in the future. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org