You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/09 03:43:08 UTC

[GitHub] [spark] Yikun commented on a diff in pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

Yikun commented on code in PR #36414:
URL: https://github.com/apache/spark/pull/36414#discussion_r867604372


##########
python/pyspark/pandas/frame.py:
##########
@@ -583,6 +583,7 @@ def _reduce_for_stat_function(
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,

Review Comment:
   add parameter doc for `skipna`



##########
python/pyspark/pandas/generic.py:
##########
@@ -1312,11 +1326,20 @@ def sum(psser: "Series") -> Column:
             return F.coalesce(F.sum(spark_column), SF.lit(0))
 
         return self._reduce_for_stat_function(
-            sum, name="sum", axis=axis, numeric_only=numeric_only, min_count=min_count
+            sum,
+            name="sum",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     def product(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,

Review Comment:
   +1 to keep pandas behavior, we might want to mention this in migration guide?



##########
python/pyspark/pandas/series.py:
##########
@@ -6859,13 +6860,16 @@ def _reduce_for_stat_function(
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index axis.
-        numeric_only : not used by this implementation, but passed down by stats functions
+        numeric_only : not used by this implementation, but passed down by stats functions.
         """
         axis = validate_axis(axis)
         if axis == 1:
             raise NotImplementedError("Series does not support columns axis.")
 
-        scol = sfun(self)
+        if not skipna and self.hasnans:

Review Comment:
   +1, see https://github.com/apache/spark/commit/e031d00b52e1973ea62f0a80da948f169c73dad9 as ref



##########
python/pyspark/pandas/series.py:
##########
@@ -6859,13 +6860,16 @@ def _reduce_for_stat_function(
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index axis.
-        numeric_only : not used by this implementation, but passed down by stats functions
+        numeric_only : not used by this implementation, but passed down by stats functions.

Review Comment:
   doc for `skipna`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org