Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/21 05:47:45 UTC

[GitHub] [spark] itholic commented on a diff in pull request #37945: [SPARK-40498][PS] Implement `kendall` and `min_periods` in `Series.corr`

itholic commented on code in PR #37945:
URL: https://github.com/apache/spark/pull/37945#discussion_r975971640


##########
python/pyspark/pandas/series.py:
##########
@@ -3312,16 +3318,25 @@ def autocorr(self, periods: int = 1) -> float:
             )
         return np.nan if corr is None else corr
 
-    def corr(self, other: "Series", method: str = "pearson") -> float:
+    def corr(
+        self, other: "Series", method: str = "pearson", min_periods: Optional[int] = None
+    ) -> float:
         """
         Compute correlation with `other` Series, excluding missing values.
 
+        .. versionadded:: 3.3.0
+
         Parameters
         ----------
         other : Series
-        method : {'pearson', 'spearman'}
+        method : {'pearson', 'spearman', 'kendall'}
             * pearson : standard correlation coefficient
             * spearman : Spearman rank correlation
+            * kendall : Kendall Tau correlation coefficient

Review Comment:
   Maybe we should add `.. versionchanged:: 3.4.0` to mention the new `kendall` support?
   
   ```python
           method : {'pearson', 'spearman', 'kendall'}
               * pearson : standard correlation coefficient
               * spearman : Spearman rank correlation
               * kendall : Kendall Tau correlation coefficient
   
               .. versionchanged:: 3.4.0
                   support 'kendall' for method parameter
   ```
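
   For context on what `method='kendall'` computes: pandas (and therefore pandas-on-Spark, which mirrors it) uses the tie-corrected Kendall tau-b statistic. This is a minimal pure-Python sketch of that definition, run on the same data as the doctest in this PR; it is an illustration of the statistic, not the PR's implementation:

   ```python
   import math
   from itertools import combinations

   def kendall_tau_b(x, y):
       """Kendall tau-b: (concordant - discordant) pairs, corrected for ties."""
       assert len(x) == len(y)
       concordant = discordant = ties_x = ties_y = 0
       for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
           dx, dy = xi - xj, yi - yj
           if dx == 0:
               ties_x += 1          # pair tied in x (counts toward both if dy == 0 too)
           if dy == 0:
               ties_y += 1
           if dx != 0 and dy != 0:
               if dx * dy > 0:
                   concordant += 1
               else:
                   discordant += 1
       n0 = len(x) * (len(x) - 1) / 2
       return (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

   # Same data as the Series.corr doctest in this PR
   s1 = [0.2, 0.0, 0.6, 0.2]
   s2 = [0.3, 0.6, 0.0, 0.1]
   print(kendall_tau_b(s1, s2))  # -5/sqrt(30) ~= -0.912871
   ```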



##########
python/pyspark/pandas/tests/test_stats.py:
##########
@@ -258,8 +258,6 @@ def test_skew_kurt_numerical_stability(self):
         self.assert_eq(psdf.kurt(), pdf.kurt(), almost=True)
 
     def test_dataframe_corr(self):
-        # existing 'test_corr' is mixed by df.corr and ser.corr, will delete 'test_corr'
-        # when we have separate tests for df.corr and ser.corr

Review Comment:
   Nice!



##########
python/pyspark/pandas/series.py:
##########
@@ -3333,29 +3348,74 @@ def corr(self, other: "Series", method: str = "pearson") -> float:
         ...                    's2': [.3, .6, .0, .1]})
         >>> s1 = df.s1
         >>> s2 = df.s2
-        >>> s1.corr(s2, method='pearson')  # doctest: +ELLIPSIS
-        -0.851064...
+        >>> s1.corr(s2, method='pearson')
+        -0.85106...
 
-        >>> s1.corr(s2, method='spearman')  # doctest: +ELLIPSIS
-        -0.948683...
+        >>> s1.corr(s2, method='spearman')
+        -0.94868...
 
-        Notes
-        -----
-        There are behavior differences between pandas-on-Spark and pandas.

Review Comment:
   Oh, so now we can have the same behavior as pandas?
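
   For reference, the pandas contract this PR is aligning with: correlation is computed over pairwise-complete observations, and `min_periods` makes the result NaN when fewer valid pairs remain. A minimal pure-Python sketch of that semantics for Pearson (again just an illustration under the documented pandas behavior, not the PR's Spark-side implementation):

   ```python
   import math

   def pearson_corr(x, y, min_periods=None):
       """Pearson r over pairwise-complete observations; NaN when fewer
       than min_periods valid pairs remain (the pandas contract)."""
       pairs = [(a, b) for a, b in zip(x, y)
                if not (math.isnan(a) or math.isnan(b))]
       if min_periods is not None and len(pairs) < min_periods:
           return float("nan")
       n = len(pairs)
       mx = sum(a for a, _ in pairs) / n
       my = sum(b for _, b in pairs) / n
       cov = sum((a - mx) * (b - my) for a, b in pairs)
       vx = sum((a - mx) ** 2 for a, _ in pairs)
       vy = sum((b - my) ** 2 for _, b in pairs)
       return cov / math.sqrt(vx * vy)

   s1 = [0.2, 0.0, 0.6, 0.2]
   s2 = [0.3, 0.6, 0.0, 0.1]
   print(round(pearson_corr(s1, s2), 6))       # matches the doctest: -0.851064
   print(pearson_corr(s1, s2, min_periods=5))  # nan: only 4 valid pairs
   ```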



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org