You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/01 18:21:14 UTC

[GitHub] [spark] ueshin commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

ueshin commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r700456001



##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(

Review comment:
       nit: shall we use `pdf` for DataFrame?

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)

Review comment:
       ditto: `psdf`?

##########
File path: python/pyspark/pandas/tests/test_ops_on_diff_frames.py
##########
@@ -1793,6 +1793,23 @@ def test_rank(self):
             pdf["Col2"].rank().loc[pdf["Col1"] == 20], psdf["Col2"].rank().loc[psdf["Col1"] == 20]
         )
 
+    def test_cov(self):
+        pser1 = pd.Series([0.90010907, 0.13484424, 0.62036035], index=[0, 1, 2])
+        pser2 = pd.Series([0.12528585, 0.26962463, 0.51111198], index=[1, 2, 3])
+        pcov = pser1.cov(pser2)
+
+        psser1 = ps.from_pandas(pser1)
+        psser2 = ps.from_pandas(pser2)
+        pscov = psser1.cov(psser2)
+        self.assert_eq(pcov, pscov, almost=True)

Review comment:
       Could you also try the cases:
   - with `min_periods` on boundaries (2 and 3?)
   - when the lengths of Series are different

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)
+        pscov = psser["s1"].cov(psser["s2"])
+        self.assert_eq(pcov, pscov, almost=True)

Review comment:
       Could you also try the cases:
   - with `min_periods` on boundaries (3 and 4?)
   - when the dtypes of the Series are not numeric




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org