Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/07 11:50:15 UTC

[GitHub] [spark] dchvn opened a new pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

dchvn opened a new pull request #34213:
URL: https://github.com/apache/spark/pull/34213


   ### What changes were proposed in this pull request?
   
   Implement DataFrame.cov
   
   ### Why are the changes needed?
   
   Increase pandas API coverage in PySpark
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Users can now call `DataFrame.cov` on a pandas-on-Spark DataFrame:
   
   ``` python
   >>> psdf = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
   ...                   columns=['dogs', 'cats'])
   >>> psdf.cov()
          dogs      cats
   dogs  0.666667 -1.000000
   cats -1.000000  1.666667
   
   >>> pdf = pd.DataFrame(
   ...     {
   ...         "a": [1, np.nan, 3, 4],
   ...         "b": [True, False, False, True],
   ...         "c": [True, True, False, True],
   ...     }
   ... )
   >>> psdf = ps.from_pandas(pdf)
   >>> psdf.cov()
             a         b         c
   a  2.333333 -0.166667 -0.166667
   b -0.166667  0.333333  0.166667
   c -0.166667  0.166667  0.250000
   ```
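
The second example above can be cross-checked in plain pandas and NumPy, without a Spark session. This is a minimal sketch, not the code in this PR: `pairwise_cov` is a hypothetical helper, and it assumes the sample covariance (ddof=1) taken over only the rows where both columns are non-null, which is what the printed values suggest.

```python
import numpy as np
import pandas as pd


def pairwise_cov(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sample covariance per column pair, skipping nulls pairwise."""
    data = pdf.astype("float64")  # booleans become 0.0 / 1.0
    cols = list(data.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i in cols:
        for j in cols:
            # Keep only the rows where both columns have a value.
            mask = data[i].notna() & data[j].notna()
            x = data.loc[mask, i].to_numpy()
            y = data.loc[mask, j].to_numpy()
            n = len(x)
            if n >= 2:  # ddof=1 needs at least two observations
                out.loc[i, j] = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
    return out


pdf = pd.DataFrame(
    {
        "a": [1, np.nan, 3, 4],
        "b": [True, False, False, True],
        "c": [True, True, False, True],
    }
)
res = pairwise_cov(pdf)
print(res.round(6))
```

Note that the `a` row and column use three observations while the `b`/`c` entries use all four, because the null in `a` is dropped per pair rather than per frame.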
   
   ### How was this patch tested?
   
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982303245


   Ping @HyukjinKwon Thanks :D



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937897012


   CC @ueshin  @HyukjinKwon @itholic, FYI



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937811486


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143982/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982273172


   **[Test build #145748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145748/testReport)** for PR 34213 at commit [`dc20b38`](https://github.com/apache/spark/commit/dc20b382141b250deba95b6e755b41ed50fc76f3).



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937748434


   **[Test build #143968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143968/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923248



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
       Thanks for reviewing!
   I think we could keep the interface consistent with pandas, like ```Series.cov```.
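
For reference, the pandas side of that consistency argument can be inspected directly. The snippet below is only a sketch that looks at pandas itself (not at pandas-on-Spark): it checks that both `Series.cov` and `DataFrame.cov` in pandas accept a `min_periods` keyword defaulting to `None`, which is the shape the new method mirrors.

```python
import inspect

import pandas as pd

# Parameters of the two pandas methods the comment refers to.
series_params = inspect.signature(pd.Series.cov).parameters
frame_params = inspect.signature(pd.DataFrame.cov).parameters

for name, params in (("Series.cov", series_params), ("DataFrame.cov", frame_params)):
    has_kw = "min_periods" in params
    default = params["min_periods"].default if has_kw else "<missing>"
    print(f"{name}: min_periods present={has_kw}, default={default}")
```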





[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r746231586



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)

Review comment:
       Updated! Please take another look when you find some time. Thanks!





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937931838


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48463/
   



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982350720


   Thanks! @itholic @Yikun @HyukjinKwon 



[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950063876


   **[Test build #144551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).



[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923406



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)

Review comment:
       It seems we need a quick check of ```min_periods > len(self)``` rather than ```num_cols```? Thanks!
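
The reasoning behind that check can be sketched in plain pandas and NumPy. This is an illustration under stated assumptions, not the code in this PR: `cov_with_min_periods` is a hypothetical helper, and the early return encodes the idea that no column pair can have more observations than the frame has rows, so the comparison belongs against the row count rather than the column count.

```python
from typing import Optional

import numpy as np
import pandas as pd


def cov_with_min_periods(pdf: pd.DataFrame, min_periods: Optional[int] = None) -> pd.DataFrame:
    """Hypothetical helper: pairwise sample covariance honouring min_periods."""
    min_periods = 1 if min_periods is None else min_periods
    cols = list(pdf.columns)
    result = pd.DataFrame(np.nan, index=cols, columns=cols)

    # Quick check: a pair can never have more observations than there are rows,
    # so every entry stays NaN and the pairwise work can be skipped.
    if min_periods > len(pdf):
        return result

    data = pdf.astype("float64")
    for i in cols:
        for j in cols:
            mask = data[i].notna() & data[j].notna()
            n = int(mask.sum())
            # Below the threshold (or with fewer than two rows, where ddof=1
            # is undefined) the entry is left as NaN.
            if n < max(min_periods, 2):
                continue
            x = data.loc[mask, i].to_numpy()
            y = data.loc[mask, j].to_numpy()
            result.loc[i, j] = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
    return result


pdf = pd.DataFrame({"a": [np.nan, np.nan, 1.0, 2.0, 3.0], "b": [1.0, 2.0, 3.0, 4.0, 5.0]})
print(cov_with_min_periods(pdf, min_periods=4))  # only b/b has enough observations
print(cov_with_min_periods(pdf, min_periods=6))  # more than len(pdf): all NaN
```

Here the frame has 5 rows and 2 columns, so comparing `min_periods` with the number of columns would short-circuit far too early.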





[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-984227620


   Would you mind checking and making a follow-up to fix the tests?



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937798361


   **[Test build #143982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143982/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937718644


   **[Test build #143968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143968/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950102382


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49022/
   



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-965081140


   **[Test build #145046 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145046/testReport)** for PR 34213 at commit [`b14287e`](https://github.com/apache/spark/commit/b14287e4173f92d81f48d39b1a2e201caf176547).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-952512142


   Ping @HyukjinKwon , Many thanks!



[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982173085


   Otherwise, I guess it's ready to go given two LGTMs.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982295991


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145748/
   



[GitHub] [spark] ueshin commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

ueshin commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-990226363


   @dchvn Are you working on the test fix? Otherwise, I'd take it over. Thanks.



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937718644







[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937984715


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48463/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937759901







[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-964765546


   **[Test build #145046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145046/testReport)** for PR 34213 at commit [`b14287e`](https://github.com/apache/spark/commit/b14287e4173f92d81f48d39b1a2e201caf176547).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-964840572


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49517/
   



[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923937



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
       TODO note updated.





[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-984226794


   Hm, actually @dchvn, it seems the tests fail with the minimum pandas version 1.0.5.



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982285733


   **[Test build #145748 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145748/testReport)** for PR 34213 at commit [`dc20b38`](https://github.com/apache/spark/commit/dc20b382141b250deba95b6e755b41ed50fc76f3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982334604


   Merged to master.



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982295991


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145748/
   



[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982172917


   Sorry, it slipped through my fingers. Mind rebasing and making the tests pass?



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982319967


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50219/
   



[GitHub] [spark] HyukjinKwon commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-948176783


   @Yikun FYI if you find some time to review this 🙏 



[GitHub] [spark] Yikun commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r733270913



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
       There is a `ddof` arg in [pandas#def cov](https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L9581-L9585), so it would be better to add a TODO here.
   
   BTW, for `min_periods`, I guess it's there just to keep the interface consistent with pandas, but I think:
   1. the signature could change to `def cov(self, min_periods: Optional[int] = 1) -> "DataFrame":`, and L8279 could be removed
   2. a note could be added to the `min_periods` doc (like `min_periods would be set to 1 if min_periods is None.`) to explain exactly what we do.
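
For illustration, the `min_periods` semantics discussed in this comment can be sketched in plain Python. This is a minimal sketch, not the PR's Spark implementation: `pair_cov` is a hypothetical helper, `None` stands in for a missing value, and it assumes the sample covariance (`ddof=1`, the pandas default) and that `min_periods=None` is treated as 1, as the comment suggests.

```python
import math
from typing import List, Optional


def pair_cov(
    xs: List[Optional[float]],
    ys: List[Optional[float]],
    min_periods: Optional[int] = None,
) -> float:
    """Sample covariance of two columns over rows where both are present.

    Hypothetical helper for illustration only; None marks a missing value.
    """
    # Assumption taken from the review comment: None behaves like 1.
    min_periods = 1 if min_periods is None else min_periods

    # Keep only the rows in which both observations are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)

    # Too few paired observations, or ddof=1 would divide by zero: NaN.
    if n < min_periods or n < 2:
        return math.nan

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


dogs = [1, 0, 2, 1]
cats = [2, 3, 0, 1]
print(pair_cov(dogs, cats))                 # covariance of the two columns
print(pair_cov(dogs, cats, min_periods=5))  # nan: only 4 paired rows
```

With the `dogs`/`cats` data from the PR description, the off-diagonal value matches the `-1.000000` shown there, and asking for five observations from four rows yields NaN.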





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950091643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144551/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937760151


   **[Test build #143982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143982/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-964823389


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49517/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937759901







[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937759901


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143968/
   



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950086802


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49022/
   



[GitHub] [spark] dchvn commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r734923848



##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -6025,6 +6025,64 @@ def test_multi_index_dtypes(self):
             )
             self.assert_eq(psmidx.dtypes, expected)
 
+    def test_cov(self):
+        # SPARK-36396: Implement DataFrame.cov
+
+        # int
+        pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(pdf.cov(), psdf.cov(), almost=True)
+        self.assert_eq(pdf.cov(min_periods=4), psdf.cov(min_periods=4), almost=True)
+        self.assert_eq(pdf.cov(min_periods=5), psdf.cov(min_periods=5), almost=True)

Review comment:
       thanks! updated.





[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-964765546


   **[Test build #145046 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145046/testReport)** for PR 34213 at commit [`b14287e`](https://github.com/apache/spark/commit/b14287e4173f92d81f48d39b1a2e201caf176547).



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-984228067


   sure, thanks!



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937760151


   **[Test build #143982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143982/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).



[GitHub] [spark] Yikun commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r733270913



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":

Review comment:
       There is a `ddof` arg in [pandas#def cov](https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L9581-L9585), so it would be better to add a TODO here.
   
   BTW, for `min_periods`, I guess it's there just to keep the interface consistent with pandas, but I think:
   1. the signature could change to `def cov(self, min_periods: Optional[int] = 1) -> "DataFrame":`, and L8279 could be removed
   2. a note could be added to the `min_periods` doc (like `min_periods would be set to 1 if min_periods is None.`) to explain exactly what we do.

##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -6025,6 +6025,64 @@ def test_multi_index_dtypes(self):
             )
             self.assert_eq(psmidx.dtypes, expected)
 
+    def test_cov(self):
+        # SPARK-36396: Implement DataFrame.cov
+
+        # int
+        pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(pdf.cov(), psdf.cov(), almost=True)
+        self.assert_eq(pdf.cov(min_periods=4), psdf.cov(min_periods=4), almost=True)
+        self.assert_eq(pdf.cov(min_periods=5), psdf.cov(min_periods=5), almost=True)
+
+        # bool
+        pdf = pd.DataFrame(
+            {
+                "a": [1, np.nan, 3, 4],
+                "b": [True, False, False, True],
+                "c": [True, True, False, True],
+            }
+        )
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(pdf.cov(), psdf.cov(), almost=True)
+        self.assert_eq(pdf.cov(min_periods=4), psdf.cov(min_periods=4), almost=True)
+        self.assert_eq(pdf.cov(min_periods=5), psdf.cov(min_periods=5), almost=True)

Review comment:
       ditto

##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -6025,6 +6025,64 @@ def test_multi_index_dtypes(self):
             )
             self.assert_eq(psmidx.dtypes, expected)
 
+    def test_cov(self):
+        # SPARK-36396: Implement DataFrame.cov
+
+        # int
+        pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"])
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(pdf.cov(), psdf.cov(), almost=True)
+        self.assert_eq(pdf.cov(min_periods=4), psdf.cov(min_periods=4), almost=True)
+        self.assert_eq(pdf.cov(min_periods=5), psdf.cov(min_periods=5), almost=True)

Review comment:
       nit: I guess you could remove `almost=True` here (for `min_periods=5`).
   
   The results are NaN, so it is better to do an exact-match test.
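
The point about NaN results can be sketched in plain Python. This is only an illustration under stated assumptions: `nan_equal` is a hypothetical helper, not part of the PySpark test utilities, and the matrices are modelled as nested lists of floats rather than DataFrames.

```python
import math
from typing import List


def nan_equal(left: List[List[float]], right: List[List[float]]) -> bool:
    """Exact element-wise equality that treats NaN as equal to NaN."""
    if len(left) != len(right):
        return False
    for row_l, row_r in zip(left, right):
        if len(row_l) != len(row_r):
            return False
        for a, b in zip(row_l, row_r):
            both_nan = math.isnan(a) and math.isnan(b)
            if not both_nan and a != b:
                return False
    return True


# With 4 rows and min_periods=5, every entry of the 2x2 result is NaN.
expected = [[math.nan, math.nan], [math.nan, math.nan]]
actual = [[math.nan, math.nan], [math.nan, math.nan]]

print(math.nan == math.nan)         # False: plain == cannot compare NaN
print(nan_equal(expected, actual))  # True
```

Because every cell is NaN, there is no floating-point rounding to tolerate, which is why an exact, NaN-aware comparison is sufficient in this case.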

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)
+        data_cols = psdf._internal.data_spark_column_names
+        cov_scols = []
+        count_not_null_scols = []
+
+        # Count the rows in which both columns of a pair are non-null
+        # Example:
+        #    a   b   c
+        # 0  1   1   1
+        # 1  NaN 2   2
+        # 2  3   NaN 3
+        # 3  4   4   4
+        #
+        #    a           b             c
+        # a  count(a, a) count(a, b) count(a, c)
+        # b              count(b, b) count(b, c)
+        # c                          count(c, c)
+        #
+        # count_not_null_scols =
+        # [F.count(a, a), F.count(a, b), F.count(a, c), F.count(b, b), F.count(b, c), F.count(c, c)]
+        for r in range(0, num_cols):
+            for c in range(r, num_cols):
+                count_not_null_scols.append(
+                    F.count(
+                        F.when(F.col(data_cols[r]).isNotNull() & F.col(data_cols[c]).isNotNull(), 1)
+                    )
+                )
+
+        count_not_null = (
+            psdf._internal.spark_frame.replace(float("nan"), None)
+            .select(*count_not_null_scols)
+            .head(1)[0]
+        )
+
+        # Calculate covariance between two columns
+        # Example:
+        # with min_periods = 3
+        #    a   b   c
+        # 0  1   1   1
+        # 1  NaN 2   2
+        # 2  3   NaN 3
+        # 3  4   4   4
+        #
+        #    a         b         c
+        # a  cov(a, a) None      cov(a, c)
+        # b            cov(b, b) cov(b, c)
+        # c                      cov(c, c)
+        #
+        # cov_scols = [F.cov(a, a), None, F.cov(a, c), F.cov(b, b), F.cov(b, c), F.cov(c, c)]
+        step = 0
+        for r in range(0, num_cols):
+            step += r
+            for c in range(r, num_cols):
+                cov_scols.append(
+                    F.covar_samp(
+                        F.col(data_cols[r]).cast("double"), F.col(data_cols[c]).cast("double")
+                    )
+                    if count_not_null[r * num_cols + c - step] >= min_periods
+                    else F.lit(None)
+                )
+
+        pair_cov = psdf._internal.spark_frame.select(*cov_scols).head(1)[0]
+
+        # Convert from row to 2D array
+        # Example:
+        # pair_cov = [cov(a, a), None, cov(a, c), cov(b, b), cov(b, c), cov(c, c)]
+        #
+        # cov =
+        #
+        #    a         b         c
+        # a  cov(a, a) None      cov(a, c)
+        # b            cov(b, b) cov(b, c)
+        # c                      cov(c, c)
+        cov = np.zeros([num_cols, num_cols])

Review comment:
       Note to self: this looks like a reshape operation; we could extract it into utils or somewhere similar in the future.
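
As a rough sketch of the reshape step this comment refers to, the flat row of upper-triangle values can be expanded into a full symmetric matrix as follows. The helper name `upper_triangle_to_matrix` is hypothetical and the code uses plain Python lists rather than NumPy or Spark; it assumes the values are ordered row by row, as in the `pair_cov` comment of the diff.

```python
from typing import List, Optional


def upper_triangle_to_matrix(
    flat: List[Optional[float]], num_cols: int
) -> List[List[Optional[float]]]:
    """Expand row-major upper-triangle values into a full symmetric matrix."""
    expected_len = num_cols * (num_cols + 1) // 2
    if len(flat) != expected_len:
        raise ValueError(f"expected {expected_len} values, got {len(flat)}")

    matrix: List[List[Optional[float]]] = [
        [None] * num_cols for _ in range(num_cols)
    ]
    idx = 0
    for r in range(num_cols):
        for c in range(r, num_cols):
            # Covariance is symmetric, so mirror each value across the diagonal.
            matrix[r][c] = flat[idx]
            matrix[c][r] = flat[idx]
            idx += 1
    return matrix


# Order: (a, a), (a, b), (a, c), (b, b), (b, c), (c, c); made-up values.
flat_cov = [1.0, None, 0.5, 2.0, 0.25, 3.0]
for row in upper_triangle_to_matrix(flat_cov, 3):
    print(row)
```

Walking a running index through the same nested loops avoids the `r * num_cols + c - step` arithmetic; whether that is preferable in a shared utility is a design choice left open here.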

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)

Review comment:
       nit: It looks like we could add an early return to speed up the case where `min_periods > num_cols`: fill the DataFrame with NaN and return it directly, so that no Spark job needs to be set up, as [pandas does](https://github.com/pandas-dev/pandas/blob/f3f90c33966e3ba334c459ad89d607f820caa1f8/pandas/core/frame.py#L9663).





[GitHub] [spark] SparkQA removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937718644







[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937984758


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48463/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937759901


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143968/
   



[GitHub] [spark] Yikun commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r733280815



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)

Review comment:
       nit: It looks like we could add an early return to speed up the case where `min_periods > num_cols`: fill the DataFrame with NaN and return it directly, so that no Spark job needs to be set up, as [pandas does](https://github.com/pandas-dev/pandas/blob/f3f90c33966e3ba334c459ad89d607f820caa1f8/pandas/core/frame.py#L9663).
   
   This isn't a breaking change, though, so feel free to open a separate PR.
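
   A rough sketch of the early-return idea, using plain pandas so that it runs without a Spark session. The helper name `quick_nan_cov` and its `limit` parameter are hypothetical: `limit` stands for whatever bound the caller already knows without running a job, and the choice of bound is left to the caller rather than asserted here.

```python
import numpy as np
import pandas as pd


def quick_nan_cov(columns, min_periods, limit):
    """Return an all-NaN covariance frame when min_periods exceeds limit.

    Hypothetical helper for illustration only. Returns None when the
    early return does not apply and the real computation should run.
    """
    if min_periods is None or min_periods <= limit:
        return None
    n = len(columns)
    # No pair can reach min_periods, so every entry is NaN.
    return pd.DataFrame(np.full((n, n), np.nan), index=columns, columns=columns)


# Usage: fall through to the expensive path only when the guard returns None.
shortcut = quick_nan_cov(["a", "b", "c"], min_periods=12, limit=3)
```

   Returning `None` for the non-applicable case keeps the guard separate from the covariance computation itself, so it could be added without touching the existing code path.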





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-965081205


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145046/
   



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937718644


   **[Test build #143968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143968/testReport)** for PR 34213 at commit [`8d7ed7f`](https://github.com/apache/spark/commit/8d7ed7f18bbc851a7bdcc3705548304970810ef7).



[GitHub] [spark] Yikun commented on a change in pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #34213:
URL: https://github.com/apache/spark/pull/34213#discussion_r737085007



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -8201,6 +8202,185 @@ def update(self, other: "DataFrame", join: str = "left", overwrite: bool = True)
         internal = self._internal.with_new_sdf(sdf, data_fields=data_fields)
         self._update_internal_frame(internal, requires_same_anchor=False)
 
+    def cov(self, min_periods: Optional[int] = None) -> "DataFrame":
+        """
+        Compute pairwise covariance of columns, excluding NA/null values.
+
+        Compute the pairwise covariance among the series of a DataFrame.
+        The returned data frame is the `covariance matrix
+        <https://en.wikipedia.org/wiki/Covariance_matrix>`__ of the columns
+        of the DataFrame.
+
+        Both NA and null values are automatically excluded from the
+        calculation. (See the note below about bias from missing values.)
+        A threshold can be set for the minimum number of
+        observations for each value created. Comparisons with observations
+        below this threshold will be returned as ``NaN``.
+
+        This method is generally used for the analysis of time series data to
+        understand the relationship between different measures
+        across time.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
+        Returns
+        -------
+        DataFrame
+            The covariance matrix of the series of the DataFrame.
+
+        See Also
+        --------
+        Series.cov : Compute covariance with another Series.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
+        ...                   columns=['dogs', 'cats'])
+        >>> df.cov()
+                  dogs      cats
+        dogs  0.666667 -1.000000
+        cats -1.000000  1.666667
+
+        >>> np.random.seed(42)
+        >>> df = ps.DataFrame(np.random.randn(1000, 5),
+        ...                   columns=['a', 'b', 'c', 'd', 'e'])
+        >>> df.cov()
+                  a         b         c         d         e
+        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
+        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
+        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
+        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
+        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
+
+        **Minimum number of periods**
+
+        This method also supports an optional ``min_periods`` keyword
+        that specifies the required minimum number of non-NA observations for
+        each column pair in order to have a valid result:
+
+        >>> np.random.seed(42)
+        >>> df = pd.DataFrame(np.random.randn(20, 3),
+        ...                   columns=['a', 'b', 'c'])
+        >>> df.loc[df.index[:5], 'a'] = np.nan
+        >>> df.loc[df.index[5:10], 'b'] = np.nan
+        >>> sdf = ps.from_pandas(df)
+        >>> sdf.cov(min_periods=12)
+                  a         b         c
+        a  0.316741       NaN -0.150812
+        b       NaN  1.248003  0.191417
+        c -0.150812  0.191417  0.895202
+        """
+        min_periods = 1 if min_periods is None else min_periods
+
+        # Only compute covariance for Boolean and Numeric except Decimal
+        psdf = self[
+            [
+                col
+                for col in self.columns
+                if isinstance(self[col].spark.data_type, BooleanType)
+                or (
+                    isinstance(self[col].spark.data_type, NumericType)
+                    and not isinstance(self[col].spark.data_type, DecimalType)
+                )
+            ]
+        ]
+
+        num_cols = len(psdf.columns)

Review comment:
       yep





[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-965081205


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/145046/
   



[GitHub] [spark] dchvn edited a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn edited a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950060026


   CC @HyukjinKwon, I addressed some of the nits. May I resolve the other improvements in a follow-up PR later?



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950068793


   **[Test build #144551 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] HyukjinKwon closed pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #34213:
URL: https://github.com/apache/spark/pull/34213


   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982353317


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50219/
   



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982349531


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50219/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937984758


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48463/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-937811486


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143982/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-964840572


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49517/
   



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982273172


   **[Test build #145748 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/145748/testReport)** for PR 34213 at commit [`dc20b38`](https://github.com/apache/spark/commit/dc20b382141b250deba95b6e755b41ed50fc76f3).



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-982353317


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/50219/
   



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-939603339


   CC @HyukjinKwon, could you take a look? Thank you so much.



[GitHub] [spark] dchvn commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

dchvn commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950060026


   CC @HyukjinKwon, I addressed some of the nits. May I resolve the other improvements in a follow-up PR?



[GitHub] [spark] SparkQA commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950063876


   **[Test build #144551 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144551/testReport)** for PR 34213 at commit [`d2b819d`](https://github.com/apache/spark/commit/d2b819d2e1d6e229aaad5804c5e0417ba157bcf9).



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950091643


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144551/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950104918


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49022/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34213: [SPARK-36396][PYTHON] Implement DataFrame.cov

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34213:
URL: https://github.com/apache/spark/pull/34213#issuecomment-950104918


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49022/
   

