
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2017/03/31 09:45:30 UTC

[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/17494

    [SPARK-20076][ML][PySpark] Add Python interface for ml.stats.Correlation

    ## What changes were proposed in this pull request?
    
    DataFrame-based support for correlation statistics was added in #17108. This patch adds the Python interface for it.
    
    ## How was this patch tested?
    
    Python unit test.
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 correlation-python-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17494.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17494
    
----
commit 9b81ce9045f179e7bb481665b4491f5949eade47
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2017-03-31T09:36:58Z

    Add Python interface for ml.stats.Correlation.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    LGTM pending Jenkins confirming.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75569/
    Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109556837
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    --- End diff --
    
    Sounds good. Fixed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109492883
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0,  8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ["features"])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    +    DenseMatrix([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    +                 [ 0.05564149,  1.        ,         NaN,  0.91359586],
    +                 [        NaN,         NaN,  1.        ,         NaN],
    +                 [ 0.40047142,  0.91359586,         NaN,  1.        ]])
    +    >>> spearmanCorr = Correlation.corr(dataset, 'features', method="spearman").collect()[0][0]
    --- End diff --
    
    Super minor nit - but let's use single `'` everywhere here rather than have a mix of single & double.
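
For readers unfamiliar with the `spearman` option in the quoted doctest: Spearman correlation is the Pearson correlation of the ranks of the values. The following is a minimal pure-Python sketch of that idea, not the Spark implementation. The helpers `average_ranks`, `pearson` and `spearman` are hypothetical names, and giving tied values the mean of their positions is an assumption of this sketch.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0.0:
        return float('nan')
    return sum(a * b for a, b in zip(dx, dy)) / denom


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Columns of the four example vectors from the quoted docstring.
rows = [[1, 0, 0, -2], [4, 5, 0, 3], [6, 7, 0, 8], [9, 0, 0, 1]]
columns = [list(col) for col in zip(*rows)]
spearman_01 = spearman(columns[0], columns[1])
spearman_03 = spearman(columns[0], columns[3])
```

Ranking requires sorting each column, which is the per-column work that makes the Spearman method more expensive than Pearson.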



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110097152
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    I don't think `@note` will work for PyDoc? 



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75564/testReport)** for PR 17494 at commit [`5d9d70f`](https://github.com/apache/spark/commit/5d9d70fbe225899e8a53a5cb2c116350236d0230).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged to master. Thanks!



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    cc @thunterdb @jkbradley



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    cc @thunterdb



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75569/testReport)** for PR 17494 at commit [`5d04326`](https://github.com/apache/spark/commit/5d043264103e297c52d173ac9b84b7b89833cceb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75496/
    Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75430/testReport)** for PR 17494 at commit [`9b81ce9`](https://github.com/apache/spark/commit/9b81ce9045f179e7bb481665b4491f5949eade47).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Thanks @jkbradley 



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109135242
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0,  8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ["features"])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    --- End diff --
    
    Any reason for this replacement?



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109143992
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0,  8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ["features"])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    --- End diff --
    
    The test is adapted from the old mllib Correlation test. I can't think of a reason for the replacement other than a more readable representation of the 'NaN' values.
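
As background on where the NaN entries in the quoted doctest come from: the third component of every example vector is 0, so that column has zero variance and its Pearson correlation with any other column is undefined. The pure-Python sketch below reproduces this on the same data. It is an illustration only, not the Spark code path; `pearson_matrix` is a hypothetical helper, and setting the diagonal to 1.0 simply mirrors the matrix shown in the docstring.

```python
import math


def pearson_matrix(rows):
    """Pairwise Pearson correlations between the columns of `rows`."""
    n = len(rows)
    centered = []
    for col in zip(*rows):
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(d * d for d in col)) for col in centered]
    matrix = []
    for i, ci in enumerate(centered):
        row = []
        for j, cj in enumerate(centered):
            if i == j:
                # Mirrors the quoted output, which shows 1.0 on the diagonal.
                row.append(1.0)
            elif norms[i] == 0.0 or norms[j] == 0.0:
                # Zero variance in either column: correlation is undefined.
                row.append(float('nan'))
            else:
                dot = sum(a * b for a, b in zip(ci, cj))
                row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix


rows = [[1, 0, 0, -2], [4, 5, 0, 3], [6, 7, 0, 8], [9, 0, 0, 1]]
corr = pearson_matrix(rows)
# Python formats the float NaN as 'nan'; the replace only changes its spelling.
text = '\n'.join(' '.join('%.8f' % v for v in row) for row in corr)
text = text.replace('nan', 'NaN')
```

The final `replace` call has the same effect as the one in the doctest: it changes how the value is spelled in the printed output and nothing else.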



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Thanks @MLnick 



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75432/testReport)** for PR 17494 at commit [`e129e06`](https://github.com/apache/spark/commit/e129e063f3fc90c14af534e9f4b8b731dfc4fa33).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    LGTM if others are ok too



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    ping @MLnick 



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110104787
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    I think we should use a `.. note::`?
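
To illustrate the suggestion: `.. note::` is a reStructuredText directive that Sphinx renders as a highlighted note, whereas `@note` is not reStructuredText markup. A minimal sketch of how the directive sits inside a docstring follows. The function `corr` here is a hypothetical stub for showing the markup, not the code from the PR, and the note text is abbreviated.

```python
def corr(dataset, column, method='pearson'):
    """
    Hypothetical stub used only to illustrate docstring markup.

    .. note:: For Spearman, cache the input before calling this function
      to avoid recomputing the common lineage.

    :param dataset: A dataset or a dataframe.
    :param column: Name of the column of vectors.
    :param method: `pearson` (default) or `spearman`.
    """
    raise NotImplementedError('illustration only')


# The directive is plain text in the docstring; Sphinx turns it into a note box.
has_note_directive = '.. note::' in corr.__doc__
```

Continuation lines of the note are indented relative to the directive so that they stay part of the same block.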



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110156124
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    +      and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
    +      which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
    +      to avoid recomputing the common lineage.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0, 8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ['features'])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    +    DenseMatrix([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    Fair point - I guess it may lead to flaky tests at some point.
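
The comment being answered here is not part of this archive, so the exact concern is not visible. One possible way to make a check like the quoted doctest less dependent on how floats are printed is to compare values within a tolerance instead of matching text. The sketch below shows the idea in pure Python. `matrices_close` is a hypothetical helper and this is not what the PR does.

```python
import math


def matrices_close(actual, expected, abs_tol=1e-8):
    """Compare two nested lists of floats, treating NaN as equal to NaN."""
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        for a, e in zip(row_a, row_e):
            if math.isnan(a) or math.isnan(e):
                # NaN never equals itself, so require NaN on both sides.
                if not (math.isnan(a) and math.isnan(e)):
                    return False
            elif not math.isclose(a, e, rel_tol=0.0, abs_tol=abs_tol):
                return False
    return True


nan = float('nan')
# First two rows of the Pearson matrix quoted in the docstring.
expected = [[1.0, 0.05564149, nan, 0.40047142],
            [0.05564149, 1.0, nan, 0.91359586]]
# The same values with a tiny perturbation, standing in for a platform difference.
actual = [[1.0, 0.05564149 + 1e-10, nan, 0.40047142],
          [0.05564149, 1.0 - 1e-10, nan, 0.91359586]]
same = matrices_close(actual, expected)
```

The trade-off is that a numeric comparison no longer doubles as readable example output in the docstring, which is what a doctest provides.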



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75568 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75568/testReport)** for PR 17494 at commit [`fd76901`](https://github.com/apache/spark/commit/fd76901c39c24be48dda970f5e4625f839ee02ed).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110161378
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    +      and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
    +      which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
    +      to avoid recomputing the common lineage.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0, 8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ['features'])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    +    DenseMatrix([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    Although we have many tests in PySpark now with floats like this, it is a fair point. I agree.
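One way to sidestep exact float output in a test is to compare against the expected values with a tolerance instead of matching printed text. The sketch below is only an illustration of that idea in plain Python: `matrices_close` is a hypothetical helper, not part of PySpark, and it assumes the matrix has already been converted to a list of rows.

```python
import math


def matrices_close(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two matrices given as lists of rows, element by element.

    NaN entries count as equal only when both sides are NaN, since a
    constant column yields NaN correlations in the example output.
    """
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        for a, e in zip(row_a, row_e):
            if math.isnan(a) or math.isnan(e):
                if not (math.isnan(a) and math.isnan(e)):
                    return False
            elif not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```

The tolerances here are arbitrary defaults chosen for the illustration and would need to be tuned to the precision the test actually cares about.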



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    cc @holdenk



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17494



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109538556
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
    @@ -56,7 +56,7 @@ object Correlation {
        *  Here is how to access the correlation coefficient:
        *  {{{
        *    val data: Dataset[Vector] = ...
    -   *    val Row(coeff: Matrix) = Statistics.corr(data, "value").head
    +   *    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
        *    // coeff now contains the Pearson correlation matrix.
        *  }}}
        *
    --- End diff --
    
    Also, since we are here, there is a reference to an input RDD up above in the docstring.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75430/
    Test FAILed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Thanks @holdenk 



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75568 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75568/testReport)** for PR 17494 at commit [`fd76901`](https://github.com/apache/spark/commit/fd76901c39c24be48dda970f5e4625f839ee02ed).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109557018
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
    @@ -56,7 +56,7 @@ object Correlation {
        *  Here is how to access the correlation coefficient:
        *  {{{
        *    val data: Dataset[Vector] = ...
    -   *    val Row(coeff: Matrix) = Statistics.corr(data, "value").head
    +   *    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
        *    // coeff now contains the Pearson correlation matrix.
        *  }}}
        *
    --- End diff --
    
    oh, right. fixed. :-)



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75434/
    Test PASSed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110105126
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    Ok. Not quite familiar with PyDoc...



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110106687
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    ah. ok. fixed. see if this time it's ok.
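For readers wondering why the note calls Spearman costly: a rank correlation is the Pearson correlation of the per-column ranks, so every column has to be sorted first. The following is a minimal single-machine sketch in plain Python, not Spark's implementation. The helper names are hypothetical, ties get their average rank, and returning NaN for a zero-variance column is an assumption of this sketch.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption of this sketch: a constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks of each column."""
    return pearson(average_ranks(x), average_ranks(y))
```

The sort inside `average_ranks` is the step that, in a distributed setting, turns into a sort per column followed by a join, which is what the docstring note warns about.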



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    LGTM as well



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75432/
    Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75574/
    Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75495 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75495/testReport)** for PR 17494 at commit [`8936880`](https://github.com/apache/spark/commit/8936880bafd8a8520011e663c0edc3b428b9160f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110127639
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    +      and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
    +      which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
    +      to avoid recomputing the common lineage.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0, 8])],
    +    ...            [Vectors.dense([9, 0, 0, 1])]]
    +    >>> dataset = spark.createDataFrame(dataset, ['features'])
    +    >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0]
    +    >>> print(str(pearsonCorr).replace('nan', 'NaN'))
    +    DenseMatrix([[ 1.        ,  0.05564149,         NaN,  0.40047142],
    --- End diff --
    
    So maybe I'm being overly cautious, but doctests with floats have bitten me in the past. Would it be good to use the `...` syntax here, or is this going to be OK? (Just asking.)
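For context, the `...` syntax refers to the standard `doctest` `ELLIPSIS` option, which lets the expected output omit trailing digits. Below is a small standalone sketch of how that looks. `first_row` and `run_examples` are hypothetical names, and the values are copied from the expected output in the diff rather than computed by Spark.

```python
import doctest


def first_row():
    """Return three of the values shown in the first row of the diff output.

    With ELLIPSIS enabled, each "..." in the expected output matches any
    substring, so only the leading digits have to agree.

    >>> row = first_row()
    >>> print(row)  # doctest: +ELLIPSIS
    [1.0, 0.0556..., 0.4004...]
    """
    return [1.0, 0.05564149, 0.40047142]


def run_examples():
    """Run the doctest in first_row and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(first_row, "first_row", globs={"first_row": first_row}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

Whether the option is enabled per example with the inline directive, as here, or globally through the test runner's option flags is a project-level choice.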



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75432/testReport)** for PR 17494 at commit [`e129e06`](https://github.com/apache/spark/commit/e129e063f3fc90c14af534e9f4b8b731dfc4fa33).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Correlation(object):`



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75434/testReport)** for PR 17494 at commit [`a684ac8`](https://github.com/apache/spark/commit/a684ac82e38fe01f68c98fff0c17a9b63dbead45).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75495/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109493076
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    :param dataset:
    +      A dataset or a dataframe.
    +    :param column:
    +      The name of the column of vectors for which the correlation coefficient needs
    +      to be computed. This must be a column of the dataset, and it must contain
    +      Vector objects.
    +    :param method:
    +      String specifying the method to use for computing correlation.
    +      Supported: `pearson` (default), `spearman`.
    +    :return:
    +      A dataframe that contains the correlation matrix of the column of vectors. This
    +      dataframe contains a single row and a single column of name
    +      '$METHODNAME($COLUMN)'.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml.stat import Correlation
    +    >>> dataset = [[Vectors.dense([1, 0, 0, -2])],
    +    ...            [Vectors.dense([4, 5, 0, 3])],
    +    ...            [Vectors.dense([6, 7, 0,  8])],
    --- End diff --
    
    Another minor nit - there seems to be an extra space here.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75569/testReport)** for PR 17494 at commit [`5d04326`](https://github.com/apache/spark/commit/5d043264103e297c52d173ac9b84b7b89833cceb).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75564/testReport)** for PR 17494 at commit [`5d9d70f`](https://github.com/apache/spark/commit/5d9d70fbe225899e8a53a5cb2c116350236d0230).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75568/
    Test FAILed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75496/testReport)** for PR 17494 at commit [`fbcc1fe`](https://github.com/apache/spark/commit/fbcc1fe1c8e2652dc54c2ebfacce01a3f69449a2).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75434/testReport)** for PR 17494 at commit [`a684ac8`](https://github.com/apache/spark/commit/a684ac82e38fe01f68c98fff0c17a9b63dbead45).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    @jkbradley @MLnick @holdenk If there are no more questions about this change, maybe we can get it into 2.2?



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75495/testReport)** for PR 17494 at commit [`8936880`](https://github.com/apache/spark/commit/8936880bafd8a8520011e663c0edc3b428b9160f).



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75430/testReport)** for PR 17494 at commit [`9b81ce9`](https://github.com/apache/spark/commit/9b81ce9045f179e7bb481665b4491f5949eade47).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Correlation(object):`



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75564/
    Test PASSed.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110097832
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    Replaced it.



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75574 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75574/testReport)** for PR 17494 at commit [`601d9eb`](https://github.com/apache/spark/commit/601d9ebd3cf1f427e8b8859b921511ff839747ea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109136254
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
    @@ -56,7 +56,7 @@ object Correlation {
        *  Here is how to access the correlation coefficient:
        *  {{{
        *    val data: Dataset[Vector] = ...
    -   *    val Row(coeff: Matrix) = Statistics.corr(data, "value").head
    +   *    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
        *    // coeff now contains the Pearson correlation matrix.
        *  }}}
        *
    --- End diff --
    
    While we're here - below it says "cache the input RDD" but that should be "the input Dataset".



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109144067
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala ---
    @@ -56,7 +56,7 @@ object Correlation {
        *  Here is how to access the correlation coefficient:
        *  {{{
        *    val data: Dataset[Vector] = ...
    -   *    val Row(coeff: Matrix) = Statistics.corr(data, "value").head
    +   *    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
        *    // coeff now contains the Pearson correlation matrix.
        *  }}}
        *
    --- End diff --
    
    OK. Fixed it.



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r110105943
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,67 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    +
    +    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
    --- End diff --
    
    Sorry, I picked up that the doc gen will fail here - there needs to be 2 spaces before the start of each subsequent line, like this:
    
    ```
    .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
      and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
      which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
      to avoid recomputing the common lineage.
    ```
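
The indentation rule described above can be checked mechanically. The following is a minimal sketch, not part of the patch: `note_continuations_indented` is a hypothetical helper, and it assumes a simplified model in which the body of a `.. note::` directive runs until the first blank line. It only reports whether every continuation line is indented further than the directive itself.

```python
def note_continuations_indented(docstring):
    """Return True if every `.. note::` body line is indented past its directive.

    Simplifying assumption: the directive body ends at the first blank line.
    """
    lines = docstring.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        stripped = line.lstrip()
        if stripped.startswith(".. note::"):
            # Column at which the directive itself starts.
            directive_indent = len(line) - len(stripped)
            i += 1
            # Walk the continuation lines up to the next blank line.
            while i < len(lines) and lines[i].strip():
                body_indent = len(lines[i]) - len(lines[i].lstrip())
                if body_indent <= directive_indent:
                    return False
                i += 1
        else:
            i += 1
    return True
```

Under this model, a note whose second line starts in the same column as `.. note::` is flagged, while the two-space form shown in the comment passes.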



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75574/testReport)** for PR 17494 at commit [`601d9eb`](https://github.com/apache/spark/commit/601d9ebd3cf1f427e8b8859b921511ff839747ea).



[GitHub] spark pull request #17494: [SPARK-20076][ML][PySpark] Add Python interface f...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17494#discussion_r109538706
  
    --- Diff: python/pyspark/ml/stat.py ---
    @@ -71,6 +71,62 @@ def test(dataset, featuresCol, labelCol):
             return _java2py(sc, javaTestObj.test(*args))
     
     
    +class Correlation(object):
    +    """
    +    .. note:: Experimental
    +
    +    Compute the correlation matrix for the input dataset of Vectors using the specified method.
    +    Methods currently supported: `pearson` (default), `spearman`.
    --- End diff --
    
    The Scala documentation has a warning that suggests caching when using Spearman. Would it make sense to copy that warning over as well?
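
For context on why Spearman is the more expensive method: it is a rank correlation, so each column has to be sorted to obtain ranks before a Pearson correlation is computed on those ranks. The sketch below illustrates this in plain Python on two small lists. It is an illustration only, not the Spark implementation; `average_ranks`, `pearson`, and `spearman` are hypothetical names chosen for this example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    # Sorting the column is the step that makes Spearman costlier than Pearson.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

For a monotonic but non-linear pair such as `[1, 2, 3, 4, 5]` and their cubes, `spearman` returns 1.0 while `pearson` stays below 1, which is the practical difference between the two methods.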



[GitHub] spark issue #17494: [SPARK-20076][ML][PySpark] Add Python interface for ml.s...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17494
  
    **[Test build #75496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75496/testReport)** for PR 17494 at commit [`fbcc1fe`](https://github.com/apache/spark/commit/fbcc1fe1c8e2652dc54c2ebfacce01a3f69449a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

