Posted to reviews@spark.apache.org by dorx <gi...@git.apache.org> on 2014/08/01 08:28:33 UTC

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

GitHub user dorx opened a pull request:

    https://github.com/apache/spark/pull/1713

    [SPARK-2786][mllib] Python correlations

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dorx/spark pythonCorrelation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1713
    
----
commit cd163d6e8d3535d8dac1dd8321831092ef52c995
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-29T18:44:07Z

    WIP

commit d199f1fc84113c0eb541f895ea04ae57fdd77de2
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-29T21:20:36Z

    Moved correlation names into a public object

commit 9141a637770f62eae950d28013537a7b1812229f
Author: Doris Xin <do...@gmail.com>
Date:   2014-07-31T19:24:15Z

    WIP2

commit cc9f725ec54ae294bec9caf0986b9130cc407a7d
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T06:00:09Z

    units passed.

commit eb5bf5692d63226bf71777b48c63c18fde4f38d4
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T06:11:57Z

    merge master

commit e69d4462c394eb11d9b7f8b398666ce36c026dec
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T06:21:45Z

    fixed missed conflicts.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50919220
  
    QA tests have started for PR 1713. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17688/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15718783
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    +    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
    --- End diff --
    
    Got it.



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15717671
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
    @@ -49,43 +49,48 @@ private[stat] trait Correlation {
     }
     
     /**
    - * Delegates computation to the specific correlation object based on the input method name
    - *
    - * Currently supported correlations: pearson, spearman.
    - * After new correlation algorithms are added, please update the documentation here and in
    - * Statistics.scala for the correlation APIs.
    - *
    - * Maintains the default correlation type, pearson
    + * Delegates computation to the specific correlation object based on the input method name.
      */
     private[stat] object Correlations {
     
    -  // Note: after new types of correlations are implemented, please update this map
    -  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
    -  val defaultCorrName: String = "pearson"
    -  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
    -
    -  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
    +  def corr(x: RDD[Double],
    +       y: RDD[Double],
    +       method: String = CorrelationNames.defaultCorrName): Double = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelation(x, y)
       }
     
    -  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
    +  def corrMatrix(X: RDD[Vector],
    --- End diff --
    
    @mengxr Is this what you meant? This isn't public, and it's written this way to avoid confusing the Scala compiler.



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50853328
  
    QA tests have started for PR 1713. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17653/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15718765
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
    @@ -49,43 +49,48 @@ private[stat] trait Correlation {
     }
     
     /**
    - * Delegates computation to the specific correlation object based on the input method name
    - *
    - * Currently supported correlations: pearson, spearman.
    - * After new correlation algorithms are added, please update the documentation here and in
    - * Statistics.scala for the correlation APIs.
    - *
    - * Maintains the default correlation type, pearson
    + * Delegates computation to the specific correlation object based on the input method name.
      */
     private[stat] object Correlations {
     
    -  // Note: after new types of correlations are implemented, please update this map
    -  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
    -  val defaultCorrName: String = "pearson"
    -  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
    -
    -  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
    +  def corr(x: RDD[Double],
    +       y: RDD[Double],
    +       method: String = CorrelationNames.defaultCorrName): Double = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelation(x, y)
       }
     
    -  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
    +  def corrMatrix(X: RDD[Vector],
    --- End diff --
    
    Yes. Sorry, I didn't see the `private` tag.



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15717240
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    +    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
    --- End diff --
    
    nit: `X.rdd.map(deserializeDoubleVector)` (`(_)` is not necessary)



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15684998
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    +    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
    +    val result = Statistics.corr(inputMatrix, getCorrNameOrDefault(method))
    +    serializeDoubleMatrix(to2dArray(result))
    +  }
    +
    +  /**
    +   * Java stub for mllib Statistics.corr(x: RDD[Double], y: RDD[Double], method: String).
    +   */
    +  def corr(x: JavaRDD[Array[Byte]], y: JavaRDD[Array[Byte]], method: String): Double = {
    +    val xDeser = x.rdd.map(deserializeDouble(_))
    +    val yDeser = y.rdd.map(deserializeDouble(_))
    +    Statistics.corr(xDeser, yDeser, getCorrNameOrDefault(method))
    +  }
    +
    +  // used by the corr methods to retrieve the name of the correlation method passed in via pyspark
    +  private def getCorrNameOrDefault(method: String) = {
    +    if (method == null) CorrelationNames.defaultCorrName else method
    +  }
    +
    +  // Reformat a Matrix into Array[Array[Double]] for serialization
    +  private[python] def to2dArray(matrix: Matrix): Array[Array[Double]] = {
    +    val values = matrix.toArray.toIterator
    --- End diff --
    
    An iterator is slow here. `Array.tabulate(matrix.numRows, matrix.numCols)((i, j) => values(i + j * matrix.numRows))` may be faster.
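
    For concreteness, a minimal standalone sketch of that suggestion (the
    `Matrices.dense` example values are taken from the test suite in this PR;
    `values` here is the plain array from `toArray`, not an iterator):

        import org.apache.spark.mllib.linalg.{Matrices, Matrix}

        // mllib matrices are column-major: entry (i, j) lives at flat index
        // i + j * numRows in the array returned by toArray.
        def to2dArray(matrix: Matrix): Array[Array[Double]] = {
          val values = matrix.toArray
          Array.tabulate(matrix.numRows, matrix.numCols)((i, j) =>
            values(i + j * matrix.numRows))
        }

        // 2x3 matrix with columns (0, 1.2), (3, 4.56), (7, 8):
        val m = Matrices.dense(2, 3, Array[Double](0, 1.2, 3, 4.56, 7, 8))
        to2dArray(m)  // Array(Array(0.0, 3.0, 7.0), Array(1.2, 4.56, 8.0))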



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50853374
  
    QA results for PR 1713:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
        class Statistics(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17653/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15685067
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/api/python/PythonMLLibAPISuite.scala ---
    @@ -59,10 +59,25 @@ class PythonMLLibAPISuite extends FunSuite {
       }
     
       test("double serialization") {
    -    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue)) {
    +    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue, Double.NaN)) {
           val bytes = py.serializeDouble(x)
           val deser = py.deserializeDouble(bytes)
    -      assert(x === deser)
    +      // We use `equals` here for comparison because we cannot use `==` for NaN
    +      assert(x.equals(deser))
         }
       }
    +
    +  test("matrix to 2D array") {
    +    val values = Array[Double](0, 1.2, 3, 4.56, 7, 8)
    +    val matrix = Matrices.dense(2, 3, values)
    +    val twoDarray = py.to2dArray(matrix)
    +    val expected = Array(Array[Double](0, 1.2, 3), Array[Double](4.56, 7, 8))
    --- End diff --
    
    The expected value should be `(0, 3, 7), (1.2, 4.56, 8)`
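
    For reference, a quick check of the column-major layout behind that
    correction (same values as the test above):

        import org.apache.spark.mllib.linalg.Matrices

        // The flat values fill columns first: the columns are (0, 1.2),
        // (3, 4.56), (7, 8), so the rows are (0, 3, 7) and (1.2, 4.56, 8).
        val m = Matrices.dense(2, 3, Array[Double](0, 1.2, 3, 4.56, 7, 8))
        m.toArray(0 + 2 * m.numRows)  // 7.0: the entry at row 0, column 2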



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15685005
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    +    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
    +    val result = Statistics.corr(inputMatrix, getCorrNameOrDefault(method))
    +    serializeDoubleMatrix(to2dArray(result))
    +  }
    +
    +  /**
    +   * Java stub for mllib Statistics.corr(x: RDD[Double], y: RDD[Double], method: String).
    +   */
    +  def corr(x: JavaRDD[Array[Byte]], y: JavaRDD[Array[Byte]], method: String): Double = {
    +    val xDeser = x.rdd.map(deserializeDouble(_))
    +    val yDeser = y.rdd.map(deserializeDouble(_))
    +    Statistics.corr(xDeser, yDeser, getCorrNameOrDefault(method))
    +  }
    +
    +  // used by the corr methods to retrieve the name of the correlation method passed in via pyspark
    +  private def getCorrNameOrDefault(method: String) = {
    +    if (method == null) CorrelationNames.defaultCorrName else method
    +  }
    +
    +  // Reformat a Matrix into Array[Array[Double]] for serialization
    +  private[python] def to2dArray(matrix: Matrix): Array[Array[Double]] = {
    +    val values = matrix.toArray.toIterator
    +    Array.fill(matrix.numRows)(Array.fill(matrix.numCols)(values.next()))
    --- End diff --
    
    `mllib.Matrix` is column-major, but `Array[Array[Double]]` is an array of rows, so I don't think this is correct. The error doesn't show up in the Python tests because a correlation matrix is symmetric.
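
    A minimal standalone sketch of the mix-up, using the same 2x3 values as
    the test suite in this PR (no Spark required):

        // Column-major entries of a 2x3 matrix whose actual rows are
        // (0, 3, 7) and (1.2, 4.56, 8).
        val values = Array[Double](0, 1.2, 3, 4.56, 7, 8).toIterator

        // Filling row by row from a column-major iterator just chops the
        // flat array into consecutive triples:
        val rows = Array.fill(2)(Array.fill(3)(values.next()))
        // rows == Array(Array(0.0, 1.2, 3.0), Array(4.56, 7.0, 8.0)),
        // which is not the actual rows of the matrix.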



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15684617
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -0,0 +1,103 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for statistical functions in MLlib.
    +"""
    +
    +from pyspark.mllib._common import \
    +    _get_unmangled_double_vector_rdd, _get_unmangled_rdd, \
    +    _serialize_double, _serialize_double_vector, \
    +    _deserialize_double, _deserialize_double_matrix
    +
    +class Statistics(object):
    +
    +    @staticmethod
    +    def corr(x, y=None, method=None):
    +        """
    +        Compute the correlation (matrix) for the input RDD(s) using the
    +        specified method.
    +        Methods currently supported: I{pearson (default), spearman}.
    +
    +        If a single RDD of Vectors is passed in, a correlation matrix
    +        comparing the columns in the input RDD is returned. Note that the
    +        method name can be passed in as the second argument without C{method=}.
    --- End diff --
    
    So it also "supports" `corr(x, y="spearman")`. This adds some convenience along with some confusion. If `X` is a matrix, the user should use `corr(X, method="spearman")`. Example: http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50925018
  
    QA results for PR 1713:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
        class Statistics(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17688/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15717696
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -0,0 +1,103 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for statistical functions in MLlib.
    +"""
    +
    +from pyspark.mllib._common import \
    +    _get_unmangled_double_vector_rdd, _get_unmangled_rdd, \
    +    _serialize_double, _serialize_double_vector, \
    +    _deserialize_double, _deserialize_double_matrix
    +
    +class Statistics(object):
    +
    +    @staticmethod
    +    def corr(x, y=None, method=None):
    +        """
    +        Compute the correlation (matrix) for the input RDD(s) using the
    +        specified method.
    +        Methods currently supported: I{pearson (default), spearman}.
    +
    +        If a single RDD of Vectors is passed in, a correlation matrix
    +        comparing the columns in the input RDD is returned. Note that the
    +        method name can be passed in as the second argument without C{method=}.
    --- End diff --
    
    The doc was not updated. Please remove `Note that ... without C{method=}.`



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50854587
  
    QA tests have started for PR 1713. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17657/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50933645
  
    QA tests have started for PR 1713. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17698/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50939505
  
    QA results for PR 1713:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
        class Statistics(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17698/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15685050
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/api/python/PythonMLLibAPISuite.scala ---
    @@ -59,10 +59,25 @@ class PythonMLLibAPISuite extends FunSuite {
       }
     
       test("double serialization") {
    -    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue)) {
    +    for (x <- List(123.0, -10.0, 0.0, Double.MaxValue, Double.MinValue, Double.NaN)) {
           val bytes = py.serializeDouble(x)
           val deser = py.deserializeDouble(bytes)
    -      assert(x === deser)
    +      // We use `equals` here for comparison because we cannot use `==` for NaN
    +      assert(x.equals(deser))
         }
       }
    +
    +  test("matrix to 2D array") {
    +    val values = Array[Double](0, 1.2, 3, 4.56, 7, 8)
    +    val matrix = Matrices.dense(2, 3, values)
    +    val twoDarray = py.to2dArray(matrix)
    --- End diff --
    
    `twoDArray` or `arr`? `twoDarray` doesn't follow the naming convention.
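
    As an aside on the `equals` vs `==` note in the quoted hunk, a minimal
    illustration of the difference for NaN:

        val x = Double.NaN
        x == x        // false: IEEE 754 comparison; NaN is never == anything
        x.equals(x)   // true: boxed equality compares the underlying bits,
                      // so NaN compares equal to itself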



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50939858
  
    LGTM. Merged into master. Thanks!!



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15709151
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    --- End diff --
    
    This was designed to match R's method name. R is a good model here since it also allows passing in the method name as a string after the input arrays/matrix.



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15684640
  
    --- Diff: python/pyspark/mllib/stat.py ---
    @@ -0,0 +1,103 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for statistical functions in MLlib.
    +"""
    +
    +from pyspark.mllib._common import \
    +    _get_unmangled_double_vector_rdd, _get_unmangled_rdd, \
    +    _serialize_double, _serialize_double_vector, \
    +    _deserialize_double, _deserialize_double_matrix
    +
    +class Statistics(object):
    +
    +    @staticmethod
    +    def corr(x, y=None, method=None):
    +        """
    +        Compute the correlation (matrix) for the input RDD(s) using the
    +        specified method.
    +        Methods currently supported: I{pearson (default), spearman}.
    +
    +        If a single RDD of Vectors is passed in, a correlation matrix
    +        comparing the columns in the input RDD is returned. Note that the
    +        method name can be passed in as the second argument without C{method=}.
    +        If two RDDs of floats are passed in, a single float is returned.
    +
    +        >>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
    +        >>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
    +        >>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
    +        >>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
    +        True
    +        >>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
    +        True
    +        >>> Statistics.corr(x, y, "spearman")
    +        0.5
    +        >>> from math import isnan
    +        >>> isnan(Statistics.corr(x, zeros))
    +        True
    +        >>> from linalg import Vectors
    +        >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
    +        ...                       Vectors.dense([6, 7, 0,  8]), Vectors.dense([9, 0, 0, 1])])
    +        >>> Statistics.corr(rdd)
    +        array([[ 1.        ,  0.05564149,         nan,  0.40047142],
    +               [ 0.05564149,  1.        ,         nan,  0.91359586],
    +               [        nan,         nan,  1.        ,         nan],
    +               [ 0.40047142,  0.91359586,         nan,  1.        ]])
    +        >>> Statistics.corr(rdd, "spearman")
    +        array([[ 1.        ,  0.10540926,         nan,  0.4       ],
    +               [ 0.10540926,  1.        ,         nan,  0.9486833 ],
    +               [        nan,         nan,  1.        ,         nan],
    +               [ 0.4       ,  0.9486833 ,         nan,  1.        ]])
    +        """
    +        sc = x.ctx
    +        # Check inputs to determine whether a single value or a matrix is needed for output.
    +        # Since it's legal for users to use the method name as the second argument, we need to
    +        # check if y is used to specify the method name instead.
    +        if type(y) == str:
    +            if not method:
    +                method = y
    +            else:
    +                raise TypeError("Multiple string arguments detected when at most one " \
    +                                + "is allowed for Statistics.corr")
    +        if not y or type(y) == str:
    +            try:
    +                Xser = _get_unmangled_double_vector_rdd(x)
    +            except TypeError:
    +                raise TypeError("corr called on a single RDD that does not consist of Vectors.")
    +            resultMat = sc._jvm.PythonMLLibAPI().corr(Xser._jrdd, method)
    +            return _deserialize_double_matrix(resultMat)
    +        else:
    +            xSer = _get_unmangled_rdd(x, _serialize_double)
    +            ySer = _get_unmangled_rdd(y, _serialize_double)
    +            result = sc._jvm.PythonMLLibAPI().corr(xSer._jrdd, ySer._jrdd, method)
    +            return result
    +
    +
    +def _test():
    +    import doctest
    +    from pyspark import SparkContext
    +    globs = globals().copy()
    +    globs['sc'] = SparkContext('local[4]', 'PythonTest', batchSize=2)
    +    (failure_count, test_count) = doctest.testmod(globs=globs, optionflags=doctest.ELLIPSIS)
    +    globs['sc'].stop()
    +    if failure_count:
    +        exit(-1)
    +
    +
    +if __name__ == "__main__":
    +    _test()
    --- End diff --
    
    Add a newline at the end of the file.



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15684424
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
    @@ -49,43 +49,48 @@ private[stat] trait Correlation {
     }
     
     /**
    - * Delegates computation to the specific correlation object based on the input method name
    - *
    - * Currently supported correlations: pearson, spearman.
    - * After new correlation algorithms are added, please update the documentation here and in
    - * Statistics.scala for the correlation APIs.
    - *
    - * Maintains the default correlation type, pearson
    + * Delegates computation to the specific correlation object based on the input method name.
      */
     private[stat] object Correlations {
     
    -  // Note: after new types of correlations are implemented, please update this map
    -  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
    -  val defaultCorrName: String = "pearson"
    -  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
    -
    -  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
    +  def corr(x: RDD[Double],
    +       y: RDD[Double],
    +       method: String = CorrelationNames.defaultCorrName): Double = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelation(x, y)
       }
     
    -  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
    +  def corrMatrix(X: RDD[Vector],
    +      method: String = CorrelationNames.defaultCorrName): Matrix = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelationMatrix(X)
       }
     
    -  /**
    -   * Match input correlation name with a known name via simple string matching
    -   *
    -   * private to stat for ease of unit testing
    -   */
    -  private[stat] def getCorrelationFromName(method: String): Correlation = {
    +  // Match input correlation name with a known name via simple string matching.
    +  def getCorrelationFromName(method: String): Correlation = {
         try {
    -      nameToObjectMap(method)
    +      CorrelationNames.nameToObjectMap(method)
         } catch {
           case nse: NoSuchElementException =>
             throw new IllegalArgumentException("Unrecognized method name. Supported correlations: "
    -          + nameToObjectMap.keys.mkString(", "))
    +          + CorrelationNames.nameToObjectMap.keys.mkString(", "))
         }
       }
     }
    +
    +/**
    + * Maintains supported and default correlation names.
    + *
    + * Currently supported correlations: `pearson`, `spearman`.
    + * Current default correlation: `pearson`.
    + *
    + * After new correlation algorithms are added, please update the documentation here and in
    + * Statistics.scala for the correlation APIs.
    + */
    +object CorrelationNames {
    --- End diff --
    
    Is it supposed to be a public API?



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15689358
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    --- End diff --
    
    You can ignore this if you like, but if I'm being picky, why not spell out "correlations"?



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15684527
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
    @@ -49,43 +49,48 @@ private[stat] trait Correlation {
     }
     
     /**
    - * Delegates computation to the specific correlation object based on the input method name
    - *
    - * Currently supported correlations: pearson, spearman.
    - * After new correlation algorithms are added, please update the documentation here and in
    - * Statistics.scala for the correlation APIs.
    - *
    - * Maintains the default correlation type, pearson
    + * Delegates computation to the specific correlation object based on the input method name.
      */
     private[stat] object Correlations {
     
    -  // Note: after new types of correlations are implemented, please update this map
    -  val nameToObjectMap = Map(("pearson", PearsonCorrelation), ("spearman", SpearmanCorrelation))
    -  val defaultCorrName: String = "pearson"
    -  val defaultCorr: Correlation = nameToObjectMap(defaultCorrName)
    -
    -  def corr(x: RDD[Double], y: RDD[Double], method: String = defaultCorrName): Double = {
    +  def corr(x: RDD[Double],
    +       y: RDD[Double],
    +       method: String = CorrelationNames.defaultCorrName): Double = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelation(x, y)
       }
     
    -  def corrMatrix(X: RDD[Vector], method: String = defaultCorrName): Matrix = {
    +  def corrMatrix(X: RDD[Vector],
    +      method: String = CorrelationNames.defaultCorrName): Matrix = {
         val correlation = getCorrelationFromName(method)
         correlation.computeCorrelationMatrix(X)
       }
     
    -  /**
    -   * Match input correlation name with a known name via simple string matching
    -   *
    -   * private to stat for ease of unit testing
    -   */
    -  private[stat] def getCorrelationFromName(method: String): Correlation = {
    +  // Match input correlation name with a known name via simple string matching.
    +  def getCorrelationFromName(method: String): Correlation = {
         try {
    -      nameToObjectMap(method)
    +      CorrelationNames.nameToObjectMap(method)
         } catch {
           case nse: NoSuchElementException =>
             throw new IllegalArgumentException("Unrecognized method name. Supported correlations: "
    -          + nameToObjectMap.keys.mkString(", "))
    +          + CorrelationNames.nameToObjectMap.keys.mkString(", "))
         }
       }
     }
    +
    +/**
    + * Maintains supported and default correlation names.
    + *
    + * Currently supported correlations: `pearson`, `spearman`.
    + * Current default correlation: `pearson`.
    + *
    + * After new correlation algorithms are added, please update the documentation here and in
    + * Statistics.scala for the correlation APIs.
    + */
    +object CorrelationNames {
    --- End diff --
    
    I originally planned on using it inside pyspark, but `private[mllib]` is sufficient scope now.
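
    A minimal sketch of the narrowed declaration (only the default name is
    shown; the object body is abbreviated from the diff above):

        package org.apache.spark.mllib.stat.correlation

        // Visible anywhere under org.apache.spark.mllib (including the Java
        // stubs in mllib.api.python), but excluded from the public API.
        private[mllib] object CorrelationNames {
          val defaultCorrName: String = "pearson"
        }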



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1713#issuecomment-50858129
  
    QA results for PR 1713:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
        class Statistics(object):

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17657/consoleFull



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1713



[GitHub] spark pull request: [SPARK-2786][mllib] Python correlations

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1713#discussion_r15717902
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -456,6 +458,37 @@ class PythonMLLibAPI extends Serializable {
         ALS.trainImplicit(ratings, rank, iterations, lambda, blocks, alpha)
       }
     
    +  /**
    +   * Java stub for mllib Statistics.corr(X: RDD[Vector], method: String).
    +   * Returns the correlation matrix serialized into a byte array understood by deserializers in
    +   * pyspark.
    +   */
    +  def corr(X: JavaRDD[Array[Byte]], method: String): Array[Byte] = {
    +    val inputMatrix = X.rdd.map(deserializeDoubleVector(_))
    --- End diff --
    
    Actually it is necessary in this case, since `deserializeDoubleVector` takes two arguments (with `offset` being optional).
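
    A minimal standalone sketch of why, with a simplified stand-in for
    `deserializeDoubleVector` (the name `deser` is hypothetical):

        // Two parameters, the second with a default value:
        def deser(bytes: Array[Byte], offset: Int = 0): Byte = bytes(offset)

        val data = Seq(Array[Byte](1), Array[Byte](2))
        // data.map(deser)  // does not compile: eta-expansion yields a
        //                  // (Array[Byte], Int) => Byte, not Array[Byte] => Byte
        data.map(deser(_))  // compiles as x => deser(x); offset takes its default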

