You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2015/10/27 08:49:12 UTC

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/9293

    [SPARK-11302] [MLLIB] Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases

    Compute sigma pseudo-inverse without square root to avoid precision problems
    
    CC @mengxr @jkbradley 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-11302

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9293
    
----
commit b375ce5029711998728daad1235f293fbd3bfa9a
Author: Sean Owen <so...@cloudera.com>
Date:   2015-10-27T07:47:47Z

    Compute sigma pseudo-inverse without square root to avoid precision problems

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/9293


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9293#discussion_r43132731
  
    --- Diff: python/pyspark/mllib/clustering.py ---
    @@ -218,9 +218,9 @@ class GaussianMixtureModel(JavaModelWrapper, JavaSaveable, JavaLoader):
         >>> model = GaussianMixture.train(clusterdata_2, 2, convergenceTol=0.0001,
         ...                               maxIterations=150, seed=10)
         >>> labels = model.predict(clusterdata_2).collect()
    -    >>> labels[0]==labels[1]==labels[2]
    +    >>> labels[0]==labels[1]
    --- End diff --
    
    If you look at the test data, it's obviously constructed so that the first 2 points cluster together and the other 3 cluster together. I verified this is what `Mclust` gives in R as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151547293
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44436/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151547289
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151529925
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9293#discussion_r43184703
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/distribution/MultivariateGaussian.scala ---
    @@ -126,11 +120,11 @@ class MultivariateGaussian @Since("1.3.0") (
           // log(pseudo-determinant) is sum of the logs of all non-zero singular values
           val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
     
    -      // calculate the root-pseudo-inverse of the diagonal matrix of singular values
    -      // by inverting the square root of all non-zero values
    -      val pinvS = diag(new DBV(d.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0).toArray))
    +      // calculate the pseudo-inverse of the diagonal matrix of singular values
    +      // by inverting the non-zero values
    +      val pinvS = diag(new DBV(d.map(v => if (v > tol) 1.0 / v else 0.0).toArray))
     
    -      (pinvS * u, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    +      (u * pinvS * u.t, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    --- End diff --
    
    Adding `v.t * v` also puts in an extra matrix-matrix multiply. But yes I see your point that `u.t` alone was the likely original bug. If that fixes it, it's a simpler change and yes that does cost one less matrix multiply. Have a look at https://github.com/apache/spark/pull/9309


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9293#discussion_r43182091
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/distribution/MultivariateGaussian.scala ---
    @@ -126,11 +120,11 @@ class MultivariateGaussian @Since("1.3.0") (
           // log(pseudo-determinant) is sum of the logs of all non-zero singular values
           val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
     
    -      // calculate the root-pseudo-inverse of the diagonal matrix of singular values
    -      // by inverting the square root of all non-zero values
    -      val pinvS = diag(new DBV(d.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0).toArray))
    +      // calculate the pseudo-inverse of the diagonal matrix of singular values
    +      // by inverting the non-zero values
    +      val pinvS = diag(new DBV(d.map(v => if (v > tol) 1.0 / v else 0.0).toArray))
     
    -      (pinvS * u, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    +      (u * pinvS * u.t, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    --- End diff --
    
    There is an extra matrix-matrix multiplication in `u * pinvS * u.t`. I think the bug is in line 133, where we should use `pinvS * u.t` instead of `pinvS * u`. Could you check this solution? Some comments need updates too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151546918
  
    **[Test build #44436 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44436/consoleFull)** for PR 9293 at commit [`fe431ac`](https://github.com/apache/spark/commit/fe431ac6dcbedbe7cc46071d4dc5696bda4dec46).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151529888
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151404857
  
    Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151451985
  
    **[Test build #1956 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1956/consoleFull)** for PR 9293 at commit [`b375ce5`](https://github.com/apache/spark/commit/b375ce5029711998728daad1235f293fbd3bfa9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151408072
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151404807
  
     Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151408073
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44411/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9293#discussion_r43142466
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/distribution/MultivariateGaussian.scala ---
    @@ -126,11 +120,11 @@ class MultivariateGaussian @Since("1.3.0") (
           // log(pseudo-determinant) is sum of the logs of all non-zero singular values
           val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
     
    -      // calculate the root-pseudo-inverse of the diagonal matrix of singular values
    -      // by inverting the square root of all non-zero values
    -      val pinvS = diag(new DBV(d.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0).toArray))
    +      // calculate the pseudo-inverse of the diagonal matrix of singular values
    +      // by inverting the non-zero values
    +      val pinvS = diag(new DBV(d.map(v => if (v > tol) 1.0 / v else 0.0).toArray))
     
    -      (pinvS * u, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    +      (u * pinvS * u.t, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    --- End diff --
    
    Is the old formula missing the following:
    
    ~~~
    val v = pinvS * u
    (v.t * v, ...)
    ~~~
    
    I think using the root inverse should be cheaper and more accurate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151533560
  
    **[Test build #44436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44436/consoleFull)** for PR 9293 at commit [`fe431ac`](https://github.com/apache/spark/commit/fe431ac6dcbedbe7cc46071d4dc5696bda4dec46).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9293#discussion_r43151694
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/distribution/MultivariateGaussian.scala ---
    @@ -126,11 +120,11 @@ class MultivariateGaussian @Since("1.3.0") (
           // log(pseudo-determinant) is sum of the logs of all non-zero singular values
           val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
     
    -      // calculate the root-pseudo-inverse of the diagonal matrix of singular values
    -      // by inverting the square root of all non-zero values
    -      val pinvS = diag(new DBV(d.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0).toArray))
    +      // calculate the pseudo-inverse of the diagonal matrix of singular values
    +      // by inverting the non-zero values
    +      val pinvS = diag(new DBV(d.map(v => if (v > tol) 1.0 / v else 0.0).toArray))
     
    -      (pinvS * u, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    +      (u * pinvS * u.t, -0.5 * (mu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))
    --- End diff --
    
    Yes that's equivalent numerically, and would require the other changes above. I'm not clear why it's better to do it this way though? it takes longer to take the square root of the eigenvalues, and then they're just multiplied back together. It's the same number of operations here and above otherwise.
    
    I think the evidence that it's not accurate enough is the case in the JIRA and tests here, and also the Pyspark test that is wrong at the moment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9293#issuecomment-151475411
  
    **[Test build #1956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1956/consoleFull)** for PR 9293 at commit [`b375ce5`](https://github.com/apache/spark/commit/b375ce5029711998728daad1235f293fbd3bfa9a).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org