You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MechCoder <gi...@git.apache.org> on 2015/02/17 20:18:38 UTC

[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

GitHub user MechCoder opened a pull request:

    https://github.com/apache/spark/pull/4654

    [SPARK-5016] Distribute Gaussian Initialization in GaussianMixture

    Following discussion in the JIRA

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MechCoder/spark spark-5016

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4654.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4654
    
----
commit e2fa4539b1edcd27935021ca0c51e8fbb6745e53
Author: MechCoder <ma...@gmail.com>
Date:   2015-02-17T19:16:39Z

    [SPARK-5016] Distribute Gaussian Initialization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74734084
  
    I could not distribute the other Gaussian update, since this line (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L176) is interdependent on all k reducers/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74814932
  
      [Test build #27673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27673/consoleFull) for   PR 4654 at commit [`4b217ae`](https://github.com/apache/spark/commit/4b217aec90d342cd03188f0f941a5df914dbea0b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-75285426
  
    @tgaloppo Sorry for my noobness, all my work on MLlib has been on a single machine. I am not really sure how to run it on a cluster (and hence was verifying if my previous comment was the way to enable the cluster mode)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76954011
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28231/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74814934
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27673/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24846111
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -135,25 +135,39 @@ class GaussianMixture private (
         val breezeData = data.map(_.toBreeze).cache()
         
         // Get length of the input vectors
    -    val d = breezeData.first().length
    -    
    +    val tmp = breezeData.first()
    +    val (d, nFeatures) = (tmp.length, tmp.size)
    +
    +    val distributeGaussian = if (k >= 10 && nFeatures >= 10) true else false
         // Determine initial weights and corresponding Gaussians.
         // If the user supplied an initial GMM, we use those values, otherwise
         // we start with uniform weights, a random mean from the data, and
         // diagonal covariance matrices using component variances
    -    // derived from the samples    
    +    // derived from the samples.
         val (weights, gaussians) = initialModel match {
           case Some(gmm) => (gmm.weights, gmm.gaussians)
           
           case None => {
             val samples = breezeData.takeSample(withReplacement = true, k * nSamples, seed)
    -        (Array.fill(k)(1.0 / k), Array.tabulate(k) { i => 
    -          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    -          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    -        })
    +        val weights = Array.fill(k)(1.0 / k)
    +        val gaussians = {
    +          if (distributeGaussian) {
    --- End diff --
    
    Just a second, please do not review yet.  I need to make a few other changes as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24848991
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -135,25 +135,39 @@ class GaussianMixture private (
         val breezeData = data.map(_.toBreeze).cache()
         
         // Get length of the input vectors
    -    val d = breezeData.first().length
    -    
    +    val tmp = breezeData.first()
    +    val (d, nFeatures) = (tmp.length, tmp.size)
    +
    +    val distributeGaussian = true //if (k >= 10 && nFeatures >= 10) true else false
    --- End diff --
    
    scala style is breaking here; need space after //



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74749527
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27639/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-75236020
  
    @MechCoder I mean making sure this is run on a cluster and not just on a single machine. My hypothesis is that cost of distributing the tasks to the cluster nodes (and collecting the results) will exceed the time saved by parallelizing the Gaussian initializations for data dimensions within the operational utility of the algorithm (as discussed on Jira).  On a single [multi-core] machine the savings may be observable, since there is no network to contend with.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76954005
  
      [Test build #28231 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28231/consoleFull) for   PR 4654 at commit [`54db317`](https://github.com/apache/spark/commit/54db31752d39b4345985e41abdcce26e5534a268).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76669702
  
    Hi, Sorry for taking so much time to get back to this. I want to generate some random data to write this script using breeze. But I unable to understand how the random seed should be fixed to reproduce the data (for the distributed and non distributed version).
    
        val g= randomDouble((5, 5), (2., 20.))
    
    How do I supply the seed to randomDouble. Sorry if it is obvious !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74733737
  
      [Test build #27639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27639/consoleFull) for   PR 4654 at commit [`e2fa453`](https://github.com/apache/spark/commit/e2fa4539b1edcd27935021ca0c51e8fbb6745e53).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74734555
  
    Err.. No. Should have been some other error while I tested it. Will update that in a while.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74733959
  
    Could you please add a description (based on the JIRA info)?  That makes it easier for others to review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74951424
  
    @MechCoder Could you share some performance comparison results?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-75225799
  
    Just to clarify, by cluster mode do you mean running `./bin/spark-shell --master spark://manoj-X550LD:7077` where the url is generated by doing `./sbin/start-master.sh` and profiling there?
    Do I need to do anything else?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74740596
  
    @tgaloppo I've addressed the issue with distributing the Gaussian updates, in the latest commit. But it breaks tests (Note that I've set distributeGaussian explicitly to True). Would be great if you could have a look. I know it is something really silly, but I'm unable to figure it out myself.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74734037
  
    Also, will you be able to do some timing tests?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74734747
  
    Do you want me to time any specific data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24877024
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -168,16 +182,26 @@ class GaussianMixture private (
           // Create new distributions based on the partial assignments
           // (often referred to as the "M" step in literature)
           val sumWeights = sums.weights.sum
    -      var i = 0
    -      while (i < k) {
    -        val mu = sums.means(i) / sums.weights(i)
    -        BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu),
    -          Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
    -        weights(i) = sums.weights(i) / sumWeights
    -        gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
    -        i = i + 1
    +      if (distributeGaussian) {
    +        sc.parallelize(0 until k).map{ i =>
    +          val mu = sums.means(i) / sums.weights(i)
    +          BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu),
    +            Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
    +          weights(i) = sums.weights(i) / sumWeights
    --- End diff --
    
    Ah I see, thanks for the explanation. I'll update this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74740673
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27644/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24846194
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -135,25 +135,39 @@ class GaussianMixture private (
         val breezeData = data.map(_.toBreeze).cache()
         
         // Get length of the input vectors
    -    val d = breezeData.first().length
    -    
    +    val tmp = breezeData.first()
    +    val (d, nFeatures) = (tmp.length, tmp.size)
    +
    +    val distributeGaussian = if (k >= 10 && nFeatures >= 10) true else false
         // Determine initial weights and corresponding Gaussians.
         // If the user supplied an initial GMM, we use those values, otherwise
         // we start with uniform weights, a random mean from the data, and
         // diagonal covariance matrices using component variances
    -    // derived from the samples    
    +    // derived from the samples.
         val (weights, gaussians) = initialModel match {
           case Some(gmm) => (gmm.weights, gmm.gaussians)
           
           case None => {
             val samples = breezeData.takeSample(withReplacement = true, k * nSamples, seed)
    -        (Array.fill(k)(1.0 / k), Array.tabulate(k) { i => 
    -          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    -          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    -        })
    +        val weights = Array.fill(k)(1.0 / k)
    +        val gaussians = {
    +          if (distributeGaussian) {
    +            Array.tabulate(k) { i =>
    +              val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +              new MultivariateGaussian(vectorMean(slice), initCovariance(slice))
    +            }
    +          }
    +          else {
    +            sc.parallelize(0 until k).map { i =>
    +              val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +              new MultivariateGaussian(vectorMean(slice), initCovariance(slice))
    +            }
    +          }.collect
    +        }
    +        (weights, gaussians)
           }
         }
    -    
    +
    --- End diff --
    
    This only parallelizes the very first set of gaussian initializations... even if the parallelization is effective (which, as discussed on Jira, I have my doubts), this will be a very small gain... better to focus on the initializations at line 191 (in this revision), which occur with each iteration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-75745207
  
    Great, I'll do it tomorrow after I'm done with my exams.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24850641
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -168,16 +182,26 @@ class GaussianMixture private (
           // Create new distributions based on the partial assignments
           // (often referred to as the "M" step in literature)
           val sumWeights = sums.weights.sum
    -      var i = 0
    -      while (i < k) {
    -        val mu = sums.means(i) / sums.weights(i)
    -        BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu),
    -          Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
    -        weights(i) = sums.weights(i) / sumWeights
    -        gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
    -        i = i + 1
    +      if (distributeGaussian) {
    +        sc.parallelize(0 until k).map{ i =>
    +          val mu = sums.means(i) / sums.weights(i)
    +          BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu),
    +            Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
    +          weights(i) = sums.weights(i) / sumWeights
    --- End diff --
    
    Doing this does not get the updated weights/gaussians back to the driver (ie, the weights/gaussians arrays operated on here are local to the node running the task, only).  Try something like the code I posted on Jira for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-75479190
  
    @MechCoder   The actual code should be the same, except that you might load the data from some other source and you might re-partition it.  If you don't have access to a cluster, would you still be able to prep test scripts or examples to try a range of dimensions and time everything?  Then one of us could try running it on a cluster.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74740497
  
      [Test build #27644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27644/consoleFull) for   PR 4654 at commit [`48f4b05`](https://github.com/apache/spark/commit/48f4b0598cbfbc43dc2e6402c68279d2cdb0c1ed).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74768749
  
    I don't think you need to test on specific data; synthetic data is fine. But I'd try a range of values for numFeatures and k, and in the cluster setting as @tgaloppo suggested.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76701605
  
    Hmm. I came up with this, but surely there should be a more elegant way of doing it.
    
        import scala.util.Random
        rng = Random
        rng.setSeed(0)
        Array.tabulate(nSamples){i => Vectors.dense(Array.fill(nFeatures)(rng.nextDouble))}
        
    Compared to the NumPy way of doing
    
        rng = np.random.RandomState(0)
        rng.rand(nSamples, nFeatures)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74808797
  
      [Test build #27673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27673/consoleFull) for   PR 4654 at commit [`4b217ae`](https://github.com/apache/spark/commit/4b217aec90d342cd03188f0f941a5df914dbea0b).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24845959
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -135,25 +135,39 @@ class GaussianMixture private (
         val breezeData = data.map(_.toBreeze).cache()
         
         // Get length of the input vectors
    -    val d = breezeData.first().length
    -    
    +    val tmp = breezeData.first()
    +    val (d, nFeatures) = (tmp.length, tmp.size)
    +
    +    val distributeGaussian = if (k >= 10 && nFeatures >= 10) true else false
         // Determine initial weights and corresponding Gaussians.
         // If the user supplied an initial GMM, we use those values, otherwise
         // we start with uniform weights, a random mean from the data, and
         // diagonal covariance matrices using component variances
    -    // derived from the samples    
    +    // derived from the samples.
         val (weights, gaussians) = initialModel match {
           case Some(gmm) => (gmm.weights, gmm.gaussians)
           
           case None => {
             val samples = breezeData.takeSample(withReplacement = true, k * nSamples, seed)
    -        (Array.fill(k)(1.0 / k), Array.tabulate(k) { i => 
    -          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    -          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    -        })
    +        val weights = Array.fill(k)(1.0 / k)
    +        val gaussians = {
    +          if (distributeGaussian) {
    --- End diff --
    
    Is the sense here wrong? distributeGaussian being true leads to the original tabulate initialization...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4654#discussion_r24848796
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala ---
    @@ -168,16 +182,26 @@ class GaussianMixture private (
           // Create new distributions based on the partial assignments
           // (often referred to as the "M" step in literature)
           val sumWeights = sums.weights.sum
    -      var i = 0
    -      while (i < k) {
    -        val mu = sums.means(i) / sums.weights(i)
    -        BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu),
    -          Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
    -        weights(i) = sums.weights(i) / sumWeights
    -        gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
    -        i = i + 1
    +      if (distributeGaussian) {
    +        sc.parallelize(0 until k).map{ i =>
    --- End diff --
    
    Do you think there is something wrong here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74749512
  
      [Test build #27639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27639/consoleFull) for   PR 4654 at commit [`e2fa453`](https://github.com/apache/spark/commit/e2fa4539b1edcd27935021ca0c51e8fbb6745e53).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by MechCoder <gi...@git.apache.org>.
Github user MechCoder commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76953094
  
    @jkbradley I wrote a script here (https://gist.github.com/MechCoder/5939294f74f105e5c499) to compare the timings in this branch and master, It seems to me that there is no significant improvements as @tgaloppo suggested.
    
    Would you be able to run this on a cluster and verify the same? After that I can close this.
    
    Thanks.
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76941161
  
      [Test build #28231 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28231/consoleFull) for   PR 4654 at commit [`54db317`](https://github.com/apache/spark/commit/54db31752d39b4345985e41abdcce26e5534a268).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-76874073
  
    @MechCoder  No problem about the delay; everyone is busy!  What you wrote for generating data is about as good as it gets AFAIK.  You could use fill instead of tabulate since you don't care about the index i.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74740668
  
      [Test build #27644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27644/consoleFull) for   PR 4654 at commit [`48f4b05`](https://github.com/apache/spark/commit/48f4b0598cbfbc43dc2e6402c68279d2cdb0c1ed).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5016] Distribute Gaussian Initializatio...

Posted by tgaloppo <gi...@git.apache.org>.
Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/4654#issuecomment-74737757
  
    Please be sure to test in cluster setting, not just on a multicore machine... I believe the computation/communication ratio is going to be too low to make this effective in most cluster environments given the limits of the algorithm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org