
Posted to reviews@spark.apache.org by tgaloppo <gi...@git.apache.org> on 2014/10/30 20:00:52 UTC

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

GitHub user tgaloppo opened a pull request:

    https://github.com/apache/spark/pull/3022

    SPARK-4156 [MLLIB] EM algorithm for GMMs

    Implementation of Expectation-Maximization for Gaussian Mixture Models.
    
    This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgaloppo/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3022
    
----
commit c15405c78345e9a46549a398c6b59bed80274f9e
Author: Travis Galoppo <tr...@localhost.localdomain>
Date:   2014-10-30T18:50:47Z

    SPARK-4156

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092953
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    --- End diff --
    
    `treeAggregate` may be better than `aggregate` here.
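
The difference between the two calls can be sketched as follows. This is a minimal illustration, not code from the pull request: the object name, the `Acc` type, and the sample data are hypothetical, and it assumes a Spark version in which `treeAggregate` is a method on `RDD` (1.3 and later, as far as I know; earlier versions exposed it through MLlib's `RDDFunctions`). Both calls take the same `seqOp` and `combOp`; `treeAggregate` adds a `depth` parameter and merges partial results on the executors before they reach the driver, which may matter when each partial result holds k dense d x d matrices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TreeAggregateSketch {
  // Hypothetical accumulator: (sum of values, count of values)
  type Acc = (Double, Long)

  val zero: Acc = (0.0, 0L)

  // (U, T) => U: fold one element into a partition-local accumulator
  def seqOp(acc: Acc, x: Double): Acc = (acc._1 + x, acc._2 + 1L)

  // (U, U) => U: merge two partial accumulators
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // aggregate: every partition's result is merged on the driver
  def flat(data: RDD[Double]): Acc = data.aggregate(zero)(seqOp, combOp)

  // treeAggregate: partial results are merged in a multi-level tree
  // (here depth = 2), so the driver receives fewer, pre-combined results
  def tree(data: RDD[Double]): Acc =
    data.treeAggregate(zero)(seqOp, combOp, depth = 2)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("tree-aggregate-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize((1 to 1000).map(_.toDouble), numSlices = 16)
      println(flat(data))
      println(tree(data))
    } finally {
      sc.stop()
    }
  }
}
```

Both functions should return the same value for the same input; only the communication pattern differs.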



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655817
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    --- End diff --
    
    Use "/** ... */" for this comment so that it becomes part of the generated documentation.
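
As a small illustration of the point above (the class name is hypothetical and not part of the pull request), Scaladoc only picks up comments written in the `/** ... */` form; `//` line comments are skipped by the documentation generator.

```scala
class DocCommentSketch private (private var k: Int) {
  // A plain line comment: not included in the generated API documentation
  def this() = this(2)

  /** A Scaladoc comment: included in the generated API documentation. */
  def getK: Int = k
}
```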



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67158536
  
    I've merged in the predict() method from @FlytxtRnD.
    I am working on the changeover from accumulators to RDD.aggregate and should have this up soon.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22018162
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    --- End diff --
    
    I know Breeze has a MultivariateGaussian, but using it requires commons-math, which does not appear to be packaged with Spark (my first pass at this algorithm used it and failed at run time due to the missing dependency).  It would be really cool if we could use that implementation (I'm guessing it would side-step the whole covariance matrix inversion issue, too).
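
One common way to evaluate a Gaussian density without forming an explicit inverse is to factor the covariance with a Cholesky decomposition and use a triangular solve. The sketch below is plain Scala with no Breeze or commons-math dependency; it is an illustration of that general technique only, and makes no claim about how Breeze's MultivariateGaussian or this pull request handles the problem. All names in it are hypothetical, and it assumes the covariance matrix is symmetric positive definite.

```scala
object GaussianLogPdfSketch {
  /** Lower-triangular Cholesky factor L of a symmetric positive-definite
   *  matrix, such that sigma = L * L^T. */
  def cholesky(sigma: Array[Array[Double]]): Array[Array[Double]] = {
    val d = sigma.length
    val l = Array.ofDim[Double](d, d)
    for (i <- 0 until d; j <- 0 to i) {
      var s = sigma(i)(j)
      var m = 0
      while (m < j) { s -= l(i)(m) * l(j)(m); m += 1 }
      if (i == j) {
        require(s > 0.0, "covariance matrix is not positive definite")
        l(i)(i) = math.sqrt(s)
      } else {
        l(i)(j) = s / l(j)(j)
      }
    }
    l
  }

  /** log N(x | mu, sigma), computed with a triangular solve rather than
   *  an explicit matrix inverse or determinant. */
  def logPdf(x: Array[Double], mu: Array[Double], sigma: Array[Array[Double]]): Double = {
    val d = x.length
    val l = cholesky(sigma)
    // Forward substitution: solve L * z = (x - mu)
    val z = new Array[Double](d)
    for (i <- 0 until d) {
      var s = x(i) - mu(i)
      var m = 0
      while (m < i) { s -= l(i)(m) * z(m); m += 1 }
      z(i) = s / l(i)(i)
    }
    // (x - mu)^T sigma^-1 (x - mu) equals the squared norm of z,
    // and log det(sigma) equals 2 * sum of log L(i)(i)
    val quad = z.map(v => v * v).sum
    val logDet = 2.0 * (0 until d).map(i => math.log(l(i)(i))).sum
    -0.5 * (d * math.log(2.0 * math.Pi) + logDet + quad)
  }
}
```

Working in log space also avoids underflow of the density for points far from the mean, although the decomposition still fails (here via `require`) when the covariance is singular.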



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-62386233
  
    Please advise how to resolve merge issues.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68401923
  
    @jkbradley Please assign 5017, 5018, 5019, and 5020 to me.  Regarding 5018, can you refer me to other PRs that are bringing in common distributions?  I can work toward formalizing an API to make all of them public.
    
    I also indicated that I would be happy to provide the Python wrappers for the algorithm (ticket 5012). @FlytxtRnD had provided an initial Python implementation of the algorithm; if they would like to provide the wrappers instead, that would be cool (but I am still happy to do it if not).
    
    CC: @mengxr 



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860601
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    +                                  vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                                  }.toArray
    +    
    +    val accW     = new Array[Accumulator[Double]](k)
    +    val accMu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val accSigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // reset accumulators
    +      for (i <- 0 until k) {
    +        accW(i)     = ctx.accumulator(0.0)
    +        accMu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        accSigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val logLikelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map{ i => 
    +                                  new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    indentation (as above)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21859898
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length;
    --- End diff --
    
    no semicolon



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67087866
  
    I just found out (hearsay) that Accumulator may incur a big performance penalty relative to methods like RDD.aggregate().  There have also been some bugs found with Accumulator in the past.  So it might be worth switching from using Accumulators to using aggregate() and other methods.  The aggregations are pretty simple, so it shouldn't make the code any longer.
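
A minimal sketch of the aggregation pattern suggested above, written against plain Scala collections so that it runs without a cluster. It mimics the `(zeroValue)(seqOp, combOp)` contract of `RDD.aggregate()`: each partition folds its points into a local result, and the local results are then merged. The names `AggregateSketch`, `Sums`, and `partitions` are hypothetical and do not come from the pull request.

```scala
// Plain-Scala stand-in for the (zeroValue)(seqOp, combOp) contract of RDD.aggregate.
// `partitions` is a hypothetical substitute for the partitions of an RDD.
object AggregateSketch {
  // Running sums for one pass over the data: (log-likelihood, point count)
  final case class Sums(logLikelihood: Double, count: Long)

  val zero: Sums = Sums(0.0, 0L)

  // seqOp: fold one per-point likelihood into a partition-local result
  def seqOp(acc: Sums, p: Double): Sums =
    Sums(acc.logLikelihood + math.log(p), acc.count + 1)

  // combOp: merge two partition-local results
  def combOp(a: Sums, b: Sums): Sums =
    Sums(a.logLikelihood + b.logLikelihood, a.count + b.count)

  // Fold each partition locally, then merge the partial results
  def aggregate(partitions: Seq[Seq[Double]]): Sums =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1.0, 1.0), Seq(1.0))
    println(aggregate(partitions))
  }
}
```

Unlike an accumulator, the result here is the return value of the aggregation itself, so nothing has to be reset between iterations.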



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67073269
  
    Oh, also, IntelliJ 13 does a pretty good job with the indentation, if you're using it.  You can run "sbt/sbt gen-idea" to create project files before opening the Spark project in IntelliJ.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67550199
  
    Sorry, I forgot to comment on this issue.  That would be fine with me.  The prediction methods were contributed by @FlytxtRnD, so perhaps we can solicit their opinion as well.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655806
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModel
    +import org.apache.spark.mllib.clustering.GMMExpectationMaximization
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if( args.length != 3 ) {
    +      println("usage: DenseGmmEM <input file> <k> <delta>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, tol: Double) {
    --- End diff --
    
    "tol" --> "delta" (or whatever delta is changed to; see other comment about delta)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655824
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    --- End diff --
    
    You can probably write ```x.sum / x.length```
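
A minimal sketch of the simplification suggested above, using `Array[Double]` as a stand-in for the Breeze vectors in the diff. `sum` on a Scala collection needs a `Numeric` instance for the element type, which may not be available for Breeze vectors, so this sketch assumes a `reduce` over element-wise addition instead. The names `VectorMeanSketch` and `vectorMean` are hypothetical.

```scala
object VectorMeanSketch {
  // Element-wise mean of equally sized vectors; assumes `xs` is non-empty.
  def vectorMean(xs: Seq[Array[Double]]): Array[Double] = {
    // Add the vectors pairwise, component by component
    val total = xs.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    // Dividing a Double by an Int widens the Int, so no explicit cast is needed
    total.map(_ / xs.length)
  }

  def main(args: Array[String]): Unit = {
    val mean = vectorMean(Seq(Array(1.0, 2.0), Array(3.0, 6.0)))
    println(mean.mkString(", "))
  }
}
```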



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083551
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <convergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    +      
    +    val clusters = new GaussianMixtureModelEM()
    +      .setK(k)
    +      .setConvergenceTol(convergenceTol)
    +      .run(data)
    +    
    +    for (i <- 0 until clusters.k) {
    +      println("weight=%f\nmu=%s\nsigma=\n%s\n" format 
    +        (clusters.weight(i), clusters.mu(i), clusters.sigma(i)))
    +    }
    +    
    +    println("Cluster labels:")
    +    val (responsibilityMatrix, clusterLabels) = clusters.predict(data)
    +    for (x <- clusterLabels.collect) {
    --- End diff --
    
    If people try this on a big dataset, it will crash when collect is called.  It might be good to use ```clusterLabels.take(100)``` instead, which returns at most 100 instances to the driver as a local array, and to change the println above accordingly.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68312524
  
    @tgaloppo  Thanks for the updates, and thanks for all of your work in getting this ready!
    
    LGTM
    
    CC: @mengxr 
    
    After this is merged, I'll make some JIRAs for the various items we've discussed along the way + a few more.  Let me know if I've missed anything here:
    * Add parameters: seed, maxIterations
    * Use sparse vectors more efficiently
    * If numFeatures or k are large, distribute matrix inverses for Gaussian initialization.
    * Breeze pinv fails when the matrix is singular: [https://github.com/scalanlp/breeze/issues/304]  Do SVD instead.
    * Make MultivariateGaussian public, and update GMM API
    * Check for NaNs:
     * in computeSoftAssignments (if all pdfs = 0)
     * in values when constructing a GMM
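
The NaN case in the last item can be illustrated with a small sketch. It assumes the soft assignments are computed by normalizing `weight(i) * pdf(i)` over the k components, as the E step in the quoted diffs does. If every pdf underflows to 0, the normalizer is 0 and each assignment becomes 0/0. Adding a small constant to every term, as the diffs do with `eps`, yields a uniform assignment instead. The names `SoftAssignmentSketch`, `softAssignments`, and `guard` are hypothetical.

```scala
object SoftAssignmentSketch {
  // Same constant as `eps` in the quoted diff: 2^-52
  val eps: Double = math.pow(2.0, -52)

  // Normalize guard + weight(i) * pdf(i) into assignments that sum to 1.
  // With guard = 0.0 and all pdfs equal to 0, every entry is 0.0 / 0.0 = NaN.
  def softAssignments(
      weights: Array[Double],
      pdfs: Array[Double],
      guard: Double): Array[Double] = {
    val p = weights.zip(pdfs).map { case (w, pdf) => guard + w * pdf }
    val pSum = p.sum
    p.map(_ / pSum)
  }

  def main(args: Array[String]): Unit = {
    val weights = Array(0.5, 0.5)
    val pdfs = Array(0.0, 0.0)
    println(softAssignments(weights, pdfs, 0.0).mkString(", "))
    println(softAssignments(weights, pdfs, eps).mkString(", "))
  }
}
```

An explicit `isNaN` check on the result would catch the unguarded case, at the cost of one extra pass over k values per point.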




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092957
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initial GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    +        weights(i) = wSums(i) / sumWeights
    +        gaussians(i) = new MultivariateGaussian(mu, sigma)
    +      }
    +   
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood(0) // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i).mu)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i).sigma)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: VectorArrayView): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    --- End diff --
    
    Use `toDouble` here instead of `asInstanceOf[Double]`.
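    
    A minimal sketch of this suggestion, using plain `Array[Double]` in place of Breeze vectors so that it runs without dependencies. The `ToDoubleSketch` object and its `vectorMean` are illustrative stand-ins, not the PR's code. `toDouble` is the idiomatic numeric conversion, whereas `asInstanceOf` is a type cast:
    
    ```scala
    object ToDoubleSketch {
      // Average of a non-empty sequence of equal-length vectors (plain arrays here).
      def vectorMean(xs: Seq[Array[Double]]): Array[Double] = {
        val sum = new Array[Double](xs.head.length)
        xs.foreach { x =>
          for (j <- sum.indices) sum(j) += x(j)
        }
        val n = xs.length.toDouble // explicit numeric conversion, not a cast
        sum.map(_ / n)
      }
    }
    ```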


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092924
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    --- End diff --
    
    organize imports into groups: 
    
    https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
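    
    As a sketch, and assuming the style guide's ordering of java, scala, third-party, then Spark imports, with groups separated by blank lines and entries sorted within each group, the imports in this file might be arranged as follows. These are the same imports as in the diff, only reordered; compiling them requires Breeze and Spark on the classpath:
    
    ```scala
    import scala.collection.mutable.IndexedSeqView
    
    import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    import breeze.linalg.Transpose
    
    import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    import org.apache.spark.rdd.RDD
    ```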


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092937
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    --- End diff --
    
    See my previous comments about `ExpectationSum`.


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67387864
  
    I have replaced the accumulators with RDD.aggregate functionality.
    I added functionality allowing the user to provide their own initial GMM, bypassing the random generation of an initial starting point.
    I added a second unit test with two univariate clusters.
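    
    For readers unfamiliar with the pattern, here is a rough sketch of the aggregate shape, written against plain Scala collections rather than Spark so that it runs standalone. The `AggregateSketch` object, its `Acc` type, and the `aggregate` helper are hypothetical illustrations; only the `(U, T) => U` and `(U, U) => U` roles mirror what `RDD.aggregate` expects:
    
    ```scala
    object AggregateSketch {
      // Hypothetical accumulator: index 0 holds the running sum, index 1 the count.
      type Acc = Array[Double]
    
      // A fresh zero per partition, because seqOp mutates its accumulator.
      def zero(): Acc = Array(0.0, 0.0)
    
      // (U, T) => U: fold one data point into the accumulator (mutates and returns acc).
      def seqOp(acc: Acc, x: Double): Acc = {
        acc(0) += x
        acc(1) += 1.0
        acc
      }
    
      // (U, U) => U: merge two partial accumulators (mutates and returns a).
      def combOp(a: Acc, b: Acc): Acc = {
        a(0) += b(0)
        a(1) += b(1)
        a
      }
    
      // Stand-in for RDD.aggregate: fold each "partition" from its own zero, then merge.
      def aggregate(partitions: Seq[Seq[Double]]): Acc =
        partitions.map(_.foldLeft(zero())(seqOp)).foldLeft(zero())(combOp)
    }
    ```
    
    Unlike accumulators updated inside `foreach`, the result here is returned directly from the aggregation, so nothing needs to be reset between iterations.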
    
    cc: @jkbradley 



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by FlytxtRnD <gi...@git.apache.org>.

Github user FlytxtRnD commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22163250
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,93 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrix, Vector}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where mu(i) is
    --- End diff --
    
    Typo: this should read "where weight(i) is", not "where mu(i) is".


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083578
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        ((0 until k).map(_ => 1.0 / k).toArray, (0 until k).map{ i => 
    --- End diff --
    
    This could be cleaner: Instead of
    ```
    (0 until k).map(_ => 1.0 / k).toArray
    ```
    you can write
    ```
    Array.fill[Double](k)(1.0 / k)
    ```
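    
    A small standalone check that the two forms produce the same uniform weights. The `UniformWeightsSketch` object and its method names are illustrative only:
    
    ```scala
    object UniformWeightsSketch {
      // Both expressions build k equal mixing weights.
      def viaMap(k: Int): Array[Double] = (0 until k).map(_ => 1.0 / k).toArray
      def viaFill(k: Int): Array[Double] = Array.fill[Double](k)(1.0 / k)
    }
    ```
    
    `Array.fill` also avoids building the intermediate sequence that `map` creates before `toArray`.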


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092926
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    --- End diff --
    
    minor: The class name contains `Model`, which is a little confusing: `GaussianMixtureModelEM` produces a `GaussianMixtureModel`. I don't have a good suggestion; maybe we could rename `GaussianMixtureModelEM` to `GaussianMixtureEM`.


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21859900
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    +                                  vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                                  }.toArray
    +    
    +    val accW     = new Array[Accumulator[Double]](k)
    +    val accMu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val accSigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // reset accumulators
    +      for (i <- 0 until k) {
    +        accW(i)     = ctx.accumulator(0.0)
    +        accMu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        accSigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val logLikelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map{ i => 
    +                                  new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +                                }.toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => gaussians(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map{ i => 
    +                  eps + weights.value(i) * dists.value(i).pdf(x)
    +                }.toArray
    +        
    +        val pSum = p.sum 
    +        
    +        logLikelihood += math.log(pSum)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for (i <- 0 until k) {
    +          p(i) /= pSum
    +          accW(i) += p(i)
    +          accMu(i) += x * p(i)
    +          accSigma(i) += xxt * p(i)
    +        }
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => accW(i).value).toArray
    +      val MU = (0 until k).map(i => accMu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => accSigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      gaussians = (0 until k).map{ i => {
    +            val weight = W(i) / W.sum
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }
    +        }.toArray
    +      
    +      llhp = llh; // current becomes previous
    --- End diff --
    
    no semicolon
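
The E and M steps in the diff above are easier to follow without the accumulator and broadcast plumbing. Below is a minimal sketch for a one-dimensional mixture in plain Scala, not the code from the pull request: `Em1D`, `Component`, and `step` are hypothetical names, and scalar variances stand in for the covariance matrices. The `eps` term mirrors the one in the diff and keeps `pSum` strictly positive.

```scala
// One EM iteration for a 1-D Gaussian mixture (hypothetical names, no Spark).
object Em1D {
  final case class Component(weight: Double, mean: Double, variance: Double)

  // Small constant added to each unnormalized probability, as in the diff.
  private val eps = math.pow(2.0, -52)

  // Density of N(mean, variance) at x.
  def pdf(x: Double, mean: Double, variance: Double): Double = {
    val diff = x - mean
    math.exp(-0.5 * diff * diff / variance) / math.sqrt(2.0 * math.Pi * variance)
  }

  // Returns the updated components and the log-likelihood under the old ones.
  def step(data: Seq[Double], comps: Seq[Component]): (Seq[Component], Double) = {
    val k = comps.length
    val sumW = Array.fill(k)(0.0)  // sum of responsibilities
    val sumX = Array.fill(k)(0.0)  // responsibility-weighted sum of x
    val sumXX = Array.fill(k)(0.0) // responsibility-weighted sum of x * x
    var logLikelihood = 0.0

    // E step: soft-assign each point to every component.
    for (x <- data) {
      val p = comps.map(c => eps + c.weight * pdf(x, c.mean, c.variance))
      val pSum = p.sum
      logLikelihood += math.log(pSum)
      for (i <- 0 until k) {
        val r = p(i) / pSum
        sumW(i) += r
        sumX(i) += r * x
        sumXX(i) += r * x * x
      }
    }

    // M step: re-estimate weight, mean, and variance from the sums.
    val total = sumW.sum
    val updated = (0 until k).map { i =>
      val mu = sumX(i) / sumW(i)
      Component(sumW(i) / total, mu, sumXX(i) / sumW(i) - mu * mu)
    }
    (updated, logLikelihood)
  }
}
```

Calling `step` repeatedly and stopping once the log-likelihood changes by less than a tolerance gives the same loop shape as the diff. The sketch does not guard against a component whose variance collapses to zero.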



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655815
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    --- End diff --
    
    I would recommend using "convergenceTol" since that is already used elsewhere (e.g., in LBFGS).
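
As an illustration of the rename, here is a minimal sketch of a settings class that uses `convergenceTol` in place of `delta`. The class name `EmSettings`, the getters, and `hasConverged` are hypothetical; the defaults (2, 0.01, 100) are the ones from the default constructor shown elsewhere in this thread.

```scala
// Hypothetical settings class; only the parameter name comes from the review.
class EmSettings private (
    private var k: Int,
    private var convergenceTol: Double,
    private var maxIterations: Int) {

  // Default instance: 2 Gaussians, 0.01 tolerance, 100 iterations.
  def this() = this(2, 0.01, 100)

  /** Set the number of Gaussians in the mixture. */
  def setK(k: Int): this.type = {
    this.k = k
    this
  }

  /** Set the largest change in log-likelihood that counts as converged. */
  def setConvergenceTol(convergenceTol: Double): this.type = {
    this.convergenceTol = convergenceTol
    this
  }

  /** Set the maximum number of iterations to run. */
  def setMaxIterations(maxIterations: Int): this.type = {
    this.maxIterations = maxIterations
    this
  }

  def getK: Int = k
  def getConvergenceTol: Double = convergenceTol
  def getMaxIterations: Int = maxIterations

  // True when the log-likelihood moved by less than the tolerance.
  def hasConverged(previous: Double, current: Double): Boolean =
    math.abs(current - previous) < convergenceTol
}
```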



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22058461
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    --- End diff --
    
    We only use Breeze internally right now; we don't want to expose it as a public API.  I really meant using the MultivariateGaussian class which you defined.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68398244
  
    @tgaloppo @FlytxtRnD  I made some JIRAs for the to-do items above.
    
    I'd say the most important are:
    * [Change predictMembership() to take an RDD, not the GMM.](https://issues.apache.org/jira/browse/SPARK-5020)
     * I did not notice that it took all of the GMM parameters.  It should be renamed and made internal, and a wrapper method predictMembership() should take an RDD only.
    * [Make MultivariateGaussian public](https://issues.apache.org/jira/browse/SPARK-5018)
    * [Update GMM API to use MultivariateGaussian instead of means, covariances](https://issues.apache.org/jira/browse/SPARK-5019)
    * (The Python API and user guide JIRAs from @mengxr should also be in this list.)
    
    It would be great to do:
    * [SVD for Gaussian initialization](https://issues.apache.org/jira/browse/SPARK-5017)
    
    Some less critical ones are:
    * [random seed](https://issues.apache.org/jira/browse/SPARK-5015)
    * [If numFeatures or k are large, distribute matrix inverses for Gaussian initialization.](https://issues.apache.org/jira/browse/SPARK-5016)
    * [Be faster for SparseVector inputs](https://issues.apache.org/jira/browse/SPARK-5021)
    
    I removed the NAN JIRAs, but we should investigate numerical stability at some point.
    
    Please let me know if you'd like any assigned to you, and thanks in advance for your work on this!  If I'm able to work on one of the JIRAs, I'll make a note on the JIRA page.
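
The first to-do item (an internal helper that takes every mixture parameter, wrapped by a public method that takes only the data) could look roughly like the sketch below. It is a one-dimensional stand-in under loud assumptions: `Seq[Double]` plays the role of the RDD, and `SoftAssigner` and `computeSoftAssignments` are hypothetical names, not the API that was merged.

```scala
// Hypothetical 1-D stand-in: Seq[Double] plays the role of the RDD.
class SoftAssigner(
    weights: Array[Double],
    means: Array[Double],
    variances: Array[Double]) {

  // Public entry point: callers pass only the data.
  def predictMembership(points: Seq[Double]): Seq[Array[Double]] =
    points.map(x => computeSoftAssignments(x, weights, means, variances))

  // Internal helper that takes every mixture parameter explicitly.
  private def computeSoftAssignments(
      x: Double,
      weights: Array[Double],
      means: Array[Double],
      variances: Array[Double]): Array[Double] = {
    // Unnormalized membership: weight(i) * N(x; mean(i), variance(i)).
    val p = weights.indices.map { i =>
      val diff = x - means(i)
      weights(i) * math.exp(-0.5 * diff * diff / variances(i)) /
        math.sqrt(2.0 * math.Pi * variances(i))
    }.toArray
    // Normalize so the memberships of one point sum to 1.
    val pSum = p.sum
    p.map(_ / pSum)
  }
}
```

A point far from every component makes `pSum` underflow to zero here, which ties in with the numerical-stability note above.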



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67076315
  
    No worries; it'll get there.  I appreciate the comments and pointers.
    
    
    > On Dec 15, 2014, at 4:52 PM, jkbradley <no...@github.com> wrote:
    > 
    > @tgaloppo Thanks for the updates! You did exactly what I had in mind for MultivariateGaussian; thanks.
    > 
    > My main comments now are still about style. I realize it's annoying to match a new style, but it is enforced pretty strictly with Spark to keep the codebase uniform. I'll add some comments about style in the body, but probably won't catch everything, so please check through and try to match. The Spark style guide has some examples, and it links to the much more extensive Scala style guide.
    > 
    > I'll wait for the predict() patch & additional tests.
    > 
    > I'll try to run some scaling tests myself and will put some results up here before long.
    > 
    > Thanks!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68475553
  
    @jkbradley Please assign me SPARK-5017, and I will take care of this in preparation for 5018 and 5019.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655818
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    --- End diff --
    
    spaces around "/" operator
    
    Also, it might be good to use a more descriptive variable name like "centers"
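
The initialization described in the diff's comments (uniform weights, a mean averaged over a slice of random samples, and variances derived from the same slice) can be sketched in one dimension as follows. This is an illustration with hypothetical names (`InitSketch`, `initialize`, `centers`), written with spaces around `/` and a descriptive variable name as suggested; scalar variances replace the diagonal covariance matrices.

```scala
// Hypothetical 1-D version of the initialization step (no Spark, no Breeze).
object InitSketch {
  // Mean of a non-empty group of samples.
  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Population variance of a non-empty group of samples.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  // One (weight, mean, variance) tuple per Gaussian, built from consecutive
  // slices of nSamples samples each.
  def initialize(
      samples: Seq[Double],
      k: Int,
      nSamples: Int): Array[(Double, Double, Double)] = {
    require(samples.length >= k * nSamples, "need k * nSamples samples")
    val centers = (0 until k).map { i =>
      val group = samples.slice(i * nSamples, (i + 1) * nSamples)
      (1.0 / k, mean(group), variance(group))
    }.toArray
    centers
  }
}
```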



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016176
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    --- End diff --
    
    Thinking more about this API, I now believe it would be better to have it store an array of weights + an array of MultivariateGaussian instances.  That would require making the MultivariateGaussian API public.
    
    I'll check some other libraries to get a sense of what their MultivariateGaussian APIs look like.  If you're interested, I can let you know what I find so we can make this API change in this PR.  However, if you prefer, I'd be happy to send a follow-up PR which makes this change.  What do you prefer?
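
To make the proposed shape concrete, here is a minimal sketch of a model that stores an array of weights alongside an array of Gaussian instances. It is one-dimensional and self-contained: `Gaussian1D` is a hypothetical stand-in for MultivariateGaussian, and `MixtureModel`, `k`, and `density` are illustrative names, not the final API.

```scala
// Hypothetical 1-D sketch of "weights + array of Gaussians".
object MixtureApiSketch {
  // Stand-in for MultivariateGaussian: mean and variance of one component.
  final case class Gaussian1D(mu: Double, sigma2: Double) {
    require(sigma2 > 0.0, "variance must be positive")

    // Density of N(mu, sigma2) at x.
    def pdf(x: Double): Double = {
      val diff = x - mu
      math.exp(-0.5 * diff * diff / sigma2) / math.sqrt(2.0 * math.Pi * sigma2)
    }
  }

  class MixtureModel(val weights: Array[Double], val gaussians: Array[Gaussian1D]) {
    require(weights.length == gaussians.length, "one weight per Gaussian")

    // Number of Gaussians in the mixture.
    def k: Int = weights.length

    // Mixture density: weighted sum of the component densities.
    def density(x: Double): Double =
      weights.zip(gaussians).map { case (w, g) => w * g.pdf(x) }.sum
  }
}
```

Keeping each mean and covariance together in one object means the model cannot hold a mean array and a covariance array of different lengths; the only remaining length check is between weights and Gaussians.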



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22185023
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    --- End diff --
    
    +1 for renaming.  No great solution here, but GaussianMixtureEM sounds OK to me.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092962
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.impl
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, pinv}
    --- End diff --
    
    Group the imports into a single closure.
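
Assuming the comment means merging the two `breeze.linalg` imports into one brace-enclosed clause, the result would be `import breeze.linalg.{DenseMatrix => BreezeMatrix, DenseVector => BreezeVector, Transpose, det, pinv}`. The sketch below shows the same pattern with standard-library names so that it runs without Breeze; `ImportSketch` and `wordStats` are hypothetical.

```scala
// Stand-in for the Breeze case: two clauses from one package, such as
//   import scala.collection.mutable.{ArrayBuffer => Buf}
//   import scala.collection.mutable.{HashMap, HashSet}
// merged into a single clause, keeping the rename.
import scala.collection.mutable.{ArrayBuffer => Buf, HashMap, HashSet}

object ImportSketch {
  // Returns (total words, distinct words, highest count of any word).
  def wordStats(words: Seq[String]): (Int, Int, Int) = {
    val all = Buf.empty[String]
    val counts = HashMap.empty[String, Int]
    val seen = HashSet.empty[String]
    for (w <- words) {
      all += w
      counts(w) = counts.getOrElse(w, 0) + 1
      seen += w
    }
    val maxCount = if (counts.isEmpty) 0 else counts.values.max
    (all.length, seen.size, maxCount)
  }
}
```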




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092908
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <convergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  private def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    --- End diff --
    
    Consider a more descriptive app name here, e.g. "Gaussian Mixture Model EM example".



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092920
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where mu(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance maxtrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    +      points: RDD[Vector], 
    +      mu: Array[Vector], 
    +      sigma: Array[Matrix],
    +      weight: Array[Double], k: Int): RDD[Array[Double]] = {
    --- End diff --
    
    move `k: Int) ...` to a new line
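
The layout being asked for can be sketched as follows. This is only an illustration: `SignatureLayout` is a hypothetical object, plain Scala collections stand in for `RDD`, `Vector` and `Matrix`, and the body is a placeholder rather than the PR's logic.

```scala
// Hypothetical stand-in for the method under review; plain collections replace
// RDD, Vector and Matrix so the snippet compiles without Spark or Breeze.
object SignatureLayout {
  def predictMembership(
      points: Seq[Array[Double]],
      mu: Array[Array[Double]],
      sigma: Array[Array[Array[Double]]],
      weight: Array[Double],
      k: Int): Seq[Array[Double]] = {
    // Placeholder body (not the real E-step): each point simply receives
    // the first k mixing weights as its membership vector.
    points.map(_ => weight.take(k))
  }
}
```

Each parameter sits on its own line with a four-space continuation indent, and the return type follows the last parameter.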



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655833
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    +    (0 until x.length).map(i => (x(i) - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => result(i,i) = ss(i) / x.length)
    +    result
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Vectors */
    +  private object DenseDoubleVectorAccumulatorParam extends AccumulatorParam[DenseDoubleVector] {
    +    def zero(initialVector : DenseDoubleVector) : DenseDoubleVector = {
    +      BreezeVector.zeros[Double](initialVector.length)
    +    }
    +    
    +    def addInPlace(a : DenseDoubleVector, b : DenseDoubleVector) : DenseDoubleVector = {
    +      a += b
    +    }
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Matrices */
    +  private object DenseDoubleMatrixAccumulatorParam extends AccumulatorParam[DenseDoubleMatrix] {
    +    def zero(initialVector : DenseDoubleMatrix) : DenseDoubleMatrix = {
    +      BreezeMatrix.zeros[Double](initialVector.rows, initialVector.cols)
    +    }
    +    
    +    def addInPlace(a : DenseDoubleMatrix, b : DenseDoubleMatrix) : DenseDoubleMatrix = {
    +      a += b
    +    }
    +  }  
    +  
    +  /** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +  private class MultivariateGaussian(val mu : DenseDoubleVector, val sigma : DenseDoubleMatrix) 
    +      extends Serializable {
    +    private val sigma_inv_2 = inv(sigma) * -0.5
    +    private val U = math.pow(2.0*math.Pi, -mu.length/2.0) * math.pow(det(sigma), -0.5)
    --- End diff --
    
    put spaces around operators "*" and "/"
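
A minimal sketch of the same expression with the requested spacing. `GaussianNormalizer`, `d` and `detSigma` are hypothetical names standing in for the private class, `mu.length` and `det(sigma)` in the diff, so the snippet does not need Breeze.

```scala
object GaussianNormalizer {
  // Normalization constant (2 * pi)^(-d / 2) * det(sigma)^(-1 / 2) of a
  // d-dimensional Gaussian, written with spaces around "*" and "/".
  // `d` and `detSigma` are assumed inputs, not names from the PR.
  def normalizer(d: Int, detSigma: Double): Double =
    math.pow(2.0 * math.Pi, -d / 2.0) * math.pow(detSigma, -0.5)
}
```

Only whitespace changes; the arithmetic is the same as in the line under review.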



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-61361369
  
    Jenkins, test this please.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67549445
  
    @tgaloppo  Thanks for the updates! It looks quite good to me.  My main remaining question is: What do you think about having predict() return the cluster centers (following KMeansModel), rather than the pair of RDDs?  Users can still access the membership Vector using predictMembership().
    
    I'm running a few tests now.
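
One possible reading of this suggestion is that predict() would return a single hard assignment per point, with the soft memberships left to predictMembership(). A rough sketch under that assumption, with `Seq` standing in for `RDD` and `HardAssignment` a hypothetical name:

```scala
object HardAssignment {
  // Index of the component with the largest membership value for one point;
  // same rule as `r.indexOf(r.max)` in the diff.
  def label(membership: Array[Double]): Int =
    membership.indexOf(membership.max)

  // Seq is used instead of RDD purely to keep the sketch Spark-free.
  def predict(memberships: Seq[Array[Double]]): Seq[Int] =
    memberships.map(label)
}
```

With this shape the caller gets one value per point, and anyone who needs the full responsibility vectors asks for them explicitly.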



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22067710
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where mu(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance maxtrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    +      points: RDD[Vector], 
    +      mu: Array[Vector], 
    +      sigma: Array[Matrix],
    +      weight: Array[Double], k: Int): RDD[Array[Double]] = {
    +    val sc = points.sparkContext
    +    val dists = sc.broadcast{
    +      (0 until k).map{ i => 
    +        new MultivariateGaussian(mu(i).toBreeze.toDenseVector, sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +    }
    +    val weights = sc.broadcast((0 until k).map(i => weight(i)).toArray)
    --- End diff --
    
    "weight" is already an array, so you can just call sc.broadcast(weight)

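The point can be checked without Spark: the expression in the diff only builds an element-by-element copy of `weight`. In this sketch `BroadcastCopy` and the sample values are hypothetical, and `sc.broadcast` itself is not called.

```scala
object BroadcastCopy {
  // Sample mixing weights (made-up values for illustration).
  val weight: Array[Double] = Array(0.2, 0.3, 0.5)
  val k: Int = weight.length

  // What the diff constructs before broadcasting: a copy of `weight`.
  val copied: Array[Double] = (0 until k).map(i => weight(i)).toArray

  // The copy has the same contents, so passing `weight` directly is equivalent.
  val same: Boolean = copied.sameElements(weight)
}
```

Since the contents match, the intermediate `map(...).toArray` adds an allocation without changing what gets broadcast.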


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    --- End diff --
    
    In Scala, a `for` loop over a range desugars to a `foreach` call with a closure, which is generally slower than an equivalent `while` loop.  Since this is an inner loop, it may be worthwhile to change it to a `while` loop.
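
A minimal sketch of the while-loop form for an in-place element-wise sum. `InPlaceAdd` and its plain `Array[Double]` arguments are hypothetical simplifications of the `ExpectationSum` fields in the diff.

```scala
object InPlaceAdd {
  // Adds b into a element by element and returns a (which is mutated).
  def addInPlace(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length, "arrays must have the same length")
    var i = 0
    while (i < a.length) {
      a(i) += b(i)
      i += 1
    }
    a
  }
}
```

The explicit index avoids the closure that `for (i <- 0 until n)` creates; whether the difference is measurable here would need to be benchmarked.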



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016173
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    +      
    +    val clusters = new GaussianMixtureModelEM()
    +      .setK(k)
    +      .setConvergenceTol(convergenceTol)
    +      .run(data)
    +    
    +    for (i <- 0 until clusters.k) {
    +      println("weight=%f mu=%s sigma=\n%s\n" format 
    --- End diff --
    
    Consider separating the fields with newlines instead of spaces; the output looks odd otherwise.
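
As a sketch of the newline-separated form, assuming the three fields are already rendered as strings. `ComponentFormat` and `describe` are hypothetical helpers, not part of the example app.

```scala
object ComponentFormat {
  // Puts weight, mean and covariance on separate lines; the covariance
  // matrix starts on its own line so its rows line up.
  def describe(weight: Double, mu: String, sigma: String): String =
    "weight=%f\nmu=%s\nsigma=\n%s\n".format(weight, mu, sigma)
}
```

Calling `println(ComponentFormat.describe(w, muText, sigmaText))` for each component would then print one field per line.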



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21695326
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    --- End diff --
    
    Oh, OK, good to know!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-66563419
  
    @tgaloppo Let me know if you have questions, and also when I should make another pass over this PR. Thanks again!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655828
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    --- End diff --
    
    Rename "result" to "cov" (or something more descriptive).



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-63405397
  
    Merged with the latest master branch to hopefully fix any merge issues.
    Updated the Scala test suite to use the new MLlibSparkTestContext.
    Improved the cluster initialization strategy to average several samples per cluster.
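
The initialization strategy mentioned in this comment (average several samples per cluster) can be sketched roughly as follows. This is a minimal illustration in plain Scala, not the code from the pull request: the object `GmmInitSketch` and its helpers `mean`, `diagVariance`, and `initialize` are hypothetical names, plain arrays stand in for Breeze vectors, and only the diagonal of each initial covariance matrix is returned. It assumes the caller has already drawn `k * nSamples` points from the data.

```scala
object GmmInitSketch {
  // Hypothetical helper: element-wise mean of a group of points.
  def mean(points: Seq[Array[Double]]): Array[Double] = {
    val d = points.head.length
    val acc = new Array[Double](d)
    for (p <- points; j <- 0 until d) acc(j) += p(j)
    acc.map(_ / points.length)
  }

  // Hypothetical helper: element-wise biased variance (divides by n).
  // These values would form the diagonal of the initial covariance matrix.
  def diagVariance(points: Seq[Array[Double]]): Array[Double] = {
    val mu = mean(points)
    val ss = new Array[Double](mu.length)
    for (p <- points; j <- mu.indices) {
      val diff = p(j) - mu(j)
      ss(j) += diff * diff
    }
    ss.map(_ / points.length)
  }

  // Split k * nSamples drawn points into k consecutive groups and build one
  // (weight, mean, diagonal variances) triple per group, with uniform weights.
  def initialize(
      samples: Seq[Array[Double]],
      k: Int,
      nSamples: Int): Vector[(Double, Array[Double], Array[Double])] = {
    require(samples.length == k * nSamples, "expected k * nSamples points")
    samples.grouped(nSamples).map { group =>
      (1.0 / k, mean(group), diagVariance(group))
    }.toVector
  }
}
```

Averaging a handful of points per cluster, rather than picking a single point, makes the starting means less sensitive to one outlying sample; the group size (5 in the diff under review) is a tuning choice.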




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092959
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    +        weights(i) = wSums(i) / sumWeights
    +        gaussians(i) = new MultivariateGaussian(mu, sigma)
    +      }
    +   
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood(0) // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i).mu)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i).sigma)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: VectorArrayView): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def initCovariance(x: VectorArrayView): DenseDoubleMatrix = {
    +    val mu = vectorMean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val cov = BreezeMatrix.eye[Double](ss.length)
    +    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
    --- End diff --
    
    breeze has `squaredDistance`.
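
For context on the suggestion above, here is a minimal sketch in plain Scala of how the two quantities relate. Plain arrays stand in for Breeze vectors, and `VarianceSketch` and its methods are hypothetical names that are not part of the PR. The local `squaredDistance` assumes the Breeze function of the same name returns the scalar squared Euclidean distance. Under that assumption, summing it over the samples gives the sum of the per-component squared deviations, not the individual diagonal entries.

```scala
// Dependency-free sketch; arrays stand in for Breeze dense vectors (assumption).
object VarianceSketch {
  /** Element-wise mean of equally sized vectors. */
  def mean(xs: Seq[Array[Double]]): Array[Double] = {
    val d = xs.head.length
    val acc = new Array[Double](d)
    for (x <- xs; j <- 0 until d) acc(j) += x(j)
    acc.map(_ / xs.length)
  }

  /** Biased per-component variance, i.e. the diagonal of the initial covariance. */
  def diagVariance(xs: Seq[Array[Double]]): Array[Double] = {
    val mu = mean(xs)
    val ss = new Array[Double](mu.length)
    for (x <- xs; j <- mu.indices) {
      val diff = x(j) - mu(j)
      ss(j) += diff * diff
    }
    ss.map(_ / xs.length)
  }

  /** Scalar squared Euclidean distance: sum over components of (a - b)^2. */
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(j => (a(j) - b(j)) * (a(j) - b(j))).sum
}
```

Whether the scalar form can replace the element-wise computation depends on whether the initial covariance needs a separate variance per component, as the diff's doc comment describes.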



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655820
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    --- End diff --
    
    Scala style:
    ```
    for (i <- 0 until k) {
    ```



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016180
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    --- End diff --
    
    The documentation should tell users to set K *before* setting initialGmm.
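
The ordering concern can be illustrated with a small sketch. `Gmm` and `EmSketch` are hypothetical stand-ins, not the PR's classes. It assumes the setter checks the supplied model's cluster count against the current `k`, which defaults to 2 in the constructor shown in the diff.

```scala
// Hypothetical stand-ins; only the k check models the setter under review.
final case class Gmm(k: Int)

class EmSketch {
  private var k: Int = 2                       // default, as in `this(2, 0.01, 100)`
  private var initialGmm: Option[Gmm] = None

  def setK(k: Int): this.type = { this.k = k; this }

  // Rejects a model whose cluster count differs from the current k.
  // `require` throws IllegalArgumentException on failure.
  def setInitialGmm(gmm: Gmm): this.type = {
    require(gmm.k == k, s"mismatched cluster count: gmm.k=${gmm.k}, k=$k")
    initialGmm = Some(gmm)
    this
  }

  def getInitialGmm: Option[Gmm] = initialGmm
}
```

With this shape, `new EmSketch().setK(3).setInitialGmm(Gmm(3))` succeeds, while `new EmSketch().setInitialGmm(Gmm(3)).setK(3)` fails at the first call because `k` is still 2 when the check runs.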



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by squito <gi...@git.apache.org>.

Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r20063810
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,246 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as some random
    +    // point from the data.  (This could be improved)
    +    val samples = breezeData.takeSample(true, k, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // identity matrices for covariance 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  samples(i), 
    +                                  BreezeMatrix.eye[Double](d))).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +      
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += x * new Transpose(x) * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    x.foreach(u => s += u)
    --- End diff --
    
    You might not care about this at all, but calling `foreach` on an `Array` is notably slower than using a while loop over the indices.  `foreach` over a `Range` is pretty close to a while loop (i.e. `(0 until x.length).foreach { idx => s += x(idx) }`).  Or, if you don't care about runtime, you can always just call `array.sum` (it comes from an implicit conversion to `WrappedArray`):
    
    ```
    scala> ((0 to 100).map{_ / 100.0}.toArray).sum
    res2: Double = 50.5
    ```
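
The three summation variants mentioned above, side by side as a minimal sketch. `SumSketch` and its method names are hypothetical. All three return the same value; how their speed compares depends on the Scala version and the JVM, so it is worth measuring before choosing one.

```scala
object SumSketch {
  /** Plain index-based while loop. */
  def sumWhile(x: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < x.length) {
      s += x(i)
      i += 1
    }
    s
  }

  /** foreach over a Range of indices rather than over the array itself. */
  def sumRange(x: Array[Double]): Double = {
    var s = 0.0
    (0 until x.length).foreach { idx => s += x(idx) }
    s
  }

  /** The sum method the collections library makes available on arrays. */
  def sumBuiltin(x: Array[Double]): Double = x.sum
}
```

If the built-in is fast enough, the private `sum` helper in the diff could be dropped in favour of `p.sum` and `W.sum`.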



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22058276
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    I don't think the pseudoinverse is quite what we need.  Thinking more about it, I feel like some smoothing might be best.  It will not bias the estimate too much, it will make the algorithm more robust, and it should be simple to add.  What do you think?
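
One possible reading of "smoothing" in the comment above is to add a small constant to the diagonal of each estimated covariance matrix, so that it stays invertible even when a component collapses onto very few points. The following is only a sketch under that assumption: `CovarianceSmoothing`, `smoothCovariance`, `det2` and `lambda` are hypothetical names that do not appear in the PR, and plain nested arrays stand in for the Breeze matrices used there.

```scala
object CovarianceSmoothing {
  /** Returns a copy of `sigma` with `lambda` added to every diagonal entry.
    * Hypothetical helper, not part of the PR. */
  def smoothCovariance(sigma: Array[Array[Double]], lambda: Double): Array[Array[Double]] = {
    require(sigma.forall(_.length == sigma.length), "sigma must be square")
    Array.tabulate(sigma.length, sigma.length) { (i, j) =>
      if (i == j) sigma(i)(j) + lambda else sigma(i)(j)
    }
  }

  /** Determinant of a 2x2 matrix, used only to show the effect of smoothing. */
  def det2(m: Array[Array[Double]]): Double =
    m(0)(0) * m(1)(1) - m(0)(1) * m(1)(0)
}
```

With a singular matrix such as `[[1, 2], [2, 4]]`, the smoothed copy has a strictly positive determinant, which is the robustness property the comment asks for. How large `lambda` should be is a tuning question the thread does not settle.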



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68313864
  
    @jkbradley   Thank you for your help and feedback along the way.  Please assign some (or all) of those tickets to me and I will continue to improve the implementation.  In particular, you mentioned that there are a number of PRs with code for common distributions; I would be happy to help formalize a common interface and make these a public part of the library.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67583648
  
    OK, I believe those are my last comments!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67098327
  
    Thanks for the style updates!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655822
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    --- End diff --
    
    You should be able to write ```x.sum```
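
For an `Array[Double]`, the standard library's `sum` can replace the hand-written helper because an implicit `Numeric[Double]` is always available. The sketch below compares the two; `SumExample`, `manualSum` and `librarySum` are illustrative names, not code from the PR.

```scala
object SumExample {
  // Hand-written loop, equivalent in spirit to the private `sum` helper in the diff.
  def manualSum(x: Array[Double]): Double = {
    var s = 0.0
    x.foreach(v => s += v)
    s
  }

  // Standard-library version: resolves the implicit Numeric[Double].
  def librarySum(x: Array[Double]): Double = x.sum
}
```

Both versions return `0.0` for an empty array, so swapping one for the other should not change behavior in the E and M steps.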



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21683030
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    --- End diff --
    
    This does not work; the compiler cannot find a suitable implicit conversion for the array of vectors when attempting ```x.sum```
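
The likely cause is that `sum` on a Scala collection requires an implicit `Numeric` instance for the element type, and none exists for a vector type. `reduce` only needs a binary operation, so it works with the vector's own `+`. A minimal sketch, assuming a hypothetical `Vec` class as a stand-in for the Breeze dense vector used in the PR (`VectorMeanExample` and `vecMean` are likewise illustrative names):

```scala
object VectorMeanExample {
  // Hypothetical stand-in for a dense vector: it has `+` and `/` but no Numeric instance.
  final case class Vec(values: Array[Double]) {
    def +(other: Vec): Vec = {
      require(values.length == other.values.length, "dimension mismatch")
      Vec(values.zip(other.values).map { case (a, b) => a + b })
    }
    def /(scalar: Double): Vec = Vec(values.map(_ / scalar))
  }

  // `xs.sum` would not compile here because there is no implicit Numeric[Vec].
  // `reduce` combines the elements pairwise with `+` instead.
  def vecMean(xs: Array[Vec]): Vec = {
    require(xs.nonEmpty, "need at least one vector")
    xs.reduce(_ + _) / xs.length.toDouble
  }
}
```

So the `x.sum` suggestion applies cleanly to the `Array[Double]` helper, while the vector mean needs `reduce` (or a fold with an explicit zero vector) unless a `Numeric` instance is supplied.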



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655814
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    --- End diff --
    
    It might be good to rename this class to "GaussianMixtureModelEM" so that its name is closer to the name of the model it produces.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092921
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    +      points: RDD[Vector], 
    +      mu: Array[Vector], 
    +      sigma: Array[Matrix],
    +      weight: Array[Double], k: Int): RDD[Array[Double]] = {
    +    val sc = points.sparkContext
    +    val dists = sc.broadcast{
    --- End diff --
    
    The downside is that every call to `predictMembership` re-broadcasts `mu`, `sigma`, and `weight`.  Since 1.1 the RDD closure is already broadcast, so the explicit broadcast is unnecessary unless the broadcast objects can be reused.
    
    Btw, if we add a method to `GaussianMixtureModel` that predicts a single instance, the RDD version could be implemented by broadcasting the entire model, with about the same performance.
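
The single-instance prediction suggested above might look roughly like the following sketch. It is not code from the PR: `ToyMixture`, `predictSoft`, and the univariate density are hypothetical simplifications (the PR's model is multivariate and relies on `MultivariateGaussian`), chosen so the example runs without Spark or Breeze.

```scala
// Hypothetical stand-in for a mixture model; univariate for brevity.
final case class ToyMixture(
    weight: Array[Double],
    mu: Array[Double],
    variance: Array[Double]) {

  require(weight.length == mu.length && mu.length == variance.length,
    "weight, mu and variance must have one entry per component")

  // Density of a univariate Gaussian with mean m and variance v at x.
  private def pdf(x: Double, m: Double, v: Double): Double =
    math.exp(-0.5 * (x - m) * (x - m) / v) / math.sqrt(2.0 * math.Pi * v)

  /** Soft assignment: weight(i) * pdf_i(x), normalized over all components. */
  def predictSoft(x: Double): Array[Double] = {
    val p = weight.indices.map(i => weight(i) * pdf(x, mu(i), variance(i))).toArray
    val total = p.sum
    p.map(_ / total)
  }

  /** Hard assignment: index of the component with the largest responsibility. */
  def predict(x: Double): Int = {
    val r = predictSoft(x)
    r.indexOf(r.max)
  }
}
```

With a per-point method of this shape, the RDD-level prediction could in principle reduce to mapping it over the points, leaving Spark to ship the model with the task closure. The sketch does not guard against every density underflowing to zero, which a real implementation would need to handle.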



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092935
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    --- End diff --
    
    `Array.fill(k)(BDV.zeros[Double](d))`
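
As a rough illustration of why the two forms are interchangeable, the sketch below compares them using plain arrays as a stand-in for Breeze dense vectors (`BDV` in the suggestion is presumably an alias for Breeze's `DenseVector`, which the diff imports as `BreezeVector`). `ZeroBuffers`, `viaRange`, and `viaFill` are hypothetical names used only here.

```scala
object ZeroBuffers {
  // The form used in the diff: map over a range, then convert to an array.
  def viaRange(k: Int, d: Int): Array[Array[Double]] =
    (0 until k).map(_ => new Array[Double](d)).toArray

  // Array.fill takes its element by name, so the expression is
  // re-evaluated for every slot and each slot gets its own buffer.
  def viaFill(k: Int, d: Int): Array[Array[Double]] =
    Array.fill(k)(new Array[Double](d))
}
```

Because the element expression is evaluated once per slot, mutating one accumulator in place does not affect the others, which matters for the in-place `+=` updates in the aggregation code.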



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-66563244
  
    @tgaloppo  Thanks very much for the PR, and sincere apologies for the slow response about it!  @manishamde was right about people being too preoccupied with the 1.2 release.  It will be great to get GMMs into MLlib though!
    
    I’ve added some inline comments, and have put a few general comments below.
    
    We’re moving away from some of the old API conventions, and it would be nice to try to fit this to match newer APIs (especially the experimental spark.ml branch).  In particular, the static train() methods are part of the old API: It’s becoming hard to maintain train() methods with explicit lists of parameters (as we add more parameters to existing algorithms).  We’d prefer to stick with the builder pattern you have implemented in GMMExpectationMaximization, where you can call setter methods (setK, etc.) to set parameters before calling run().
      * I would recommend eliminating object GMMExpectationMaximization and keeping the class API basically as is.
      * Could you please add getter methods (getK, etc.) to the class GMMExpectationMaximization?
    
    Tests: I’d recommend adding more tests.  It’s good to test a single cluster, as you did.  It might also be good to test multiple clusters, where you take precautions to make sure the test will almost certainly succeed (e.g., use pre-selected random seeds or enough trials).  I’ll think more about possible tests.
    
    Scaling: It will be important to get a sense of how efficient/scalable the implementation is.  Would you be able to run tests on a small cluster?  If not, the community might be able to help.
    
    Side note: I noticed this is a PR from your master branch.  It’s generally easier to create a separate branch for each PR you plan to contribute.
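
The setter/getter style requested above can be sketched as follows. This is a minimal illustration rather than the PR's class: `EmParams` is a hypothetical name, the defaults (2 Gaussians, 100 iterations) follow the defaults documented in the diff, and the `require` checks are an added assumption.

```scala
// Hypothetical parameter holder illustrating chained setters with getters.
class EmParams {
  private var k: Int = 2
  private var maxIterations: Int = 100

  /** Set the number of Gaussians; returns this.type so calls can be chained. */
  def setK(k: Int): this.type = {
    require(k > 0, "k must be positive")
    this.k = k
    this
  }

  /** Return the number of Gaussians. */
  def getK: Int = k

  /** Set the maximum number of iterations. */
  def setMaxIterations(maxIterations: Int): this.type = {
    require(maxIterations >= 0, "maxIterations must be non-negative")
    this.maxIterations = maxIterations
    this
  }

  /** Return the maximum number of iterations. */
  def getMaxIterations: Int = maxIterations
}
```

A caller would write something like `new EmParams().setK(3).setMaxIterations(50)`. Adding a parameter later means adding one setter and one getter, without multiplying overloaded `train()` signatures.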




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655807
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModel
    +import org.apache.spark.mllib.clustering.GMMExpectationMaximization
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if( args.length != 3 ) {
    +      println("usage: DenseGmmEM <input file> <k> <delta>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, tol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map(line =>
    --- End diff --
    
    Scala style: For multi-line map() calls, use braces:
    ```
        val data = ctx.textFile(inputFile).map { line =>
          Vectors.dense(line.trim.split(' ').map(_.toDouble))
        }.cache()
    ```
    This occurs in several other places in this PR.  Could you please fix those too?



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67586420
  
    Great! I've pushed the requested changes.  I will open a ticket on Jira about making the MultivariateGaussian more widely applicable.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-66636308
  
    @jkbradley Thank you for your comments.  I am working to resolve these issues and will push these changes in a day or two.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092960
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    +        weights(i) = wSums(i) / sumWeights
    +        gaussians(i) = new MultivariateGaussian(mu, sigma)
    +      }
    +   
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood(0) // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i).mu)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i).sigma)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: VectorArrayView): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def initCovariance(x: VectorArrayView): DenseDoubleMatrix = {
    +    val mu = vectorMean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val cov = BreezeMatrix.eye[Double](ss.length)
    +    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => cov(i,i) = ss(i) / x.length)
    --- End diff --
    
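    Breeze has `diag`, which builds a diagonal matrix directly from a vector; it could replace the element-by-element loop here.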



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68414406
  
    @tgaloppo  It's ideal if we assign & fix one JIRA at a time (as separate PRs).  Can I start by assigning one of your choosing?
    
    For 5018, there is only [one other such PR](https://github.com/apache/spark/pull/1269) I know of, and it uses a Dirichlet distribution.  But for API examples, I would recommend checking out popular libraries, such as R, Matlab, numpy, etc.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092955
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    --- End diff --
    
    Please use `BLAS.dsyr` or leave a TODO note.
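
For readers unfamiliar with the routine named above: in standard BLAS, `dsyr` performs the symmetric rank-1 update A := alpha * x * x^T + A, touching only one triangle of A. The sketch below shows that operation in plain Scala. It is an illustration under assumptions, not the code that was merged: the object name `SymRank1`, the method name `syrUpper`, and the column-major `Array[Double]` layout are all hypothetical.

```scala
// Hypothetical helper illustrating what a dsyr-style update computes.
object SymRank1 {
  /**
   * In-place symmetric rank-1 update a := alpha * x * x^T + a.
   * `a` holds an n x n matrix in column-major order; only the upper
   * triangle (row <= column) is read or written.
   */
  def syrUpper(alpha: Double, x: Array[Double], a: Array[Double]): Unit = {
    val n = x.length
    require(a.length == n * n, "a must hold an n x n matrix")
    var j = 0
    while (j < n) {
      val scaledXj = alpha * x(j)
      var i = 0
      while (i <= j) {
        a(i + j * n) += x(i) * scaledXj
        i += 1
      }
      j += 1
    }
  }
}
```

Under this reading, the covariance line in the diff would first scale the accumulated matrix by `1 / wSums(i)` and then apply the update with `alpha = -1.0` and `x = mu`, which avoids materializing the dense outer product `mu * mu^T`.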



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68476266
  
    Done :)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655813
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    --- End diff --
    
    It would be great if you could add a sentence or two explaining what GMMs are.
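
As a point of reference for the explanation requested above, the usual definition of a GMM density can be written as follows. The symbols follow the doc comments quoted elsewhere in this thread (k components, weights w, means mu, covariances sigma); the exact wording adopted in the patch may differ.

```latex
% Density of a mixture of k multivariate Gaussians
p(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
\qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1

% Log-likelihood of samples x_1, \dots, x_n, the quantity EM increases
\ell = \sum_{j=1}^{n} \log \sum_{i=1}^{k} w_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)
```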



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67072947
  
    @tgaloppo  Thanks for the updates!  You did exactly what I had in mind for MultivariateGaussian; thanks.
    
    My main comments now are still about style.  I realize it's annoying to match a new style, but it is enforced pretty strictly with Spark to keep the codebase uniform.  I'll add some comments about style in the body, but probably won't catch everything, so please check through and try to match.  The [Spark style guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has some examples, and it links to the much more extensive Scala style guide.
    
    I'll wait for the predict() patch & additional tests.
    
    I'll try to run some scaling tests myself and will put some results up here before long.
    
    Thanks!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092915
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    --- End diff --
    
    +1 on @jkbradley's suggestion, which we can do in a follow-up PR.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68299685
  
      [Test build #555 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/555/consoleFull) for   PR 3022 at commit [`aaa8f25`](https://github.com/apache/spark/commit/aaa8f25a579d9c9aa191734377b503fb73299b78).
     * This patch merges cleanly.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092939
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    --- End diff --
    
    Please move the constructor to the beginning.
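
One way to read this request is that the auxiliary constructor should come directly after the class header, ahead of type aliases, helpers, and setters. A minimal sketch on a hypothetical, trimmed-down class (`EmSettings` is an illustration only and is not part of the patch):

```scala
// Hypothetical stand-in for the class under review, reduced to its settings.
class EmSettings private (
    private var k: Int,
    private var convergenceTol: Double,
    private var maxIterations: Int) extends Serializable {

  /** A default instance: 2 Gaussians, 0.01 log-likelihood threshold, 100 iterations. */
  def this() = this(2, 0.01, 100)

  // Helpers, setters, and getters follow the constructor.
  def setK(k: Int): this.type = {
    this.k = k
    this
  }

  def getK: Int = k

  def getConvergenceTol: Double = convergenceTol

  def getMaxIterations: Int = maxIterations
}
```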



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655836
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,35 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian mixture model consisting of k Gaussians, where points are drawn 
    + * from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective 
    + * mean and covariance for each Gaussian distribution i=1..k. 
    --- End diff --
    
    I would describe parameters using the param syntax.  E.g.:
    ```
    @param mu  Means for each Gaussian distribution in the mixture, where mu(i) is the mean for Gaussian i.
    ```



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092934
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    --- End diff --
    
    The code could be more readable if we define `ExpectationSum` as a private class with `var loglik`, `weights`, `mean`, and `cov`, and then implement an `add` method. That way we avoid `._1`, `._2`, ... in the code.
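
A rough sketch of the shape this suggestion points at, using the field names from the comment. Plain `Array[Double]` values stand in for the Breeze vectors and matrices used in the patch, and the constructor signature is an assumption, so treat this as an illustration rather than the final implementation.

```scala
// Hypothetical sketch: named fields replace the tuple positions _1 .. _4.
class ExpectationSum(k: Int, d: Int) extends Serializable {
  var loglik: Double = 0.0
  val weights: Array[Double] = new Array[Double](k)
  // One length-d mean accumulator per Gaussian.
  val mean: Array[Array[Double]] = Array.fill(k)(new Array[Double](d))
  // One d x d covariance accumulator per Gaussian, stored flat.
  val cov: Array[Array[Double]] = Array.fill(k)(new Array[Double](d * d))

  /** Adds `other` into this instance in place and returns this instance. */
  def add(other: ExpectationSum): this.type = {
    loglik += other.loglik
    var i = 0
    while (i < weights.length) {
      weights(i) += other.weights(i)
      addInPlace(mean(i), other.mean(i))
      addInPlace(cov(i), other.cov(i))
      i += 1
    }
    this
  }

  private def addInPlace(dst: Array[Double], src: Array[Double]): Unit = {
    var j = 0
    while (j < dst.length) {
      dst(j) += src(j)
      j += 1
    }
  }
}
```

Because `add` mutates and returns the receiver, it has the `(U, U) => U` shape that the combine step of `aggregate` expects, for example `(a, b) => a.add(b)`.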



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-61361694
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22688/



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655827
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    --- End diff --
    
    Here and elsewhere, please use the camelCase naming convention (for example, `initCov` rather than `init_cov`).
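
To illustrate the rename being requested, here is a minimal sketch of the two helpers with camelCase names. It is not the code from the pull request: the names `vecMean` and `initCov` are assumed spellings, plain `Array[Double]` stands in for the Breeze vector and matrix types, and the body of `initCov` is reconstructed only from its doc comment (diagonal matrix of biased element-wise variances).

```scala
object CamelCaseHelpers {
  /** Average of equal-length vectors (named `vec_mean` in the diff). */
  def vecMean(x: Array[Array[Double]]): Array[Double] = {
    val d = x(0).length
    val sums = new Array[Double](d)
    for (v <- x; j <- 0 until d) sums(j) += v(j)
    sums.map(_ / x.length)
  }

  /**
   * Diagonal matrix whose entries are the biased element-wise variances
   * of the input vectors (named `init_cov` in the diff).
   */
  def initCov(x: Array[Array[Double]]): Array[Array[Double]] = {
    val mu = vecMean(x)
    val d = mu.length
    val sumSq = new Array[Double](d)
    for (v <- x; j <- 0 until d) sumSq(j) += (v(j) - mu(j)) * (v(j) - mu(j))
    // Off-diagonal entries stay zero; the diagonal holds sumSq / n.
    Array.tabulate(d, d)((r, c) => if (r == c) sumSq(r) / x.length else 0.0)
  }
}
```

The same convention would presumably apply to the local values in `run` (for example `acc_w`, `acc_mu` and `log_likelihood`).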



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655837
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,35 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * Multivariate Gaussian mixture model consisting of k Gaussians, where points are drawn 
    + * from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are the respective 
    + * mean and covariance for each Gaussian distribution i=1..k. 
    + */
    +class GaussianMixtureModel(
    +  val w: Array[Double], 
    --- End diff --
    
    It might be good to use a more explicit parameter name than "w". Maybe "weight" or "weights"?
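
As a rough sketch of the suggestion, the constructor could spell out the parameter name. The class name below is hypothetical, and plain arrays stand in for the MLlib `Vector` and `Matrix` types so that the snippet compiles on its own; only the rename of `w` comes from the comment above.

```scala
/** Hypothetical stand-in for the model class, with an explicit name instead of `w`. */
final class GaussianMixtureModelSketch(
    val weights: Array[Double],          // mixing weight of each Gaussian
    val mu: Array[Array[Double]],        // one mean vector per Gaussian
    val sigma: Array[Array[Array[Double]]]) { // one covariance matrix per Gaussian

  require(weights.length == mu.length && mu.length == sigma.length,
    "weights, mu and sigma must all have one entry per Gaussian")

  /** Number of Gaussians in the mixture. */
  def k: Int = weights.length
}
```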



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655811
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    --- End diff --
    
    Please add a blank line between groups of imports (here, between the breeze imports and the org.apache.spark imports).
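
A small sketch of the layout being asked for. It uses only standard-library imports so that it compiles on its own; the ordering shown (java, then scala, then third-party, then project imports) is my understanding of the usual Spark convention and should be treated as an assumption. `ImportGroupsSketch` and `draw` are hypothetical names.

```scala
import java.util.Random

import scala.collection.mutable.ArrayBuffer

// A third-party group (for example the breeze.linalg imports) would follow
// here, separated by a blank line, and then the org.apache.spark group.

object ImportGroupsSketch {
  /** Draws n pseudo-random digits; exists only so both imports are used. */
  def draw(seed: Long, n: Int): Seq[Int] = {
    val rng = new Random(seed)
    val buf = ArrayBuffer.empty[Int]
    for (_ <- 0 until n) buf += rng.nextInt(10)
    buf.toSeq
  }
}
```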



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67486366
  
    Ok, I have addressed (I think) all of those issues, with the exception of modifying GaussianMixtureModel to carry instances of MultivariateGaussian.  I do like that idea, but think it would be best to create a new issue around solidifying MultivariateGaussian, then revisit this modification.  I'd be more than happy to work on the PR for making MultivariateGaussian public.
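
For context, the deferred idea (a model that carries Gaussian objects rather than parallel arrays of means and covariances) might look roughly like the sketch below. Everything here is hypothetical: `DiagonalGaussian` is a simplified stand-in restricted to diagonal covariance, not the `MultivariateGaussian` class from the pull request, and `MixtureModel` is not the proposed API.

```scala
object MixtureOfGaussiansSketch {
  /** Simplified stand-in for MultivariateGaussian (diagonal covariance only). */
  final case class DiagonalGaussian(mu: Array[Double], variances: Array[Double]) {
    require(mu.length == variances.length, "mean and variance dimensions differ")

    /** Density at x, accumulated in log space for numerical stability. */
    def pdf(x: Array[Double]): Double = {
      var logDensity = 0.0
      for (j <- mu.indices) {
        val diff = x(j) - mu(j)
        logDensity += -0.5 * math.log(2.0 * math.Pi * variances(j)) -
          diff * diff / (2.0 * variances(j))
      }
      math.exp(logDensity)
    }
  }

  /** Mixture that holds its components as objects, one weight per component. */
  final class MixtureModel(
      val weights: Array[Double],
      val gaussians: Array[DiagonalGaussian]) {
    require(weights.length == gaussians.length, "one weight per Gaussian is required")

    def k: Int = weights.length

    /** Mixture density: weighted sum of the component densities. */
    def pdf(x: Array[Double]): Double =
      weights.indices.map(i => weights(i) * gaussians(i).pdf(x)).sum
  }
}
```

One appeal of this shape is that the density logic lives with each component, so the model itself only has to combine weights and components.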




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016183
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    --- End diff --
    
    It's more common to name the SparkContext ```sc``` rather than ```ctx```.
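
For readers following the diff above, the pattern behind `computeExpectation` and `addExpectationSums` (fold each point into running sums, then derive new weights, means and covariances from them) can be sketched in plain Scala for the one-dimensional case. This is a simplified illustration under my own assumptions, not the pull request's code: it uses no Spark or Breeze, scalar variances replace covariance matrices, and all names (`EmPassSketch`, `Sums`, `addPoint`, `update`, `onePass`) are hypothetical.

```scala
object EmPassSketch {
  /** Running sums for one pass: log-likelihood plus per-component totals. */
  final case class Sums(
      logLik: Double,
      w: Vector[Double],    // sum of responsibilities
      wx: Vector[Double],   // sum of responsibility * x
      wxx: Vector[Double])  // sum of responsibility * x * x

  def zero(k: Int): Sums =
    Sums(0.0, Vector.fill(k)(0.0), Vector.fill(k)(0.0), Vector.fill(k)(0.0))

  // Same small constant as in the diff; keeps the normalizer away from zero.
  private val eps = math.pow(2.0, -52)

  def gaussPdf(x: Double, mu: Double, variance: Double): Double =
    math.exp(-(x - mu) * (x - mu) / (2.0 * variance)) /
      math.sqrt(2.0 * math.Pi * variance)

  /** "E" step for one point: add its responsibilities to the running sums. */
  def addPoint(
      weights: Vector[Double],
      mus: Vector[Double],
      variances: Vector[Double])(s: Sums, x: Double): Sums = {
    val p = weights.indices
      .map(i => eps + weights(i) * gaussPdf(x, mus(i), variances(i))).toVector
    val pSum = p.sum
    val r = p.map(_ / pSum)
    Sums(
      s.logLik + math.log(pSum),
      s.w.zip(r).map { case (a, b) => a + b },
      s.wx.zip(r).map { case (a, b) => a + b * x },
      s.wxx.zip(r).map { case (a, b) => a + b * x * x })
  }

  /** "M" step: new (weights, means, variances) from the accumulated sums. */
  def update(s: Sums): (Vector[Double], Vector[Double], Vector[Double]) = {
    val total = s.w.sum
    val weights = s.w.map(_ / total)
    val mus = s.wx.zip(s.w).map { case (a, w) => a / w }
    val variances = s.wxx.zip(s.w).zip(mus).map { case ((a, w), m) => a / w - m * m }
    (weights, mus, variances)
  }

  /** One full EM iteration over local data; returns new parameters and log-likelihood. */
  def onePass(
      data: Seq[Double],
      weights: Vector[Double],
      mus: Vector[Double],
      variances: Vector[Double]): ((Vector[Double], Vector[Double], Vector[Double]), Double) = {
    val sums = data.foldLeft(zero(weights.length))(
      (s, x) => addPoint(weights, mus, variances)(s, x))
    (update(sums), sums.logLik)
  }
}
```

On an RDD the fold would be replaced by an aggregation with a separate combine function for merging per-partition sums, which is the role `addExpectationSums` plays in the diff.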



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67885369
  
    Ok.  I changed the visibility of EPSILON and am now using it in this code.
    I changed the name from GaussianMixtureModelEM to GaussianMixtureEM.
    I've changed predictLabels() back to predict().




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016175
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    +      
    +    val clusters = new GaussianMixtureModelEM()
    +      .setK(k)
    +      .setConvergenceTol(convergenceTol)
    +      .run(data)
    +    
    +    for (i <- 0 until clusters.k) {
    +      println("weight=%f mu=%s sigma=\n%s\n" format 
    +        (clusters.weight(i), clusters.mu(i), clusters.sigma(i)))
    +    }
    +    
    +    val (responsibilityMatrix, clusterLabels) = clusters.predict(data)
    +    for (x <- clusterLabels.collect) {
    --- End diff --
    
    Can this print a line here saying "Cluster labels:"?
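
A minimal sketch of the requested output, assuming the loop simply prints one label per line (the loop body is not visible in the quoted diff). `ClusterLabelOutput` and `formatLabels` are hypothetical names, and a local `Seq[Int]` stands in for `clusterLabels.collect`.

```scala
object ClusterLabelOutput {
  /** Header line followed by one line per label. */
  def formatLabels(labels: Seq[Int]): Seq[String] =
    "Cluster labels:" +: labels.map(_.toString)

  def main(args: Array[String]): Unit = {
    // In the example program the labels would come from clusterLabels.collect.
    formatLabels(Seq(0, 1, 1, 0)).foreach(println)
  }
}
```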



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655830
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    +    (0 until x.length).map(i => (x(i) - mu) :^ 2.0).foreach(u => ss += u)
    --- End diff --
    
    Can you simplify this to be:
    ```
    val ss = x.map(xi => (xi - mu) :^ 2.0).sum
    ```
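
A minimal sketch of the simplification suggested above, using plain Scala arrays in place of Breeze vectors so that it runs without extra dependencies. The object and helper names (`VarianceSketch`, `sumSquaredDeviations`) are hypothetical. Scala's collection `.sum` requires an implicit `Numeric` for the element type, and this thread does not confirm that one is available for Breeze vectors, so the sketch folds with an explicit `reduce` instead.

```scala
object VarianceSketch {
  // Hypothetical helper: element-wise sum of squared deviations from mu.
  // Plain Array[Double] stands in for the Breeze DenseVector used in the PR.
  def sumSquaredDeviations(x: Array[Array[Double]], mu: Array[Double]): Array[Double] =
    x.map(xi => xi.zip(mu).map { case (a, m) => (a - m) * (a - m) })
      .reduce((u, v) => u.zip(v).map { case (a, b) => a + b })

  def main(args: Array[String]): Unit = {
    val x = Array(Array(1.0, 2.0), Array(3.0, 6.0))
    val mu = Array(2.0, 4.0)
    println(sumSquaredDeviations(x, mu).mkString(", "))
  }
}
```

Note that `reduce` throws on an empty input array, so a caller would need at least one sample.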


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083563
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    --- End diff --
    
    This could be a Double instead of an Array.
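
One way to act on this comment, sketched under assumptions: a tuple field cannot be reassigned in place, which is presumably why the diff wraps the log-likelihood in a one-element array. A small class with a `var` field lets it be a plain `Double`. The names `ExpectationSumSketch`, `logLikelihood`, and `merge` are hypothetical, and the mean and covariance arrays are omitted for brevity.

```scala
object ExpectationSumSketch {
  // Hypothetical replacement for the 4-tuple: the log-likelihood is a plain
  // mutable Double field rather than a one-element Array[Double].
  final class ExpectationSum(k: Int) extends Serializable {
    var logLikelihood: Double = 0.0
    val weights: Array[Double] = new Array[Double](k)

    // Merge another partial sum into this one (mutates and returns this).
    def merge(other: ExpectationSum): this.type = {
      logLikelihood += other.logLikelihood
      for (i <- weights.indices) weights(i) += other.weights(i)
      this
    }
  }
}
```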


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860758
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    +                                  vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                                  }.toArray
    +    
    +    val accW     = new Array[Accumulator[Double]](k)
    +    val accMu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val accSigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // reset accumulators
    +      for (i <- 0 until k) {
    +        accW(i)     = ctx.accumulator(0.0)
    +        accMu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        accSigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val logLikelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map{ i => 
    +                                  new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +                                }.toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => gaussians(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map{ i => 
    +                  eps + weights.value(i) * dists.value(i).pdf(x)
    +                }.toArray
    +        
    +        val pSum = p.sum 
    +        
    +        logLikelihood += math.log(pSum)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for (i <- 0 until k) {
    +          p(i) /= pSum
    +          accW(i) += p(i)
    +          accMu(i) += x * p(i)
    +          accSigma(i) += xxt * p(i)
    +        }
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => accW(i).value).toArray
    +      val MU = (0 until k).map(i => accMu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => accSigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      gaussians = (0 until k).map{ i => {
    --- End diff --
    
    No need for second "{" in this line
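
For illustration, a multi-statement lambda body needs only the one pair of braces that follows `map`. The sketch below is a hypothetical stand-in for the M step (it only normalizes accumulated weights); `BraceStyleSketch` and `normalize` are assumed names, not code from the PR.

```scala
object BraceStyleSketch {
  // Hypothetical stand-in for the M step: normalize accumulated weights.
  def normalize(w: Array[Double]): Array[Double] = {
    val total = w.sum
    // A single pair of braces is enough for a multi-statement lambda body.
    w.indices.map { i =>
      val weight = w(i) / total
      weight
    }.toArray
  }
}
```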


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016187
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    --- End diff --
    
    For the record, you can simplify this:
    ```
    val weights = gaussians.map(_._1)
    ```
    (but my other comments may obviate the need for this change)
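
A small sketch of the suggested form, with plain Scala types standing in for the Breeze vector and matrix in each `(weight, mean, covariance)` tuple. The `Component` alias and `weightsOf` helper are hypothetical names used only for this illustration.

```scala
object PivotSketch {
  // Hypothetical (weight, mean, variance) triples; plain Scala types stand in
  // for the Breeze vector and matrix used in the PR.
  type Component = (Double, Array[Double], Double)

  // Mapping over the array directly avoids the index range and the
  // trailing toArray, because Array.map already returns an Array.
  def weightsOf(gaussians: Array[Component]): Array[Double] = gaussians.map(_._1)
}
```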


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67583382
  
    I agree about 100 features being too big for clustering, but I wanted to get some sense of scaling w.r.t. features.  (It basically makes the matrix inverses take a long time.)


[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22184915
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    --- End diff --
    
    I like the idea of being able to get back soft clustering results, not just hard predictions.  I'm voting for having predictMembership() return soft clusterings (Vector of cluster membership degrees for each cluster), and predict() return hard clusterings (Int indicating cluster, as in KMeansModel).
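
    A minimal sketch of the split proposed above, assuming per-point helpers rather than the RDD-based methods in the diff. The object name `SoftHardSketch` and both signatures are hypothetical; the input is taken to be the already-computed weighted densities w(i) * pdf_i(x) for a single point.

```scala
// Hypothetical per-point helpers; the PR itself operates on RDD[Vector].
object SoftHardSketch {

  /** Soft clustering: normalise the weighted densities so they sum to 1. */
  def predictMembership(weightedDensities: Array[Double]): Array[Double] = {
    val total = weightedDensities.sum
    weightedDensities.map(_ / total)
  }

  /** Hard clustering: index of the component with the largest membership. */
  def predict(weightedDensities: Array[Double]): Int = {
    val membership = predictMembership(weightedDensities)
    membership.indexOf(membership.max)
  }
}
```

    With this shape, predict() stays comparable to KMeansModel (one Int per point), while callers who want the full responsibilities ask for them explicitly.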



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22061331
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.impl
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, pinv}
    +
    +/** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +private[mllib] class MultivariateGaussian(
    +    val mu: BreezeVector[Double], 
    +    val sigma: BreezeMatrix[Double]) extends Serializable {
    +  private val sigmaInv2 = pinv(sigma) * -0.5
    +  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
    --- End diff --
    
    By the way, `det` and `pinv` each factorize the matrix, so `sigma` is factorized twice.  It would be better to do one factorization (such as SVD) and then compute both the determinant and the inverse from it.  We can do that in a follow-up PR though.
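
    One way to picture the single-factorization idea is sketched below. Note the swapped-in technique: it uses a Cholesky factorization in plain Scala rather than the SVD mentioned above, so it assumes a symmetric positive-definite covariance and, unlike `pinv`, does not handle a singular one. The class name `GaussianSketch` and its methods are hypothetical and are not part of the PR.

```scala
// Hypothetical sketch: factor sigma once, then reuse the factor for both
// the determinant and the quadratic form (x - mu)^T sigma^-1 (x - mu).
final class GaussianSketch(mu: Array[Double], sigma: Array[Array[Double]]) {
  private val n = mu.length

  // Lower-triangular Cholesky factor L with sigma = L * L^T.
  private val l: Array[Array[Double]] = {
    val f = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      var s = sigma(i)(j)
      for (m <- 0 until j) s -= f(i)(m) * f(j)(m)
      if (i == j) {
        require(s > 0.0, "sigma must be symmetric positive definite")
        f(i)(i) = math.sqrt(s)
      } else {
        f(i)(j) = s / f(j)(j)
      }
    }
    f
  }

  // log det(sigma) = 2 * sum_i log(L_ii), taken from the same factor.
  private val logDet: Double = 2.0 * (0 until n).map(i => math.log(l(i)(i))).sum

  // Normalisation term of the log-density, computed once.
  private val logNorm: Double = -0.5 * (n * math.log(2.0 * math.Pi) + logDet)

  def det: Double = math.exp(logDet)

  /** Log-density at x; forward substitution solves L z = (x - mu). */
  def logPdf(x: Array[Double]): Double = {
    val z = new Array[Double](n)
    for (i <- 0 until n) {
      var s = x(i) - mu(i)
      for (m <- 0 until i) s -= l(i)(m) * z(m)
      z(i) = s / l(i)(i)
    }
    logNorm - 0.5 * z.map(v => v * v).sum
  }

  def pdf(x: Array[Double]): Double = math.exp(logPdf(x))
}
```

    The factor is built once in the constructor, so each density evaluation costs only a triangular solve. An SVD or eigendecomposition would play the same role while also covering rank-deficient covariances.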



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655832
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    +    (0 until x.length).map(i => (x(i) - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => result(i,i) = ss(i) / x.length)
    +    result
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Vectors */
    +  private object DenseDoubleVectorAccumulatorParam extends AccumulatorParam[DenseDoubleVector] {
    +    def zero(initialVector : DenseDoubleVector) : DenseDoubleVector = {
    +      BreezeVector.zeros[Double](initialVector.length)
    +    }
    +    
    +    def addInPlace(a : DenseDoubleVector, b : DenseDoubleVector) : DenseDoubleVector = {
    +      a += b
    +    }
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Matrices */
    +  private object DenseDoubleMatrixAccumulatorParam extends AccumulatorParam[DenseDoubleMatrix] {
    +    def zero(initialVector : DenseDoubleMatrix) : DenseDoubleMatrix = {
    +      BreezeMatrix.zeros[Double](initialVector.rows, initialVector.cols)
    +    }
    +    
    +    def addInPlace(a : DenseDoubleMatrix, b : DenseDoubleMatrix) : DenseDoubleMatrix = {
    +      a += b
    +    }
    +  }  
    +  
    +  /** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +  private class MultivariateGaussian(val mu : DenseDoubleVector, val sigma : DenseDoubleMatrix) 
    --- End diff --
    
    Could you please move this class to a new folder mllib/stat/impl/ ?  There are a few PRs introducing standard distributions.  I think we should keep them private for now but collect them in stat/impl/.  Later on, we can standardize the APIs and make them public classes.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by manishamde <gi...@git.apache.org>.

Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-61361549
  
    @tgaloppo Thanks for the PR and congratulations on the first contribution. Apologies for the lack of feedback thus far -- I guess everyone is busy with the 1.2 release deadline on Nov 1. I will take a look at the PR in the next few days. 
    
    Please make sure you get the JIRA assigned to yourself next time before working. It's the only way to avoid duplicate work. 
    
    cc: @jkbradley, @mengxr



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655821
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    --- End diff --
    
    extra whitespace after }



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016195
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +      }.toArray
    +      
    +      // create and broadcast curried cluster contribution function
    +      val compute = ctx.broadcast(computeExpectation(weights, dists)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val sums = breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Assignments to make the code more readable
    +      val logLikelihood = sums._1(0)
    +      val W = sums._2
    +      val MU = sums._3
    +      val SIGMA = sums._4
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      gaussians = (0 until k).map{ i => 
    +        val weight = W(i) / W.sum
    +        val mu = MU(i) / W(i)
    +        val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +        (weight, mu, sigma)
    +      }.toArray
    +      
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: Array[DenseDoubleVector]): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def initCovariance(x: Array[DenseDoubleVector]): DenseDoubleMatrix = {
    +    val mu = vectorMean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val cov = BreezeMatrix.eye[Double](ss.length)
    +    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => cov(i,i) = ss(i) / x.length)
    +    cov
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictClusters(points: RDD[Vector], mu: Array[Vector], sigma: Array[Matrix],
    +      weight: Array[Double], k: Int): RDD[Array[Double]] = {
    +    val ctx = points.sparkContext
    +    val dists = ctx.broadcast{
    +      (0 until k).map{ i => 
    +        new MultivariateGaussian(mu(i).toBreeze.toDenseVector, sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +    }
    +    val weights = ctx.broadcast((0 until k).map(i => weight(i)).toArray)
    +    points.map{ x => 
    +      computeSoftAssignments(x.toBreeze.toDenseVector, dists.value, weights.value, k)
    +    }
    +  }
    +  
    +  /**
    +   * Compute the partial assignments for each vector
    +   */
    +  def computeSoftAssignments(pt: DenseDoubleVector, dists: Array[MultivariateGaussian],
    --- End diff --
    
    Scala style again.
    Also, can this be private?



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016178
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    --- End diff --
    
    It is more common to write:
    ```
    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    ```
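
For readers comparing the two forms, the following is a minimal, dependency-free sketch rather than code from the pull request. `Gaussian1D` is a hypothetical one-dimensional stand-in for `MultivariateGaussian`, and `ZipResponsibilities`, `indexed`, and `zipped` are illustrative names; only `eps` and the shape of the expression come from the diff above.

```scala
// Hypothetical 1-D stand-in for MultivariateGaussian (an assumption, not Spark code).
final case class Gaussian1D(mean: Double, variance: Double) {
  // Density of a univariate normal distribution at x.
  def pdf(x: Double): Double = {
    val diff = x - mean
    math.exp(-diff * diff / (2.0 * variance)) / math.sqrt(2.0 * math.Pi * variance)
  }
}

object ZipResponsibilities {
  // Same constant as in the diff under review.
  val eps: Double = math.pow(2.0, -52)

  // Index-based form, mirroring the line in the diff.
  def indexed(weights: Array[Double], dists: Array[Gaussian1D], x: Double): Array[Double] =
    (0 until weights.length).map(i => eps + weights(i) * dists(i).pdf(x)).toArray

  // zip-based form, mirroring the reviewer's suggestion.
  def zipped(weights: Array[Double], dists: Array[Gaussian1D], x: Double): Array[Double] =
    weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
}
```

Because `zip` on two arrays yields an array, the result can still be normalized in place as the surrounding code does. One difference worth keeping in mind: `zip` silently truncates to the shorter input, whereas the indexed form throws if `dists` is shorter than `weights`.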



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016191
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +      }.toArray
    +      
    +      // create and broadcast curried cluster contribution function
    +      val compute = ctx.broadcast(computeExpectation(weights, dists)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val sums = breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Assignments to make the code more readable
    +      val logLikelihood = sums._1(0)
    +      val W = sums._2
    +      val MU = sums._3
    +      val SIGMA = sums._4
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      gaussians = (0 until k).map{ i => 
    +        val weight = W(i) / W.sum
    --- End diff --
    
    `W.sum` should be computed only once, outside this loop, not on every pass through it.
    After that change, it may work well to avoid defining `W`, `MU`, and `SIGMA`, and instead call `sums.map ...`.  With that change, you will not need the `(i)` indices within this loop.
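
One possible reading of this suggestion is sketched below; it is an illustration under stated assumptions, not the change that was eventually merged. The sketch is simplified to one dimension, so the accumulated means and covariances are plain `Double` values instead of Breeze vectors and matrices, and `MStepSketch` and `mStep` are hypothetical names. Since the `ExpectationSum` in the diff is a tuple of arrays, which has no element-wise `map`, the sketch zips the arrays to drop the `(i)` indices.

```scala
object MStepSketch {
  // 1-D simplification (assumption): w holds the summed responsibilities,
  // mu the responsibility-weighted sums of x, sigma the weighted sums of x * x.
  // Returns one (weight, mean, variance) tuple per mixture component.
  def mStep(
      w: Array[Double],
      mu: Array[Double],
      sigma: Array[Double]): Array[(Double, Double, Double)] = {
    val sumW = w.sum // computed once, outside the per-component closure
    w.zip(mu).zip(sigma).map { case ((wi, mi), si) =>
      val mean = mi / wi
      (wi / sumW, mean, si / wi - mean * mean)
    }
  }
}
```

Hoisting the sum turns k summations of a length-k array into one, which matters little for small k but keeps the closure free of repeated work.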



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860755
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    +                                  vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                                  }.toArray
    +    
    +    val accW     = new Array[Accumulator[Double]](k)
    +    val accMu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val accSigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // reset accumulators
    +      for (i <- 0 until k) {
    +        accW(i)     = ctx.accumulator(0.0)
    +        accMu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        accSigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val logLikelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map{ i => 
    +                                  new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    indentation (as above)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092909
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <covergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  private def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    --- End diff --
    
    space before `{` (please also update others)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092942
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    --- End diff --
    
    remove `[Double]`
    `(0 until k).map { i =>` -> `Array.tabulate(k) { i =>`
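
As a minimal sketch of the two suggestions above (not code from the PR): `Array.fill` infers its element type from the value it is given, and `Array.tabulate(k)(f)` fills the array directly rather than going through `(0 until k).map(f).toArray`. The object name `TabulateSketch` and the stand-in function `component` are hypothetical; in the PR the body builds a `MultivariateGaussian` per index.

```scala
object TabulateSketch {
  // Hypothetical stand-in for the per-component work done in the PR.
  def component(i: Int): Int = i * i

  def main(args: Array[String]): Unit = {
    val k = 4

    // The element type Double is inferred from the value 1.0 / k,
    // so an explicit [Double] type argument is redundant.
    val weights = Array.fill(k)(1.0 / k)

    // Original pattern: builds an intermediate IndexedSeq, then copies it.
    val viaRange = (0 until k).map { i => component(i) }.toArray

    // Suggested pattern: writes each element straight into the array.
    val viaTabulate = Array.tabulate(k) { i => component(i) }

    println(weights.mkString(", "))
    println(viaRange.sameElements(viaTabulate))
  }
}
```

Both forms should produce the same elements; the `tabulate` form is shorter and avoids the intermediate collection.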



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22136408
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    +        weights(i) = wSums(i) / sumWeights
    +        gaussians(i) = new MultivariateGaussian(mu, sigma)
    +      }
    +   
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood(0) // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i).mu)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i).sigma)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: VectorArrayView): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def initCovariance(x: VectorArrayView): DenseDoubleMatrix = {
    +    val mu = vectorMean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val cov = BreezeMatrix.eye[Double](ss.length)
    +    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
    --- End diff --
    
    squaredDistance returns a scalar... I want the squared entry values.
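
The distinction drawn in this reply can be sketched with plain Scala arrays standing in for Breeze vectors. This is an illustration under stated assumptions, not the PR's code: the helper names `squaredDistance`, `squaredEntries`, and `diagonalVariance` are hypothetical, and dividing by `n` is assumed from the diff's "computes biased variance" comment.

```scala
object SquaredEntriesSketch {
  // Scalar result: the sum of squared differences between two vectors.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum

  // Vector result: each entry of (a - b) squared,
  // analogous to `(xi - mu) :^ 2.0` in the diff.
  def squaredEntries(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }

  // Biased per-dimension variance: the values that would sit on the
  // diagonal of an initial covariance matrix.
  def diagonalVariance(xs: Seq[Array[Double]]): Array[Double] = {
    val n = xs.length.toDouble
    val d = xs.head.length
    val mu = Array.tabulate(d) { j => xs.map(_(j)).sum / n }
    val ss = new Array[Double](d)
    xs.foreach { x =>
      val sq = squaredEntries(x, mu)
      var j = 0
      while (j < d) {
        ss(j) += sq(j)
        j += 1
      }
    }
    ss.map(_ / n)
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(Array(1.0, 2.0), Array(3.0, 6.0))
    println(squaredDistance(xs(0), xs(1)))
    println(squaredEntries(xs(0), xs(1)).mkString(", "))
    println(diagonalVariance(xs).mkString(", "))
  }
}
```

A scalar distance collapses all dimensions into one number, so it cannot supply one variance per dimension; the element-wise square keeps them separate.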



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68315285
  
    @tgaloppo I've merged this into master. Thanks for contributing GMM!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092918
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where mu(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance maxtrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    --- End diff --
    
    space after each `,`



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    --- End diff --
    
    Use a temporary value to store the slice so it is not computed twice. Also, I believe using `samples.view.slice` will be more efficient, since it avoids creating an explicit copy of the data.
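
To illustrate the suggestion above, here is a minimal sketch, assuming scalar samples instead of Breeze vectors. The names `SliceSketch`, `mean`, and `groupMeans` are hypothetical and do not appear in the PR; `mean` only stands in for the PR's `vectorMean`. The slice is bound to a value once, and going through `view` means the sub-range is not copied into a new array.

```scala
object SliceSketch {
  // Hypothetical scalar stand-in for the PR's vectorMean helper.
  def mean(xs: Iterable[Double]): Double = {
    var sum = 0.0
    var n = 0
    xs.foreach { x => sum += x; n += 1 }
    sum / n
  }

  // Split `samples` into consecutive groups of `nSamples` and average each group.
  def groupMeans(samples: Array[Double], nSamples: Int): Array[Double] = {
    val k = samples.length / nSamples
    Array.tabulate(k) { i =>
      // Bind the view once so the same slice can be reused by several
      // computations; the view reads the underlying array without copying it.
      val slice = samples.view.slice(i * nSamples, (i + 1) * nSamples)
      mean(slice)
    }
  }
}
```

If the same slice feeds both the mean and the covariance initialization, both calls can take `slice` as their argument, so the range is described once and never materialized.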



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092923
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    +      points: RDD[Vector], 
    +      mu: Array[Vector], 
    +      sigma: Array[Matrix],
    +      weight: Array[Double], k: Int): RDD[Array[Double]] = {
    +    val sc = points.sparkContext
    +    val dists = sc.broadcast{
    +      (0 until k).map{ i => 
    +        new MultivariateGaussian(mu(i).toBreeze.toDenseVector, sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +    }
    +    val weights = sc.broadcast(weight)
    +    points.map{ x => 
    +      computeSoftAssignments(x.toBreeze.toDenseVector, dists.value, weights.value, k)
    +    }
    +  }
    +  
    +  // We use "eps" as the minimum likelihood density for any given point
    +  // in every cluster; this prevents any divide by zero conditions for
    +  // outlier points.
    +  private val eps = math.pow(2.0, -52)
    --- End diff --
    
    EPS is defined in `MLUtils.EPSILON`.
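
For context, the value `math.pow(2.0, -52)` in the diff is the machine epsilon for IEEE 754 doubles. The sketch below shows one common way to derive that constant at runtime; the name `EpsilonSketch` is hypothetical, and this is not claimed to be how `MLUtils.EPSILON` is implemented.

```scala
object EpsilonSketch {
  // Halve eps until adding half of it to 1.0 no longer changes the result.
  // What remains is the gap between 1.0 and the next representable double.
  def machineEpsilon: Double = {
    var eps = 1.0
    while ((1.0 + (eps / 2.0)) != 1.0) {
      eps /= 2.0
    }
    eps
  }
}
```

Reusing a single shared constant, as the comment suggests, avoids having the same literal defined separately in `GaussianMixtureModel` and `GaussianMixtureModelEM`.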



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083570
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    --- End diff --
    
    Ditto here (use a while loop instead of a for loop)
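
As a sketch of the pattern being requested: a Scala `for (i <- 0 until k)` desugars to a `foreach` call on a `Range` with a closure, which can be slower in hot inner loops than an explicit `while` loop. The helper below is hypothetical (`LoopSketch.addInPlace` is not part of the PR) and only shows the shape of the rewrite on plain arrays.

```scala
object LoopSketch {
  // In-place element-wise accumulation: acc(i) += delta(i) for every index.
  def addInPlace(acc: Array[Double], delta: Array[Double]): Array[Double] = {
    require(acc.length == delta.length, "arrays must have the same length")
    var i = 0
    while (i < acc.length) {
      acc(i) += delta(i)
      i += 1
    }
    acc
  }
}
```

The loop body is the same as in the `for` version; only the iteration mechanism changes, so the rewrite is mechanical.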



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67445452
  
    Working on these changes; still a few left.
    Great feedback; really helping to improve my Scala!




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by FlytxtRnD <gi...@git.apache.org>.

Github user FlytxtRnD commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68335194
  
    @tgaloppo Good Work
    @mengxr Thanks for giving us a chance to be a part of this contribution



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22185641
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,242 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import scala.collection.mutable.IndexedSeq
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix, diag, Transpose}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    --- End diff --
    
    Remove extra newlines



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67580723
  
    Here are some results I got using the text8-100 dataset.  It's just a local test (1 worker), but we can do larger-scale tests in the future.
    
    numInstances | k  | numFeatures | time (sec)  | avg kmeansCost*
    ------------ | -- | ----------- | ----------- | ---------------
    170053       | 2  | 10          | 46.53373526 | 0.772718978
    170053       | 4  | 10          | 65.32307584 | 0.720132151
    170053       | 16 | 10          | 225.6346969 | 0.632446005
    170053       | 64 | 10          | 894.4889346 | 0.525024814
    170053       | 2  | 100         | 39.96043946 | 76.93881368
    170053       | 4  | 100         | 65.9859325  | 76.93881368
    170053       | 16 | 100         | 224.1422487 | 76.93881368
    170053       | 64 | 100         | 875.0000443 | 76.93881368
    
    \* avg squared L2 distance
    
    I'll make one more pass, but I think this is basically ready.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22136877
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    --- End diff --
    
    I don't see dsyr in BLAS... perhaps I am out of date (or blind)?
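
For reference, `dsyr` in the standard BLAS is the symmetric rank-1 update `A := alpha * x * x^T + A`, which is the shape of the `mu * new Transpose(mu)` term in the line above. The sketch below spells that operation out in plain Scala under stated assumptions: `RankOneSketch.syr` is a hypothetical helper, the matrix is stored column-major in a flat array, and only the upper triangle is updated. It is not a wrapper around any Spark or netlib API.

```scala
object RankOneSketch {
  // Symmetric rank-1 update on the upper triangle:
  //   a(i, j) += alpha * x(i) * x(j)   for all i <= j
  // `a` is an n x n matrix stored column-major, so a(i, j) lives at a(i + j * n).
  def syr(alpha: Double, x: Array[Double], a: Array[Double]): Array[Double] = {
    val n = x.length
    require(a.length == n * n, "a must hold an n x n matrix")
    var j = 0
    while (j < n) {
      val scaled = alpha * x(j)
      var i = 0
      while (i <= j) {
        a(i + j * n) += x(i) * scaled
        i += 1
      }
      j += 1
    }
    a
  }
}
```

Updating one triangle does roughly half the arithmetic of forming the full outer product and avoids allocating a temporary d x d matrix for each update.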



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860754
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    --- End diff --
    
    I would format as:
    ```
    var gaussians = (0 until k).map{ i =>
      (1.0 / k, 
        vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)),
        initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    }.toArray
    ```
    (indentation + ending the first line with "=>")



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655839
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala ---
    @@ -0,0 +1,44 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    +import org.apache.spark.mllib.util.{LocalClusterSparkContext, MLlibTestSparkContext}
    --- End diff --
    
    No need to import LocalClusterSparkContext



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083574
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    --- End diff --
    
    typo: "getInitialiGmm" --> "getInitialGmm"



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21872495
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    +      
    +    val clusters = new GaussianMixtureModelEM()
    +        .setK(k)
    --- End diff --
    
    Sorry!  I just checked other places in the codebase, and it looks like +2 spaces is the correct indentation.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092954
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    --- End diff --
    
    Use `while` instead of `for`.
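
A `for` over a `Range` in Scala desugars to a `foreach` call with a closure, so tight numeric loops are often written with `while` instead. A minimal sketch of that style, applied to a per-cluster mean of the form `muSums(i) / wSums(i)` and using plain arrays rather than Breeze vectors. `MStepSketch` and `means` are hypothetical names, not part of the PR.

```scala
object MStepSketch {
  /** mu(i) = muSums(i) / wSums(i) for each cluster i, written with while loops. */
  def means(muSums: Array[Array[Double]], wSums: Array[Double]): Array[Array[Double]] = {
    val k = wSums.length
    val mu = new Array[Array[Double]](k)
    var i = 0
    while (i < k) {
      val d = muSums(i).length
      val m = new Array[Double](d)
      var j = 0
      while (j < d) {
        // divide each component of the weighted sum by the cluster's total weight
        m(j) = muSums(i)(j) / wSums(i)
        j += 1
      }
      mu(i) = m
      i += 1
    }
    mu
  }
}
```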



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092917
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    --- End diff --
    
    space after `,` (please also update others)
    
    Would it be simpler to return only `RDD[Array[Double]]` and let users compute the best cluster? Another option is to have `predict` return `RDD[Int]` and add a `predictRaw` method that returns `RDD[Array[Double]]`. By the way, the new pipeline API could address this easily, since we can add two columns and compute them on demand.
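
The first option leaves the argmax over cluster memberships to the caller. A minimal sketch of that step, using a plain `Seq` in place of an `RDD` so that it runs without Spark. `PredictSketch` and `bestCluster` are hypothetical names, and each membership array is assumed to be non-empty.

```scala
object PredictSketch {
  /** Index of the largest membership value for one point (first index wins ties). */
  def bestCluster(memberships: Array[Double]): Int = {
    var best = 0
    var i = 1
    while (i < memberships.length) {
      if (memberships(i) > memberships(best)) best = i
      i += 1
    }
    best
  }

  /** Stand-in for a hard-assignment `predict`, derived from the raw memberships. */
  def predict(raw: Seq[Array[Double]]): Seq[Int] = raw.map(bestCluster)
}
```

With an `RDD[Array[Double]]` the same function could be passed to `map`, so either API shape lets callers recover the other output.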



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22059411
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    I can add some small conditioning along the matrix diagonal prior to the inversion... before I do, however, please test the version I committed using pinv(), as I am rather confident it solves the issue without changing the behavior when the matrix would be invertible anyway.  If you find a failure, then I'll take the smoothing route.  Sound fair?
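
    A minimal sketch of the "conditioning along the diagonal" alternative mentioned above, using plain Scala arrays rather than Breeze. The names `addRidge` and `det2` are hypothetical and not part of the PR; the point is only that adding a small value to the diagonal of a singular covariance matrix gives it a nonzero determinant, so an ordinary inverse exists.

```scala
object Conditioning {
  // Return a copy of a square matrix with `ridge` added to each diagonal entry.
  def addRidge(m: Array[Array[Double]], ridge: Double): Array[Array[Double]] =
    Array.tabulate(m.length, m.length) { (i, j) =>
      if (i == j) m(i)(j) + ridge else m(i)(j)
    }

  // Determinant of a 2x2 matrix.
  def det2(m: Array[Array[Double]]): Double =
    m(0)(0) * m(1)(1) - m(0)(1) * m(1)(0)

  // A rank-deficient covariance: both coordinates are perfectly correlated.
  val singular: Array[Array[Double]] = Array(Array(1.0, 1.0), Array(1.0, 1.0))
}
```

    The trade-off is that conditioning changes the matrix slightly even when it was already invertible, whereas pinv() leaves the invertible case unchanged.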



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-66822813
  
    @jkbradley I have pushed commits addressing [hopefully] all of the issues you pointed out.  Of particular concern to me is the movement of the utility MultivariateGaussian class; please make sure that what I did was what you had in mind.  
    
    I have a patch coming in from another contributor adding a "predict" method.
    
    I will try to construct an additional test or two in the very near future.  The single cluster test really just serves as a sanity check.
    
    My access to a cluster for scalability testing is limited; if the community is able to help with this, that would be great.
    
    Indeed, I realized I should have created a separate branch for the PR almost immediately... my goal is to become a regular contributor!



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68415536
  
    @jkbradley No problem.  Let's start with 5020, and I'll move on from there.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22061399
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.impl
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, pinv}
    +
    +/** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +private[mllib] class MultivariateGaussian(
    +    val mu: BreezeVector[Double], 
    +    val sigma: BreezeMatrix[Double]) extends Serializable {
    +  private val sigmaInv2 = pinv(sigma) * -0.5
    +  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
    --- End diff --
    
    So for pinv, I'd say it's OK to leave it as pinv for now.  We can handle it more carefully when we switch to an SVD later on.
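
    One reason the singular case may need more care later: pinv() always produces a matrix, but the normalizing constant in the quoted code still depends on det(sigma), which is zero for a singular covariance. The sketch below is plain Scala with no Breeze dependency, and `normalizer` is a hypothetical helper that mirrors the expression for `U` in the diff.

```scala
object Normalizer {
  // Mirrors the expression for U in the diff: (2*pi)^(-d/2) * det^(-1/2).
  def normalizer(d: Int, det: Double): Double =
    math.pow(2.0 * math.Pi, -d / 2.0) * math.pow(det, -0.5)
}
```

    With a determinant of zero, `math.pow(det, -0.5)` evaluates to positive infinity, so the constant is no longer finite.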



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092956
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // create and broadcast curried cluster contribution function
    +      val compute = sc.broadcast(computeExpectation(weights, gaussians)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val (logLikelihood, wSums, muSums, sigmaSums) = 
    +        breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      val sumWeights = wSums.sum
    +      for (i <- 0 until k) {
    +        val mu = muSums(i) / wSums(i)
    +        val sigma = sigmaSums(i) / wSums(i) - mu * new Transpose(mu)
    +        weights(i) = wSums(i) / sumWeights
    +        gaussians(i) = new MultivariateGaussian(mu, sigma)
    +      }
    +   
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood(0) // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i).mu)).toArray
    --- End diff --
    
    `Array.tabulate`
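
    A small sketch of what this suggestion amounts to, using a toy index function in place of the Breeze conversion. `Array.tabulate(n)(f)` builds the array directly from an index function, which avoids the intermediate collection created by mapping over a Range and then calling `toArray`.

```scala
object TabulateExample {
  val k = 3
  // The pattern used in the diff: build a Range, map it, then copy to an Array.
  val viaRange: Array[Int] = (0 until k).map(i => i * i).toArray
  // The suggested form: build the Array directly from an index function.
  val viaTabulate: Array[Int] = Array.tabulate(k)(i => i * i)
}
```

    Applied to the quoted line, this would presumably read `Array.tabulate(k)(i => Vectors.fromBreeze(gaussians(i).mu))`.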



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016184
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    --- End diff --
    
    This code could be made easier to read by using an array of weights + an array of MultivariateGaussian instances, rather than a tuple.
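
    A rough sketch of the two layouts being compared. `Gaussian1D` is a hypothetical one-dimensional stand-in for MultivariateGaussian, and the values are made up for illustration; the point is that named fields and a separate weights array read more clearly than positional tuple accessors.

```scala
object MixtureLayout {
  // Hypothetical stand-in for MultivariateGaussian, reduced to one dimension.
  final case class Gaussian1D(mu: Double, sigma2: Double)

  // Tuple layout, as in the diff: fields are reached through _1, _2, _3.
  val asTuples: Array[(Double, Double, Double)] =
    Array((0.25, 0.0, 1.0), (0.75, 5.0, 2.0))

  // Suggested layout: one array of weights plus one array of distributions.
  val weights: Array[Double] = asTuples.map(_._1)
  val gaussians: Array[Gaussian1D] = asTuples.map(t => Gaussian1D(t._2, t._3))
}
```

    With this layout, `gaussians(i).mu` replaces `gaussians(i)._2` at each use site.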



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092919
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector}
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.Matrix
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +/**
    + * Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points 
    + * are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are 
    + * the respective mean and covariance for each Gaussian distribution i=1..k. 
    + * 
    + * @param weight Weights for each Gaussian distribution in the mixture, where weight(i) is
    + *               the weight for Gaussian i, and weight.sum == 1
    + * @param mu Means for each Gaussian in the mixture, where mu(i) is the mean for Gaussian i
    + * @param sigma Covariance matrix for each Gaussian in the mixture, where sigma(i) is the
    + *              covariance matrix for Gaussian i
    + */
    +class GaussianMixtureModel(
    +  val weight: Array[Double], 
    +  val mu: Array[Vector], 
    +  val sigma: Array[Matrix]) extends Serializable {
    +  
    +  /** Number of gaussians in mixture */
    +  def k: Int = weight.length
    +
    +  /** Maps given points to their cluster indices. */
    +  def predict(points: RDD[Vector]): (RDD[Array[Double]],RDD[Int]) = {
    +    val responsibilityMatrix = predictMembership(points,mu,sigma,weight,k)
    +    val clusterLabels = responsibilityMatrix.map(r => r.indexOf(r.max))
    +    (responsibilityMatrix, clusterLabels)
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictMembership(
    --- End diff --
    
    Should it be a private method inside `object GaussianMixtureModel`?
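
One way to read the question above, sketched with hypothetical names (`TinyMixtureModel`, `membership`) and scalar means so that it needs no Spark types. The helper moves into the companion object, is marked private, and receives every input as a parameter. This is an illustration of the pattern, not the change that was eventually made.

```scala
// Hypothetical, simplified model: scalar means only, no Spark types.
class TinyMixtureModel(val weight: Array[Double], val mu: Array[Double]) {
  def k: Int = weight.length

  // The public API delegates to the helper in the companion object.
  def predict(x: Double): Int = {
    val r = TinyMixtureModel.membership(x, weight, mu)
    r.indexOf(r.max)
  }
}

object TinyMixtureModel {
  // Private to the companion: callable from the class above, hidden from users.
  // It takes all inputs as parameters, so it reads no instance state.
  private def membership(x: Double, weight: Array[Double], mu: Array[Double]): Array[Double] = {
    val scores = weight.zip(mu).map { case (w, m) => w * math.exp(-0.5 * (x - m) * (x - m)) }
    val total = scores.sum
    scores.map(_ / total)
  }
}
```

A plausible benefit in the Spark setting is that a function living on an object does not pull the enclosing instance into a closure, though the comment itself does not state the motivation.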



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-62929422
  
    Thanks, @squito ... while I expect the array to only have a few elements, I have made changes according to your advice.




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092941
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    --- End diff --
    
    `Gmm` -> `Model`? There are other algorithms where we use `setInitialModel`.
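
A small sketch of what the renamed setter could look like, assuming the contract described in the doc comment above (the model's `k` must match the trainer's `k`, otherwise an `IllegalArgumentException`). `SketchModel` and `SketchTrainer` are hypothetical, dependency-free stand-ins, not classes from the patch.

```scala
// Hypothetical, dependency-free stand-ins for the model and the trainer.
final case class SketchModel(weight: Array[Double]) {
  def k: Int = weight.length
}

class SketchTrainer(private var k: Int = 2) {
  private var initialModel: Option[SketchModel] = None

  def setK(k: Int): this.type = {
    this.k = k
    this
  }

  // require throws IllegalArgumentException when the condition is false.
  def setInitialModel(model: SketchModel): this.type = {
    require(model.k == k, s"initial model has k = ${model.k}, expected $k")
    initialModel = Some(model)
    this
  }

  def getInitialModel: Option[SketchModel] = initialModel
}
```

Returning `this.type` keeps the setters chainable, as in `new SketchTrainer().setK(3).setInitialModel(...)`.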



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-62313507
  
      [Test build #514 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/514/consoleFull) for PR 3022 at commit [`c15405c`](https://github.com/apache/spark/commit/c15405c78345e9a46549a398c6b59bed80274f9e).
     * This patch **does not merge cleanly**.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016194
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initializing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +      }.toArray
    +      
    +      // create and broadcast curried cluster contribution function
    +      val compute = ctx.broadcast(computeExpectation(weights, dists)_)
    +      
    +      // aggregate the cluster contribution for all sample points
    +      val sums = breezeData.aggregate(zeroExpectationSum(k, d))(compute.value, addExpectationSums)
    +      
    +      // Assignments to make the code more readable
    +      val logLikelihood = sums._1(0)
    +      val W = sums._2
    +      val MU = sums._3
    +      val SIGMA = sums._4
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      gaussians = (0 until k).map{ i => 
    +        val weight = W(i) / W.sum
    +        val mu = MU(i) / W(i)
    +        val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +        (weight, mu, sigma)
    +      }.toArray
    +      
    +      llhp = llh // current becomes previous
    +      llh = logLikelihood // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > convergenceTol)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(gaussians(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(gaussians(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +    
    +  /** Average of dense breeze vectors */
    +  private def vectorMean(x: Array[DenseDoubleVector]): DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    x.foreach(xi => v += xi)
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def initCovariance(x: Array[DenseDoubleVector]): DenseDoubleMatrix = {
    +    val mu = vectorMean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val cov = BreezeMatrix.eye[Double](ss.length)
    +    x.map(xi => (xi - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => cov(i,i) = ss(i) / x.length)
    +    cov
    +  }
    +  
    +  /**
    +   * Given the input vectors, return the membership value of each vector
    +   * to all mixture components. 
    +   */
    +  def predictClusters(points: RDD[Vector], mu: Array[Vector], sigma: Array[Matrix],
    --- End diff --
    
    Scala style (1 argument per line for a method declaration which won't fit on 1 line)
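
For illustration, a declaration laid out the way the comment describes: one parameter per line, each indented four spaces, with the body indented two. The method name, the `Points` alias, and the parameter list are hypothetical placeholders chosen so the sketch compiles without Spark; the actual signature is truncated in the diff and is not reproduced here.

```scala
object StyleSketch {
  // Stand-in type so the sketch compiles without Spark on the classpath.
  type Points = Seq[Array[Double]]

  // Declaration too long for one line: one parameter per line, 4-space indent,
  // with the return type after the closing parenthesis.
  def membershipSketch(
      points: Points,
      mu: Array[Array[Double]],
      weight: Array[Double],
      k: Int): Seq[Array[Double]] = {
    // Placeholder body: returns the first k prior weights for every point.
    points.map(_ => weight.take(k))
  }
}
```

The later revision of `computeExpectation` quoted elsewhere in this thread already uses the same layout.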



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22084185
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <convergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    --- End diff --
    
    Ok.  I'm not even sure this cache() makes sense, since within the algorithm the vectors are converted to breeze vectors and that is the RDD operated on.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67582826
  
    Excellent.  100 features is probably a bit of a stretch for the algorithm... the density at any point (especially with respect to the initial random Gaussians) is going to be minuscule, possibly less than machine precision.  This is probably why the L2 cost of the models never changed despite the increasing number of clusters.
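
A rough numerical check of the point above, as a sketch with assumptions stated up front: unit-covariance Gaussians, a sample that sits one unit from the mean in every coordinate, and the hypothetical names `DensitySketch`, `logPdf`, and `responsibilities`. Under those assumptions the log-density in 100 dimensions is about -141.9, so the density is on the order of 1e-62, far below the `eps = 2^-52` that the patch adds to each term.

```scala
object DensitySketch {
  // Same constant as `eps` in the patch: 2^-52, double-precision machine epsilon.
  val eps: Double = math.pow(2.0, -52)

  // Log-density of a unit-covariance Gaussian N(mean, I) at x.
  def logPdf(x: Array[Double], mean: Array[Double]): Double = {
    val d = x.length
    val sqDist = x.zip(mean).map { case (a, b) => (a - b) * (a - b) }.sum
    -0.5 * d * math.log(2.0 * math.Pi) - 0.5 * sqDist
  }

  // Responsibilities formed the way computeExpectation does it:
  // eps + weight * pdf for each component, then normalized to sum to 1.
  def responsibilities(
      x: Array[Double],
      weights: Array[Double],
      means: Array[Array[Double]]): Array[Double] = {
    val p = weights.zip(means).map { case (w, m) => eps + w * math.exp(logPdf(x, m)) }
    val pSum = p.sum
    p.map(_ / pSum)
  }
}
```

When every `weight * pdf` term is far smaller than `eps`, each entry of `p` collapses to `eps` and the responsibilities come out uniform no matter how near or far the component is, which would be consistent with a cost that does not move as clusters are added. Working with log-densities throughout (a log-sum-exp normalization) is a common remedy, though it is not what the patch does.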




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016170
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,56 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    --- End diff --
    
    Please add documentation similar to other examples (e.g., DenseKMeans.scala)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655831
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    +    (0 until x.length).map(i => (x(i) - mu) :^ 2.0).foreach(u => ss += u)
    +    (0 until ss.length).foreach(i => result(i,i) = ss(i) / x.length)
    +    result
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Vectors */
    +  private object DenseDoubleVectorAccumulatorParam extends AccumulatorParam[DenseDoubleVector] {
    +    def zero(initialVector : DenseDoubleVector) : DenseDoubleVector = {
    +      BreezeVector.zeros[Double](initialVector.length)
    +    }
    +    
    +    def addInPlace(a : DenseDoubleVector, b : DenseDoubleVector) : DenseDoubleVector = {
    +      a += b
    +    }
    +  }
    +  
    +  /** AccumulatorParam for Dense Breeze Matrices */
    +  private object DenseDoubleMatrixAccumulatorParam extends AccumulatorParam[DenseDoubleMatrix] {
    +    def zero(initialVector : DenseDoubleMatrix) : DenseDoubleMatrix = {
    --- End diff --
    
    No space before ":" in method declarations (here and elsewhere).
    
    Also, rename "initialVector" to "initialMatrix".
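
For illustration, the two suggestions applied to the matrix accumulator might look like the following minimal sketch. To keep it self-contained it does not depend on Spark or Breeze: `AccumParam` is a hypothetical stand-in for Spark's `AccumulatorParam`, `DenseMatrixAccumParam` is a hypothetical name, and the matrix is a plain `Array[Array[Double]]` rather than a Breeze `DenseMatrix`.

```scala
// Hypothetical stand-in for Spark's AccumulatorParam, so the sketch compiles without Spark.
trait AccumParam[T] {
  def zero(initialValue: T): T
  def addInPlace(a: T, b: T): T
}

// Assumption: a plain nested array replaces the Breeze matrix used in the PR.
object DenseMatrixAccumParam extends AccumParam[Array[Array[Double]]] {
  // Style: no space before ":", and the parameter is named for what it holds.
  def zero(initialMatrix: Array[Array[Double]]): Array[Array[Double]] =
    initialMatrix.map(row => new Array[Double](row.length))

  // Adds b into a element by element and returns a (a is mutated).
  def addInPlace(a: Array[Array[Double]], b: Array[Array[Double]]): Array[Array[Double]] = {
    for (i <- a.indices; j <- a(i).indices) a(i)(j) += b(i)(j)
    a
  }
}
```

The behavior is unchanged by either suggestion; only the declaration style and the parameter name differ from the quoted diff.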



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3022



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21683119
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    +    do {
    +      // reset accumulators
    +      for(i <- 0 until k){
    +        acc_w(i)     = ctx.accumulator(0.0)
    +        acc_mu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        acc_sigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val log_likelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map(i => 
    +                                  new MultivariateGaussian(C(i)._2, C(i)._3)).toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => C(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    +        val p = (0 until k).map(i => 
    +          eps + weights.value(i) * dists.value(i).pdf(x)).toArray
    +        val norm = sum(p)
    +        
    +        log_likelihood += math.log(norm)  
    +          
    +        // accumulate weighted sums  
    +        val xxt = x * new Transpose(x)
    +        for(i <- 0 until k){
    +          p(i) /= norm
    +          acc_w(i) += p(i)
    +          acc_mu(i) += x * p(i)
    +          acc_sigma(i) += xxt * p(i)
    +        }  
    +      })
    +      
    +      // Collect the computed sums
    +      val W = (0 until k).map(i => acc_w(i).value).toArray
    +      val MU = (0 until k).map(i => acc_mu(i).value).toArray
    +      val SIGMA = (0 until k).map(i => acc_sigma(i).value).toArray
    +      
    +      // Create new distributions based on the partial assignments
    +      // (often referred to as the "M" step in literature)
    +      C = (0 until k).map(i => {
    +            val weight = W(i) / sum(W)
    +            val mu = MU(i) / W(i)
    +            val sigma = SIGMA(i) / W(i) - mu * new Transpose(mu)
    +            (weight, mu, sigma)
    +          }).toArray
    +      
    +      llhp = llh; // current becomes previous
    +      llh = log_likelihood.value // this is the freshly computed log-likelihood
    +      iter += 1
    +    } while(iter < maxIterations && Math.abs(llh-llhp) > delta)
    +    
    +    // Need to convert the breeze matrices to MLlib matrices
    +    val weights = (0 until k).map(i => C(i)._1).toArray
    +    val means   = (0 until k).map(i => Vectors.fromBreeze(C(i)._2)).toArray
    +    val sigmas  = (0 until k).map(i => Matrices.fromBreeze(C(i)._3)).toArray
    +    new GaussianMixtureModel(weights, means, sigmas)
    +  }
    +  
    +  /** Sum the values in array of doubles */
    +  private def sum(x : Array[Double]) : Double = {
    +    var s : Double = 0.0
    +    (0 until x.length).foreach(j => s += x(j))
    +    s
    +  }
    +  
    +  /** Average of dense breeze vectors */
    +  private def vec_mean(x : Array[DenseDoubleVector]) : DenseDoubleVector = {
    +    val v = BreezeVector.zeros[Double](x(0).length)
    +    (0 until x.length).foreach(j => v += x(j))
    +    v / x.length.asInstanceOf[Double] 
    +  }
    +  
    +  /**
    +   * Construct matrix where diagonal entries are element-wise
    +   * variance of input vectors (computes biased variance)
    +   */
    +  private def init_cov(x : Array[DenseDoubleVector]) : DenseDoubleMatrix = {
    +    val mu = vec_mean(x)
    +    val ss = BreezeVector.zeros[Double](x(0).length)
    +    val result = BreezeMatrix.eye[Double](ss.length)
    +    (0 until x.length).map(i => (x(i) - mu) :^ 2.0).foreach(u => ss += u)
    --- End diff --
    
    Here again, the sum method on an array of vectors fails to compile.
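
An explicit fold sidesteps the need for a `sum` over an array of vectors. The sketch below is illustrative only: it uses plain `Array[Double]` in place of Breeze vectors so that it stands alone, and `VectorStats`, `vecSum`, `vecMean`, and `diagVariance` are hypothetical names. It computes the same quantities as `vec_mean` and the diagonal of `init_cov` in the quoted diff (biased variance, dividing by n).

```scala
object VectorStats {
  // Assumption: plain arrays stand in for Breeze DenseVector[Double].
  type Vec = Array[Double]

  // Element-wise sum of equal-length vectors, written as a fold over the array.
  def vecSum(xs: Array[Vec]): Vec =
    xs.foldLeft(new Array[Double](xs(0).length)) { (acc, x) =>
      var j = 0
      while (j < acc.length) { acc(j) += x(j); j += 1 }
      acc
    }

  // Element-wise mean of the input vectors.
  def vecMean(xs: Array[Vec]): Vec = vecSum(xs).map(_ / xs.length)

  // Biased per-coordinate variance: the diagonal of the initial covariance matrix.
  def diagVariance(xs: Array[Vec]): Vec = {
    val mu = vecMean(xs)
    val squaredDeviations =
      xs.map(x => x.zip(mu).map { case (xi, mi) => (xi - mi) * (xi - mi) })
    vecSum(squaredDeviations).map(_ / xs.length)
  }
}
```

With Breeze vectors the same shape should carry over by replacing the inner loop with `acc += x`, which is what the quoted `vec_mean` already does.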



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655802
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModel
    --- End diff --
    
    No need for this import.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092938
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    --- End diff --
    
    Minor: the implementation in this block allocates unnecessary temporary memory. For example, this is a rank-1 update, so computing `xxt` allocates a d-by-d temporary for every point; we can use `BLAS.dsyr` instead. Another optimization is packing the covariance matrix into upper-triangular form. These optimizations are not necessary in this PR. Could you leave a TODO?
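
To illustrate the suggested optimization, here is a minimal pure-Scala sketch of a symmetric rank-1 update, A := A + alpha * x * x^T, applied in place to a covariance accumulator stored in packed upper-triangular, column-major form. It performs the same arithmetic that a BLAS symmetric rank-1 routine would, but it is not a call into BLAS, and the names `PackedSym`, `packedSize`, `syr`, and `unpack` are hypothetical.

```scala
object PackedSym {
  // Number of stored entries for a d x d symmetric matrix.
  def packedSize(d: Int): Int = d * (d + 1) / 2

  // In-place A := A + alpha * x * x^T on the packed upper triangle.
  // Entry (i, j) with i <= j lives at index i + j * (j + 1) / 2.
  // No d x d temporary is allocated.
  def syr(alpha: Double, x: Array[Double], a: Array[Double]): Unit = {
    val d = x.length
    require(a.length == packedSize(d), "packed array has wrong length")
    var idx = 0
    var j = 0
    while (j < d) {
      val axj = alpha * x(j)
      var i = 0
      while (i <= j) {
        a(idx) += x(i) * axj
        idx += 1
        i += 1
      }
      j += 1
    }
  }

  // Expand the packed form to a full symmetric matrix (e.g. once per iteration).
  def unpack(a: Array[Double], d: Int): Array[Array[Double]] = {
    val full = Array.ofDim[Double](d, d)
    for (j <- 0 until d; i <- 0 to j) {
      val v = a(i + j * (j + 1) / 2)
      full(i)(j) = v
      full(j)(i) = v
    }
    full
  }
}
```

In the E-step quoted above, `alpha` would play the role of `p(i)` and `x` the data point, so each component's accumulator is updated without materializing `xxt`. Packing also roughly halves the size of the covariance sums that have to be combined across partitions.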




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22084037
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,244 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    --- End diff --
    
    Lol. Actually, this is a hack because just putting a Double here does not allow an in-place update of the tuple; I would have to create a new tuple out of the sum arrays and the combined double value. I know it is kind of ugly, but I suspect it is more performant than creating a new tuple each time.  The alternative would be to make ExpectationSum an actual class...
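
A rough sketch of the alternative mentioned above, making the expectation sum an actual class so the log-likelihood can be a plain mutable field rather than a one-element array. The class name `ExpectationSumSketch` and the method `merge` are hypothetical; plain arrays stand in for the Breeze vectors, and the covariance matrices are left out to keep the sketch short.

```scala
final class ExpectationSumSketch(k: Int, d: Int) extends Serializable {
  // A var field can be updated in place; a Double inside a tuple cannot.
  var logLikelihood: Double = 0.0
  val weights: Array[Double] = new Array[Double](k)
  val means: Array[Array[Double]] = Array.fill(k)(new Array[Double](d))

  /** Add another partial sum into this one in place and return this instance. */
  def merge(other: ExpectationSumSketch): this.type = {
    logLikelihood += other.logLikelihood
    var i = 0
    while (i < k) {
      weights(i) += other.weights(i)
      var j = 0
      while (j < d) {
        means(i)(j) += other.means(i)(j)
        j += 1
      }
      i += 1
    }
    this
  }
}
```

Because `merge` returns the receiver, no new object is created per combination step, which is the same saving the one-element array was meant to achieve.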



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092933
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    --- End diff --
    
    `IndexedSeq` should be sufficient.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by FlytxtRnD <gi...@git.apache.org>.

Github user FlytxtRnD commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67816287
  
    Sorry for the late reply. predictLabels() and predictMembership() look fine. But what about moving computeSoftAssignments() to the GaussianMixtureModelEM class? (In KMeans, findClosest() is defined in KMeans rather than in KMeansModel.)
    
    It would be good if the GaussianMixtureModelEM class were renamed as @mengxr suggested.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655819
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    +  
    +  // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setDelta(delta: Double): this.type = {
    +    this.delta = delta
    +    this
    +  }
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map{ u => u.toBreeze.toDenseVector }.cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // C will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var C = (0 until k).map(i => (1.0/k, 
    +                                  vec_mean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  init_cov(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                           ).toArray
    +    
    +    val acc_w     = new Array[Accumulator[Double]](k)
    +    val acc_mu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val acc_sigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var i, iter = 0
    --- End diff --
    
    i is not used (The for loops have implicit declarations of other "i" vars.)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67717252
  
    I've performed most of the requested changes.  I do not see the BLAS function mentioned (dsyr), so I left this as a TODO.  Also, I could not find EPSILON in MLUtils.  
    
    I left predictMembership public and changed predict to predictLabels, providing soft and hard label assignments, respectively.  I know there are some other thoughts around improving these, but I am not clear on what I should do.
    
    cc: @mengxr @jkbradley 



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655816
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala ---
    @@ -0,0 +1,283 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, inv}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * Expectation-Maximization for multivariate Gaussian Mixture Models.
    + * 
    + */
    +object GMMExpectationMaximization {
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k)
    +      .setMaxIterations(maxIterations)
    +      .setDelta(delta)
    +      .run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param maxIterations the maximum number of iterations to perform
    +   */
    +  def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   * @param delta change in log-likelihood at which convergence is considered achieved
    +   */
    +  def train(data: RDD[Vector], k: Int, delta: Double): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).setDelta(delta).run(data)
    +  }
    +  
    +  /**
    +   * Trains a GMM using the given parameters
    +   * 
    +   * @param data training points stored as RDD[Vector]
    +   * @param k the number of Gaussians in the mixture
    +   */
    +  def train(data: RDD[Vector], k: Int): GaussianMixtureModel = {
    +    new GMMExpectationMaximization().setK(k).run(data)
    +  }
    +}
    +
    +/**
    + * This class performs multivariate Gaussian expectation maximization.  It will 
    + * maximize the log-likelihood for a mixture of k Gaussians, iterating until
    + * the log-likelihood changes by less than delta, or until it has reached
    + * the max number of iterations.  
    + */
    +class GMMExpectationMaximization private (
    +    private var k: Int, 
    +    private var delta: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5;
    --- End diff --
    
    no semicolon



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083546
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <covergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    --- End diff --
    
    Make private



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-62256712
  
    This test appeared to fail due to some form of timeout during the pull; is there any action I need to take?



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-68307573
  
      [Test build #555 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/555/consoleFull) for PR 3022 at commit [`aaa8f25`](https://github.com/apache/spark/commit/aaa8f25a579d9c9aa191734377b503fb73299b78).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class GaussianMixtureModel(`




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67097150
  
    Ok, I will look into swapping the accumulators out for aggregate().  In the meantime I have worked to correct some of the style issues.
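
For readers unfamiliar with the pattern, here is a small sketch of the shape an `aggregate()`-based reduction takes: a zero value, a `seqOp` that folds one data point into a partial result, and a `combOp` that merges two partial results. Everything here is illustrative; `AggregateSketch`, `Stats`, and `aggregateLocally` are hypothetical names, a (sum, count) pair stands in for the EM sufficient statistics, and nested local sequences stand in for RDD partitions so the sketch runs without Spark.

```scala
object AggregateSketch {
  // (sum of values, number of values): a stand-in for the per-iteration EM sums.
  type Stats = (Double, Long)

  val zero: Stats = (0.0, 0L)

  /** Fold one data point into a partial result (runs within a partition). */
  def seqOp(acc: Stats, x: Double): Stats = (acc._1 + x, acc._2 + 1L)

  /** Merge two partial results (runs across partitions). */
  def combOp(a: Stats, b: Stats): Stats = (a._1 + b._1, a._2 + b._2)

  /** Local stand-in for rdd.aggregate(zero)(seqOp, combOp). */
  def aggregateLocally(partitions: Seq[Seq[Double]]): Stats =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
}
```

With an actual RDD the call would be `data.aggregate(zero)(seqOp, combOp)`, which returns the combined value directly to the driver instead of routing it through one accumulator per mixture component.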



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092927
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    --- End diff --
    
    Do these aliases simplify any code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092963
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.impl
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, pinv}
    +
    +/** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +private[mllib] class MultivariateGaussian(
    +    val mu: BreezeVector[Double], 
    +    val sigma: BreezeMatrix[Double]) extends Serializable {
    +  private val sigmaInv2 = pinv(sigma) * -0.5
    +  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
    +    
    +  def pdf(x: BreezeVector[Double]): Double = {
    --- End diff --
    
    Need doc.
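
For reference when documenting `pdf`: the body of the method is not shown in this diff, but the constants `U` and `sigmaInv2` above suggest it evaluates the standard multivariate Gaussian density, with the pseudo-inverse standing in for the inverse. A sketch of that formula, assuming d = mu.length and writing \Sigma^{+} for pinv(sigma):

```latex
p(x) = \underbrace{(2\pi)^{-d/2}\,\det(\Sigma)^{-1/2}}_{U}\;
       \exp\!\Big( (x-\mu)^{\mathsf T}\,
       \underbrace{\big(-\tfrac{1}{2}\,\Sigma^{+}\big)}_{\texttt{sigmaInv2}}\,(x-\mu) \Big)
```

When \Sigma is invertible, \Sigma^{+} = \Sigma^{-1} and this is the usual density; when it is singular, \det(\Sigma) = 0 and the normalizing constant U is no longer finite.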



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655804
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModel
    +import org.apache.spark.mllib.clustering.GMMExpectationMaximization
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if( args.length != 3 ) {
    --- End diff --
    
    Scala style: use this spacing: "if (args.length != 3)"



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860757
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,234 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map( u => u.toBreeze.toDenseVector ).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // For each Gaussian, we will initialize the mean as the average
    +    // of some random samples from the data
    +    val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = (0 until k).map{ i => (1.0 / k, 
    +                                  vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +                                  initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +                                  }.toArray
    +    
    +    val accW     = new Array[Accumulator[Double]](k)
    +    val accMu    = new Array[Accumulator[DenseDoubleVector]](k)
    +    val accSigma = new Array[Accumulator[DenseDoubleMatrix]](k)
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // reset accumulators
    +      for (i <- 0 until k) {
    +        accW(i)     = ctx.accumulator(0.0)
    +        accMu(i)    = ctx.accumulator(
    +                      BreezeVector.zeros[Double](d))(DenseDoubleVectorAccumulatorParam)
    +        accSigma(i) = ctx.accumulator(
    +                      BreezeMatrix.zeros[Double](d,d))(DenseDoubleMatrixAccumulatorParam)
    +      }
    +      
    +      val logLikelihood = ctx.accumulator(0.0)
    +            
    +      // broadcast the current weights and distributions to all nodes
    +      val dists = ctx.broadcast((0 until k).map{ i => 
    +                                  new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    +                                }.toArray)
    +      val weights = ctx.broadcast((0 until k).map(i => gaussians(i)._1).toArray)
    +      
    +      // calculate partial assignments for each sample in the data
    +      // (often referred to as the "E" step in literature)
    +      breezeData.foreach(x => {  
    --- End diff --
    
    Use "{" instead of "(" for multi-line foreach/map/etc. calls:
    ```
    breezeData.foreach { x =>
    ```
    Also, no need for "{" to wrap block on right-hand side of "=>"
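
As an illustration of the style point above (not code from the PR), a minimal sketch of the brace form for a multi-line closure. The object and method names are hypothetical, and a plain Seq stands in for the RDD so the example runs without Spark:

```scala
object BraceStyle {
  // Sum of squares, written with the brace form for a multi-line closure.
  def sumOfSquares(xs: Seq[Double]): Double = {
    var total = 0.0
    // Braces open the closure; no extra braces are needed after "=>".
    xs.foreach { x =>
      val sq = x * x
      total += sq
    }
    total
  }
}
```

The same layout applies to map, flatMap, and other higher-order calls whose body spans several lines.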



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22083547
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <covergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +      Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +    }.cache
    --- End diff --
    
    FYI: It's preferable to use parentheses after calls like cache(), which have side effects.  (Some IDEs warn about this.)  Ditto for collect and println.
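
A small sketch of the convention described above, using a hypothetical Counter class rather than anything from the PR: methods with side effects are declared and called with parentheses, while pure accessors omit them.

```scala
object ParenStyle {
  final class Counter {
    private var n = 0
    // Side effect (mutates state): declared and called with ().
    def increment(): Unit = { n += 1 }
    // Pure accessor: no parentheses.
    def value: Int = n
  }

  // Calls increment() the requested number of times and reads the result.
  def countTo(times: Int): Int = {
    val c = new Counter
    (0 until times).foreach { _ => c.increment() }
    c.value
  }
}
```

Under this convention the example in the diff would end with `.cache()` rather than `.cache`.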



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22092952
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,248 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +
    +import scala.collection.mutable.IndexedSeqView
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  private type VectorArrayView = IndexedSeqView[DenseDoubleVector, Array[DenseDoubleVector]]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    var i = 0
    +    while (i < m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +      i = i + 1
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(
    +      weights: Array[Double], 
    +      dists: Array[MultivariateGaussian])
    +      (sums: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = sums._2.length
    +    val p = weights.zip(dists).map { case (weight, dist) => eps + weight * dist.pdf(x) }
    +    val pSum = p.sum
    +    sums._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    var i = 0
    +    while (i < k) {
    +      p(i) /= pSum
    +      sums._2(i) += p(i)
    +      sums._3(i) += x * p(i)
    +      sums._4(i) += xxt * p(i)
    +      i = i + 1
    +    }
    +    sums
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization.
    +   *  You must call setK() prior to calling this method, and the condition
    +   *  (gmm.k == this.k) must be met; failure will result in an IllegalArgumentException
    +   */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val sc = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // Determine initial weights and corresponding Gaussians.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples    
    +    val (weights, gaussians) = initialGmm match {
    +      case Some(gmm) => (gmm.weight, gmm.mu.zip(gmm.sigma).map{ case(mu, sigma) => 
    +        new MultivariateGaussian(mu.toBreeze.toDenseVector, sigma.toBreeze.toDenseMatrix) 
    +      }.toArray)
    +      
    +      case None => {
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +        (Array.fill[Double](k)(1.0 / k), (0 until k).map{ i => 
    +          val slice = samples.view(i * nSamples, (i + 1) * nSamples)
    +          new MultivariateGaussian(vectorMean(slice), initCovariance(slice)) 
    +        }.toArray)  
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    --- End diff --
    
    Shall we use `while () { ... }` instead? We may want to run zero iterations just to test the initialization.
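
To make the difference concrete, here is a minimal sketch of the loop written with `while`. It is not the PR's code: `iterate` and `step` are hypothetical names, and only the `llh`/`llhPrev` initialization mirrors the diff. With a `do { ... } while` loop the body always runs once; with `while`, setting maxIterations to 0 returns the initial state untouched.

```scala
object EmLoop {
  /**
   * Repeats `step` until the log-likelihood changes by at most `tol`
   * or `maxIterations` is reached. Returns (iterations run, final state).
   */
  def iterate[S](init: S, maxIterations: Int, tol: Double)
      (step: S => (S, Double)): (Int, S) = {
    var state = init
    var llh = Double.MinValue // current log-likelihood
    var llhPrev = 0.0         // previous log-likelihood
    var iter = 0
    // The condition is checked before the first pass, so zero iterations are possible.
    while (iter < maxIterations && math.abs(llh - llhPrev) > tol) {
      llhPrev = llh
      val (next, newLlh) = step(state)
      state = next
      llh = newLlh
      iter += 1
    }
    (iter, state)
  }
}
```

For example, `EmLoop.iterate(model, 0, 0.01)(emStep)` would hand back `model` exactly as initialized, which is what a test of the initialization needs.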



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22016189
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    This is brittle since it fails when the covariance matrix is not full rank.  I'd say any of these is acceptable for now:
    * Temporary fix: Check for an exception and print a warning about the data not being full rank.
    * OK fix: Add a little epsilon smoothing.
    * Actual fix: Do a matrix decomposition (like Cholesky) instead of a direct inversion to handle non-full rank matrices.
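
    A minimal sketch of the "epsilon smoothing" option, assuming plain Scala arrays in place of the Breeze matrices the PR uses. The names `covariance2d`, `det2` and `smooth` are hypothetical helpers, not part of the patch. The three points are the ones from the PR's "single cluster" test; they are collinear, so their covariance has determinant zero until a small value is added to the diagonal.

```scala
// Sketch only: plain Scala arrays stand in for the Breeze matrices used in the PR.
object CovarianceSmoothing {
  type Matrix = Array[Array[Double]]

  /** Population covariance (divide by n) of a set of 2-D points. */
  def covariance2d(points: Seq[(Double, Double)]): Matrix = {
    val n = points.length.toDouble
    val mx = points.map(_._1).sum / n
    val my = points.map(_._2).sum / n
    val sxx = points.map(p => (p._1 - mx) * (p._1 - mx)).sum / n
    val syy = points.map(p => (p._2 - my) * (p._2 - my)).sum / n
    val sxy = points.map(p => (p._1 - mx) * (p._2 - my)).sum / n
    Array(Array(sxx, sxy), Array(sxy, syy))
  }

  /** Determinant of a 2x2 matrix. */
  def det2(m: Matrix): Double = m(0)(0) * m(1)(1) - m(0)(1) * m(1)(0)

  /** Add eps to every diagonal entry (hypothetical helper, not from the PR). */
  def smooth(m: Matrix, eps: Double): Matrix =
    Array.tabulate(m.length, m.length)((i, j) => if (i == j) m(i)(j) + eps else m(i)(j))
}
```

    The size of eps is a judgment call: it has to be large enough to make the matrix invertible in floating point, yet small enough not to distort the fitted covariance.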



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67422512
  
    Thanks for the updates!  I've started running some quick tests, and it's made me think of some more items.  I've added some inline comments as well; the main issue I’ve run into is the failure for covariance matrices which are not full rank.
    
    The new prediction methods look useful.  How would you feel about this set of methods:
    * predict(): predict best cluster as an Int for each data point (same as in KMeans)
    * predictMembership(): predict membership in all clusters as a Vector for each data point (predictClusters is more ambiguous, IMO.)
    
    Also, can the prediction code be moved to within the model class?
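
    The two proposed methods could look roughly like the sketch below. It assumes a hypothetical `SimpleGmm` class over univariate Gaussians and single points, purely to keep the example self-contained; the PR itself works with multivariate Gaussians and RDDs, and the final method names were still under discussion.

```scala
object GmmPredictionSketch {
  /** Univariate Gaussian density; the PR works with multivariate Gaussians. */
  def pdf(x: Double, mean: Double, variance: Double): Double =
    math.exp(-0.5 * (x - mean) * (x - mean) / variance) / math.sqrt(2.0 * math.Pi * variance)

  /** Hypothetical model class: names mirror the proposal, not a merged API. */
  case class SimpleGmm(weights: Array[Double], means: Array[Double], variances: Array[Double]) {
    val k: Int = weights.length

    /** Soft assignment: posterior probability of each cluster for x. */
    def predictMembership(x: Double): Array[Double] = {
      val p = Array.tabulate(k)(i => weights(i) * pdf(x, means(i), variances(i)))
      val total = p.sum
      p.map(_ / total)
    }

    /** Hard assignment: index of the most probable cluster. */
    def predict(x: Double): Int = {
      val membership = predictMembership(x)
      membership.indexOf(membership.max)
    }
  }
}
```

    Keeping both methods on the model means predict() can be defined in terms of predictMembership(), so the hard and soft assignments cannot disagree.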



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21655840
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximizationSuite.scala ---
    @@ -0,0 +1,44 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    +import org.apache.spark.mllib.util.{LocalClusterSparkContext, MLlibTestSparkContext}
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class GMMExpectationMaximizationSuite extends FunSuite with MLlibTestSparkContext {
    +  test("single cluster") {
    +    val data = sc.parallelize(Array(
    +        Vectors.dense(6.0, 9.0),
    +        Vectors.dense(5.0, 10.0),
    +        Vectors.dense(4.0, 11.0)
    +      ))
    +    
    +    // expectations
    +    val Ew = 1.0;
    --- End diff --
    
    no need for semicolon



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22066566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala ---
    @@ -0,0 +1,39 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.stat.impl
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.{Transpose, det, pinv}
    +
    +/** 
    +   * Utility class to implement the density function for multivariate Gaussian distribution.
    +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    +   * so this class is here so-as to not introduce a new dependency in Spark.
    +   */
    +private[mllib] class MultivariateGaussian(
    +    val mu: BreezeVector[Double], 
    +    val sigma: BreezeMatrix[Double]) extends Serializable {
    +  private val sigmaInv2 = pinv(sigma) * -0.5
    +  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
    --- End diff --
    
    Agreed.  This can be part of bringing MultivariateGaussian to the public scope.
    
    
    > On Dec 18, 2014, at 1:43 PM, jkbradley <no...@github.com> wrote:
    > 
    > In mllib/src/main/scala/org/apache/spark/mllib/stat/impl/MultivariateGaussian.scala:
    > 
    > > +
    > > +package org.apache.spark.mllib.stat.impl
    > > +
    > > +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    > > +import breeze.linalg.{Transpose, det, pinv}
    > > +
    > > +/** 
    > > +   * Utility class to implement the density function for multivariate Gaussian distribution.
    > > +   * Breeze provides this functionality, but it requires the Apache Commons Math library,
    > > +   * so this class is here so-as to not introduce a new dependency in Spark.
    > > +   */
    > > +private[mllib] class MultivariateGaussian(
    > > +    val mu: BreezeVector[Double], 
    > > +    val sigma: BreezeMatrix[Double]) extends Serializable {
    > > +  private val sigmaInv2 = pinv(sigma) * -0.5
    > > +  private val U = math.pow(2.0 * math.Pi, -mu.length / 2.0) * math.pow(det(sigma), -0.5)
    > By the way, det and pinv are factorizing the matrix twice. It would be better to do one factorization (like SVD) and then compute the det and inv from it. We can do that in a follow-up PR though.
    > 
    > —
    > Reply to this email directly or view it on GitHub.
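
    To illustrate the "factorize once" idea from the quoted comment, here is a sketch that derives both the log-determinant and the quadratic form x^T sigma^{-1} x from a single factorization. Note the swap: the comment suggests something like SVD, while this sketch uses a Cholesky factorization for brevity, which assumes sigma is symmetric positive definite and therefore does not cover the rank-deficient case. All names are hypothetical and plain Scala arrays stand in for Breeze types.

```scala
object SingleFactorizationSketch {
  type Matrix = Array[Array[Double]]

  /** Lower-triangular L with L * L^T = a; requires a symmetric positive definite. */
  def cholesky(a: Matrix): Matrix = {
    val n = a.length
    val l = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      val s = (0 until j).map(p => l(i)(p) * l(j)(p)).sum
      if (i == j) {
        val d = a(i)(i) - s
        require(d > 0.0, "matrix is not positive definite")
        l(i)(j) = math.sqrt(d)
      } else {
        l(i)(j) = (a(i)(j) - s) / l(j)(j)
      }
    }
    l
  }

  /** log det(a) read off the factor: 2 * sum(log L_ii). */
  def logDet(l: Matrix): Double = 2.0 * l.indices.map(i => math.log(l(i)(i))).sum

  /** v^T a^{-1} v, computed by solving L y = v (forward substitution) and returning y . y. */
  def quadForm(l: Matrix, v: Array[Double]): Double = {
    val n = v.length
    val y = new Array[Double](n)
    for (i <- 0 until n) {
      val s = (0 until i).map(p => l(i)(p) * y(p)).sum
      y(i) = (v(i) - s) / l(i)(i)
    }
    y.map(t => t * t).sum
  }
}
```

    Working with the log-determinant rather than det(sigma) itself also avoids overflow or underflow when the dimension is large.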



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67527000
  
    OK, that sounds good.  Feel free to make a JIRA for that issue.  Thanks for the updates!  I'll take a look.



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-62316919
  
      [Test build #514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/514/consoleFull) for   PR 3022 at commit [`c15405c`](https://github.com/apache/spark/commit/c15405c78345e9a46549a398c6b59bed80274f9e).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `class GaussianMixtureModel(val w: Array[Double], val mu: Array[Vector], val sigma: Array[Matrix]) `




[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r21860764
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  def run(inputFile: String, k: Int, convergenceTol: Double) {
    +    val conf = new SparkConf().setAppName("Spark EM Sample")
    +    val ctx  = new SparkContext(conf)
    +    
    +    val data = ctx.textFile(inputFile).map{ line =>
    +        Vectors.dense(line.trim.split(' ').map(_.toDouble))
    +      }.cache()
    +      
    +    val clusters = new GaussianMixtureModelEM()
    +                        .setK(k)
    --- End diff --
    
    indent +4 spaces only



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-61151053
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3022#issuecomment-67880399
  
    @tgaloppo  MLUtils.EPSILON is actually private[util].  I think it would be fine to change it to be private[mllib]. CC: @mengxr 
    
    @tgaloppo I strongly recommend predict() instead of predictLabels() to be consistent with KMeansModel.
    
    @FlytxtRnD computeSoftAssignments() is a function of the model, not the learning algorithm, so I think it belongs in the model.  IMO, findClosest() should be in KMeansModel instead of KMeans, but that should be fixed in another PR.  (It is not too important though since it is a private[mllib] API.)



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by FlytxtRnD <gi...@git.apache.org>.

Github user FlytxtRnD commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22163213
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseGmmEM.scala ---
    @@ -0,0 +1,65 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.mllib.clustering.GaussianMixtureModelEM
    +import org.apache.spark.mllib.linalg.Vectors
    +
    +/**
    + * An example Gaussian Mixture Model EM app. Run with
    + * {{{
    + * ./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM <input> <k> <covergenceTol>
    + * }}}
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object DenseGmmEM {
    +  def main(args: Array[String]): Unit = {
    +    if (args.length != 3) {
    +      println("usage: DenseGmmEM <input file> <k> <convergenceTol>")
    +    } else {
    +      run(args(0), args(1).toInt, args(2).toDouble)
    +    }
    +  }
    +
    +  private def run(inputFile: String, k: Int, convergenceTol: Double) {
    --- End diff --
    
    Can we take maxIterations as an optional input parameter?
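
    One possible shape for this, as a sketch: accept either three or four arguments and fall back to a default when maxIterations is omitted. `Params`, `parse` and `DefaultMaxIterations` are hypothetical names; the default of 100 matches the default documented on GaussianMixtureModelEM in this PR.

```scala
object DenseGmmArgs {
  /** Hypothetical holder for the parsed command line. */
  case class Params(inputFile: String, k: Int, convergenceTol: Double, maxIterations: Int)

  /** Same default as the no-argument constructor of GaussianMixtureModelEM in the PR. */
  val DefaultMaxIterations = 100

  /** Returns None when the argument count is wrong, so the caller can print usage. */
  def parse(args: Array[String]): Option[Params] = args match {
    case Array(input, k, tol) =>
      Some(Params(input, k.toInt, tol.toDouble, DefaultMaxIterations))
    case Array(input, k, tol, iters) =>
      Some(Params(input, k.toInt, tol.toDouble, iters.toInt))
    case _ => None
  }
}
```

    The usage string would then read something like "usage: DenseGmmEM <input file> <k> <convergenceTol> [maxIterations]".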



[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs

Posted by tgaloppo <gi...@git.apache.org>.

Github user tgaloppo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3022#discussion_r22017550
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala ---
    @@ -0,0 +1,284 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix}
    +import breeze.linalg.Transpose
    +
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
    +import org.apache.spark.mllib.stat.impl.MultivariateGaussian
    +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
    +import org.apache.spark.SparkContext.DoubleAccumulatorParam
    +
    +/**
    + * This class performs expectation maximization for multivariate Gaussian
    + * Mixture Models (GMMs).  A GMM represents a composite distribution of
    + * independent Gaussian distributions with associated "mixing" weights
    + * specifying each's contribution to the composite.
    + *
    + * Given a set of sample points, this class will maximize the log-likelihood 
    + * for a mixture of k Gaussians, iterating until the log-likelihood changes by 
    + * less than convergenceTol, or until it has reached the max number of iterations.
    + * While this process is generally guaranteed to converge, it is not guaranteed
    + * to find a global optimum.  
    + * 
    + * @param k The number of independent Gaussians in the mixture model
    + * @param convergenceTol The maximum change in log-likelihood at which convergence
    + * is considered to have occurred.
    + * @param maxIterations The maximum number of iterations to perform
    + */
    +class GaussianMixtureModelEM private (
    +    private var k: Int, 
    +    private var convergenceTol: Double, 
    +    private var maxIterations: Int) extends Serializable {
    +      
    +  // Type aliases for convenience
    +  private type DenseDoubleVector = BreezeVector[Double]
    +  private type DenseDoubleMatrix = BreezeMatrix[Double]
    +  
    +  private type ExpectationSum = (
    +    Array[Double], // log-likelihood in index 0
    +    Array[Double], // array of weights
    +    Array[DenseDoubleVector], // array of means
    +    Array[DenseDoubleMatrix]) // array of cov matrices
    +  
    +  // create a zero'd ExpectationSum instance
    +  private def zeroExpectationSum(k: Int, d: Int): ExpectationSum = {
    +    (Array(0.0), 
    +      new Array[Double](k),
    +      (0 until k).map(_ => BreezeVector.zeros[Double](d)).toArray,
    +      (0 until k).map(_ => BreezeMatrix.zeros[Double](d,d)).toArray)
    +  }
    +  
    +  // add two ExpectationSum objects (allowed to use modify m1)
    +  // (U, U) => U for aggregation
    +  private def addExpectationSums(m1: ExpectationSum, m2: ExpectationSum): ExpectationSum = {
    +    m1._1(0) += m2._1(0)
    +    for (i <- 0 until m1._2.length) {
    +      m1._2(i) += m2._2(i)
    +      m1._3(i) += m2._3(i)
    +      m1._4(i) += m2._4(i)
    +    }
    +    m1
    +  }
    +  
    +  // compute cluster contributions for each input point
    +  // (U, T) => U for aggregation
    +  private def computeExpectation(weights: Array[Double], dists: Array[MultivariateGaussian])
    +      (model: ExpectationSum, x: DenseDoubleVector): ExpectationSum = {
    +    val k = model._2.length
    +    val p = (0 until k).map(i => eps + weights(i) * dists(i).pdf(x)).toArray
    +    val pSum = p.sum
    +    model._1(0) += math.log(pSum)
    +    val xxt = x * new Transpose(x)
    +    for (i <- 0 until k) {
    +      p(i) /= pSum
    +      model._2(i) += p(i)
    +      model._3(i) += x * p(i)
    +      model._4(i) += xxt * p(i)
    +    }
    +    model
    +  }
    +  
    +  // number of samples per cluster to use when initializing Gaussians
    +  private val nSamples = 5
    +  
    +  // an initializing GMM can be provided rather than using the 
    +  // default random starting point
    +  private var initialGmm: Option[GaussianMixtureModel] = None
    +  
    +  /** A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold */
    +  def this() = this(2, 0.01, 100)
    +  
    +  /** Set the initial GMM starting point, bypassing the random initialization */
    +  def setInitialGmm(gmm: GaussianMixtureModel): this.type = {
    +    if (gmm.k == k) {
    +      initialGmm = Some(gmm)
    +    } else {
    +      throw new IllegalArgumentException("initialing GMM has mismatched cluster count (gmm.k != k)")
    +    }
    +    this
    +  }
    +  
    +  /** Return the user supplied initial GMM, if supplied */
    +  def getInitialiGmm: Option[GaussianMixtureModel] = initialGmm
    +  
    +  /** Set the number of Gaussians in the mixture model.  Default: 2 */
    +  def setK(k: Int): this.type = {
    +    this.k = k
    +    this
    +  }
    +  
    +  /** Return the number of Gaussians in the mixture model */
    +  def getK: Int = k
    +  
    +  /** Set the maximum number of iterations to run. Default: 100 */
    +  def setMaxIterations(maxIterations: Int): this.type = {
    +    this.maxIterations = maxIterations
    +    this
    +  }
    +  
    +  /** Return the maximum number of iterations to run */
    +  def getMaxIterations: Int = maxIterations
    +  
    +  /**
    +   * Set the largest change in log-likelihood at which convergence is 
    +   * considered to have occurred.
    +   */
    +  def setConvergenceTol(convergenceTol: Double): this.type = {
    +    this.convergenceTol = convergenceTol
    +    this
    +  }
    +  
    +  /** Return the largest change in log-likelihood at which convergence is
    +   *  considered to have occurred.
    +   */
    +  def getConvergenceTol: Double = convergenceTol
    +  
    +  /** Machine precision value used to ensure matrix conditioning */
    +  private val eps = math.pow(2.0, -52)
    +  
    +  /** Perform expectation maximization */
    +  def run(data: RDD[Vector]): GaussianMixtureModel = {
    +    val ctx = data.sparkContext
    +    
    +    // we will operate on the data as breeze data
    +    val breezeData = data.map(u => u.toBreeze.toDenseVector).cache()
    +    
    +    // Get length of the input vectors
    +    val d = breezeData.first.length 
    +    
    +    // gaussians will be array of (weight, mean, covariance) tuples.
    +    // If the user supplied an initial GMM, we use those values, otherwise
    +    // we start with uniform weights, a random mean from the data, and
    +    // diagonal covariance matrices using component variances
    +    // derived from the samples 
    +    var gaussians = initialGmm match {
    +      case Some(gmm) => (0 until k).map{ i =>
    +        (gmm.weight(i), gmm.mu(i).toBreeze.toDenseVector, gmm.sigma(i).toBreeze.toDenseMatrix)
    +      }.toArray
    +      
    +      case None => {
    +        // For each Gaussian, we will initialize the mean as the average
    +        // of some random samples from the data
    +        val samples = breezeData.takeSample(true, k * nSamples, scala.util.Random.nextInt)
    +          
    +        (0 until k).map{ i => 
    +          (1.0 / k, 
    +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    +        }.toArray
    +      }
    +    }
    +    
    +    var llh = Double.MinValue // current log-likelihood 
    +    var llhp = 0.0            // previous log-likelihood
    +    
    +    var iter = 0
    +    do {
    +      // pivot gaussians into weight and distribution arrays 
    +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    +      val dists = (0 until k).map{ i => 
    +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    --- End diff --
    
    I had considered using the pseudo-inverse here for that reason, but ultimately decided this was unlikely to cause a problem in practice. What do you think of using pinv instead?
    
    > On Dec 17, 2014, at 7:23 PM, jkbradley <no...@github.com> wrote:
    > 
    > In mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixtureModelEM.scala:
    > 
    > > +          (1.0 / k, 
    > > +            vectorMean(samples.slice(i * nSamples, (i + 1) * nSamples)), 
    > > +            initCovariance(samples.slice(i * nSamples, (i + 1) * nSamples)))
    > > +        }.toArray
    > > +      }
    > > +    }
    > > +    
    > > +    var llh = Double.MinValue // current log-likelihood 
    > > +    var llhp = 0.0            // previous log-likelihood
    > > +    
    > > +    var iter = 0
    > > +    do {
    > > +      // pivot gaussians into weight and distribution arrays 
    > > +      val weights = (0 until k).map(i => gaussians(i)._1).toArray
    > > +      val dists = (0 until k).map{ i => 
    > > +        new MultivariateGaussian(gaussians(i)._2, gaussians(i)._3)
    > This is brittle since it fails when the covariance matrix is not full rank. I'd say any of these is acceptable for now:
    > 
    > Temporary fix: Check for an exception and print a warning about the data not being full rank.
    > OK fix: Add a little epsilon smoothing.
    > Actual fix: Do a matrix decomposition (like Cholesky) instead of a direct inversion to handle non-full rank matrices.
    > —
    > Reply to this email directly or view it on GitHub.
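
    One point that may be worth checking when weighing pinv: the MultivariateGaussian diff in this thread also computes the normalizing constant from det(sigma), and a pseudo-inverse alone would not change that term. The sketch below reproduces only that expression with a hypothetical `normalizer` helper; it assumes the determinant of a rank-deficient covariance evaluates to exactly zero, which need not hold in floating point.

```scala
object SingularNormalizer {
  /** Normalizing constant of a d-dimensional Gaussian, given det(sigma). */
  def normalizer(d: Int, detSigma: Double): Double =
    math.pow(2.0 * math.Pi, -d / 2.0) * math.pow(detSigma, -0.5)
}
```

    With a zero determinant this evaluates to positive infinity, so the density is unusable even if pinv(sigma) itself is well defined.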

