Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/09/14 18:57:06 UTC

[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/2388

    [WIP][SPARK-1405][MLLIB]LDA based on Graphx

    cc @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark graphx_lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2388.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2388
    
----
commit 9860fd1f8dc969f905f1b3d1509214a817789a86
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-09-14T16:55:15Z

    LDA based on Graphx

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by witgo <gi...@git.apache.org>.
GitHub user witgo reopened a pull request:

    https://github.com/apache/spark/pull/2388

    [WIP][SPARK-1405][MLLIB] topic modeling on Graphx

    This PR relies on  #2631
    
    - [X] Topic de-duplication
    - [X] Support  100000 topics
    - [X] Asymmetric Dirichlet priors
    - [ ] Add the documentation
    - [X] Add infer interface
    - [X] Add unit tests
    - [X] Add the performance test 
    - [X] Optimizing the infer interface performance
    - [ ] Verifying the correctness of the algorithm
    
    
    The performance test:
    
    `2000` topics:
    
    Item | Value
    ------------ | -------------
    Cluster resources | 36 executors (36 cores, 216g memory)
    Corpus size | 253064 documents, 29696335 words
    Number of iterations | `105`
    Number of distinct terms | 75496
    Number of topics | `2000`
    alpha | 0.01
    beta | 0.01
    Running time | 37.1 minutes
    
    `10000` topics:
    
    Item | Value
    ------------ | -------------
    Cluster resources | 36 executors (36 cores, 216g memory)
    Corpus size | 253064 documents, 29696335 words
    Number of iterations | `105`
    Number of distinct terms | 75496
    Number of topics | `10000`
    alpha | 0.01
    beta | 0.01
    Running time | 49 minutes
    
    
    `100000` topics:
    
    Item | Value
    ------------ | -------------
    Cluster resources | 36 executors (36 cores, 216g memory)
    Corpus size | 253064 documents, 29696335 words
    Number of iterations | `105`
    Number of distinct terms | 75496
    Number of topics | `100000`
    alpha | 0.1
    beta | 0.01
    Running time | 268.9 minutes
    
    conf/spark-defaults.conf:
    ```
    spark.akka.frameSize   20
    spark.executor.instances 36
    spark.rdd.compress true
    spark.executor.memory   6g
    spark.default.parallelism  72
    spark.broadcast.blockSize  8192
    spark.storage.memoryFraction 0.4
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.apache.spark.mllib.feature.TopicModelingKryoRegistrator
    ```
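
For a rough comparison of the three runs reported above, the figures can be turned into token-sampling steps per second. This is a back-of-the-envelope sketch that assumes every word in the corpus is resampled exactly once per iteration, which the PR description does not state; `LdaThroughput` and `tokensPerSecond` are hypothetical names used only for this illustration.

```scala
object LdaThroughput {
  // Figures copied from the performance tables in the PR description.
  val corpusTokens: Long = 29696335L
  val iterations: Int = 105
  // number of topics -> reported running time in minutes
  val runs: Seq[(Int, Double)] = Seq(2000 -> 37.1, 10000 -> 49.0, 100000 -> 268.9)

  /** Sampling steps per second, ASSUMING one resample per token per iteration. */
  def tokensPerSecond(minutes: Double): Double =
    corpusTokens.toDouble * iterations / (minutes * 60.0)

  def main(args: Array[String]): Unit = {
    runs.foreach { case (topics, minutes) =>
      println(f"$topics%6d topics: ${tokensPerSecond(minutes)}%.0f tokens/s")
    }
    // spark.executor.instances (36) times spark.executor.memory (6g)
    println(s"total executor memory: ${36 * 6}g")
  }
}
```

The last line is only a consistency check: 36 executors at 6g each matches the 216g listed under cluster resources.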

----
commit ca8e6f296a2f7ed674dd3a5cde49d4301d3d6d14
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-10-08T08:10:12Z

    topic modeling on Graphx

----




[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-72069846
  
    Here is a sample of a faster branch (work in progress): 
    https://github.com/witgo/spark/tree/lda_MH




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58330999
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21455/consoleFull) for   PR 2388 at commit [`ca8e6f2`](https://github.com/apache/spark/commit/ca8e6f296a2f7ed674dd3a5cde49d4301d3d6d14).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`
      * `  class SparkIMain(`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56302557
  
    **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20620/consoleFull)** after a configured wait of `120m`.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57588021
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21174/consoleFull) for   PR 2388 at commit [`99945ce`](https://github.com/apache/spark/commit/99945ce52e7559728191226fbc21a2a592591ceb).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57588026
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21174/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57440229
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21101/consoleFull) for   PR 2388 at commit [`84f51e3`](https://github.com/apache/spark/commit/84f51e3857f6ffae0584100f53ac7e68767ba060).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.
Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18195844
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    +     * var topicThisTerm = BDV.zeros[Double](numTopics)
    +     * while (i < numTopics) {
    --- End diff --
    
    `adjustment` indicates that, if this is the current topic, you need to subtract 1 from the corresponding terms.
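
The role of `adjustment` is easiest to see in the textbook collapsed Gibbs update with symmetric priors, which is simpler than the asymmetric-prior expressions in the diff above. The following is a minimal sketch under that simpler formula, not the PR's implementation; `GibbsConditional`, `weights`, and the parameter names are illustrative and do not come from the PR.

```scala
object GibbsConditional {
  /**
   * Unnormalized sampling weight of each topic for a single token,
   * using symmetric priors alpha (document-topic) and beta (topic-term).
   * The token's own current assignment is excluded from the counts by
   * adding an adjustment of -1 for that topic and 0 for every other topic.
   */
  def weights(
      docTopic: Array[Int],   // tokens in this document assigned to each topic
      termTopic: Array[Int],  // occurrences of this term assigned to each topic
      topicTotal: Array[Int], // all tokens assigned to each topic
      currentTopic: Int,      // topic currently assigned to the token
      numTerms: Int,          // vocabulary size
      alpha: Double,
      beta: Double): Array[Double] = {
    Array.tabulate(docTopic.length) { k =>
      val adjustment = if (k == currentTopic) -1 else 0
      (docTopic(k) + adjustment + alpha) * (termTopic(k) + adjustment + beta) /
        (topicTotal(k) + adjustment + numTerms * beta)
    }
  }
}
```

With `currentTopic = 0`, only topic 0 has 1 subtracted from its three counts; every other topic uses the raw counts.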




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57260813
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21015/




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on GraphX

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-62077399
  
    @witgo  Thanks for the PR!  This looks like a very featureful implementation, but I think it will require some refactoring to fit in well with future development.  I'll give some high-level comments for now, and can perhaps do a lower-level pass later on.
    
    **APIs**
    
    I suspect we'll have other types of topic modeling in the future, not just LDA.  It would be great to think ahead for that.  The simplest way is probably to rename everything as "LDA", not "topic modeling," and to minimize the public API.  (Other topic models we might want later are LSA, PLSA, HDP, CTM, etc.)
    
    This should probably go under "clustering" instead of "feature."
    
    **Code organization**
    
    Some of the code is more general than LDA and could go elsewhere in MLlib.  E.g., some of the sampling methods could go in stat/, as could minMaxIndexSearch, minMaxValueSearch, etc. (or can those be replaced using existing generic methods in Scala or Java?).
    
    **Documentation and code clarity**
    
    The current thing making this hardest to review is the lack of documentation and the difficulty in understanding what each value and method does.  For documentation, it will be helpful to see comments for all classes and methods, and also inline comments explaining code where needed.  For code clarity, using more descriptive variable and method names will help a lot.
    
    **Other thoughts**
    
    It would be nice to remove some experimental items (such as mergeDuplicateTopic) for now.
    
    Thanks again!
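
On the question raised under "Code organization" of whether helpers such as minMaxIndexSearch could be replaced by existing generic methods: this thread does not show what those helpers do, so the following is only a sketch under the assumption that they locate a position in a sorted array of running totals, as a roulette-style sampler would need. `java.util.Arrays.binarySearch` is a standard JDK method; `CumulativeSearch`, `cumulative`, and `firstGreater` are hypothetical names.

```scala
import java.util.Arrays

object CumulativeSearch {
  /** Running totals of non-negative weights: out(i) = weights(0) + ... + weights(i). */
  def cumulative(weights: Array[Double]): Array[Double] =
    weights.scanLeft(0.0)(_ + _).tail

  /** Smallest index i with cdf(i) > u, clamped to the last index. */
  def firstGreater(cdf: Array[Double], u: Double): Int = {
    // binarySearch returns the index of a match, or -(insertionPoint) - 1 if absent.
    val found = Arrays.binarySearch(cdf, u)
    var i = if (found >= 0) found else -found - 1
    // On an exact hit (or a run of equal totals), step past entries that are <= u.
    while (i < cdf.length - 1 && cdf(i) <= u) i += 1
    math.min(i, cdf.length - 1)
  }
}
```

To draw a topic, one would pass `u = rand.nextDouble() * cdf.last`; topics with zero weight are never returned because their running total equals that of the preceding entry.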




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56504704
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20700/consoleFull) for   PR 2388 at commit [`dfc83fe`](https://github.com/apache/spark/commit/dfc83feb1546dfc3ed1be615a28ebef60e145cb5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58862848
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21683/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57210710
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20979/consoleFull) for   PR 2388 at commit [`13d2996`](https://github.com/apache/spark/commit/13d29968b6f732ba25aa5c2c5fdde8cf5eda86f1).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55532086
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20309/consoleFull) for   PR 2388 at commit [`5fa02ef`](https://github.com/apache/spark/commit/5fa02ef4d5ff2cd24242ab397670adff3f2efe5a).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56497649
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20700/consoleFull) for   PR 2388 at commit [`dfc83fe`](https://github.com/apache/spark/commit/dfc83feb1546dfc3ed1be615a28ebef60e145cb5).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55610482
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20340/consoleFull) for   PR 2388 at commit [`3738e74`](https://github.com/apache/spark/commit/3738e7455a8ac6c8afb3daea8f438255383cf386).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57609163
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21195/consoleFull) for   PR 2388 at commit [`4e17606`](https://github.com/apache/spark/commit/4e17606b3508a9208f42a6304c19ca970e07dbea).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57043654
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20906/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57433700
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21101/consoleFull) for   PR 2388 at commit [`84f51e3`](https://github.com/apache/spark/commit/84f51e3857f6ffae0584100f53ac7e68767ba060).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on GraphX

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-62084868
  
      [Test build #512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/512/consoleFull) for   PR 2388 at commit [`fe40445`](https://github.com/apache/spark/commit/fe404451e1d7cf15a336160732a5d2a94c99a877).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by debasish83 <gi...@git.apache.org>.
Github user debasish83 commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-62691464
  
    @jkbradley we support LSA (sparse coding) and PLSA through https://github.com/apache/spark/pull/3221...




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57210727
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20979/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56475768
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20686/consoleFull) for   PR 2388 at commit [`bf84e7b`](https://github.com/apache/spark/commit/bf84e7b87306dbe453077727be4a94fec40da417).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`





[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-72621141
  
    @mengxr  
    I created a JIRA: [SPARK-5556](https://issues.apache.org/jira/browse/SPARK-5556).




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58746403
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21642/consoleFull) for   PR 2388 at commit [`b0734b8`](https://github.com/apache/spark/commit/b0734b86ab95774aec79af55d9de48b363fe243b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-60951854
  
    retest this please




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56270363
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20610/consoleFull) for   PR 2388 at commit [`d407854`](https://github.com/apache/spark/commit/d407854fa8cdaeb6bc1c00d283d6990b8d27cade).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57594574
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21189/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56653261
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20747/consoleFull) for   PR 2388 at commit [`7bc691a`](https://github.com/apache/spark/commit/7bc691ab142edba8a127937dfbd836d5738f6527).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57042479
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20906/consoleFull) for   PR 2388 at commit [`ebb86a0`](https://github.com/apache/spark/commit/ebb86a01774b00005a180a2289a8417276157403).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56503577
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20699/consoleFull) for   PR 2388 at commit [`61ed81f`](https://github.com/apache/spark/commit/61ed81f26565505c031d86c9e8fdc6d65dfa8ebd).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58680646
  
    Found a bug, and I'm trying to figure out what the problem is.
    I'll close this temporarily.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57584038
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21174/consoleFull) for   PR 2388 at commit [`99945ce`](https://github.com/apache/spark/commit/99945ce52e7559728191226fbc21a2a592591ceb).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-72616260
  
    @witgo We've merged #4047 and closed this PR. Thanks for your contribution! Please create JIRAs and propose new features that can be added to the LDA implementation in master.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55534091
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20309/consoleFull) for   PR 2388 at commit [`5fa02ef`](https://github.com/apache/spark/commit/5fa02ef4d5ff2cd24242ab397670adff3f2efe5a).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56647000
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20747/consoleFull) for   PR 2388 at commit [`7bc691a`](https://github.com/apache/spark/commit/7bc691ab142edba8a127937dfbd836d5738f6527).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57609171
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21195/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57602181
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21195/consoleFull) for   PR 2388 at commit [`4e17606`](https://github.com/apache/spark/commit/4e17606b3508a9208f42a6304c19ca970e07dbea).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55533316
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20308/consoleFull) for   PR 2388 at commit [`9860fd1`](https://github.com/apache/spark/commit/9860fd1f8dc969f905f1b3d1509214a817789a86).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56272823
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20611/consoleFull) for   PR 2388 at commit [`14903b1`](https://github.com/apache/spark/commit/14903b1b905026e29a58d13dd0da8a1dc4fb6d25).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],`





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18768316
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/TopicModeling.scala ---
    @@ -0,0 +1,682 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.util.Random
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, sum => brzSum}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.graphx._
    +import org.apache.spark.Logging
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.SparkContext._
    +
    +import TopicModeling._
    +
    +class TopicModeling private[mllib](
    +  @transient var corpus: Graph[VD, ED],
    +  val numTopics: Int,
    +  val numTerms: Int,
    +  val alpha: Double,
    +  val beta: Double,
    +  @transient val storageLevel: StorageLevel)
    +  extends Serializable with Logging {
    +
    +  def this(docs: RDD[(TopicModeling.DocId, SSV)],
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double,
    +    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    +    computedModel: Broadcast[TopicModel] = null) {
    +    this(initializeCorpus(docs, numTopics, storageLevel, computedModel),
    +      numTopics, docs.first()._2.size, alpha, beta, storageLevel)
    +  }
    +
    +
    +  /**
    +   * The number of documents in the corpus
    +   */
    +  val numDocs = docVertices.count()
    +
    +  /**
    +   * The total number of tokens (term occurrences) in the corpus
    +   */
    +  private val sumTerms = corpus.edges.map(e => e.attr.size.toDouble).sum().toLong
    +
    +  /**
    +   * The total counts for each topic
    +   */
    +  @transient private var globalTopicCounter: BDV[Count] = collectGlobalCounter(corpus, numTopics)
    +  assert(brzSum(globalTopicCounter) == sumTerms)
    +
    +  @transient private val sc = corpus.vertices.context
    +  @transient private val seed = new Random().nextInt()
    +  @transient private var innerIter = 1
    +  @transient private var cachedEdges: EdgeRDD[ED, VD] = corpus.edges
    +  @transient private var cachedVertices: VertexRDD[VD] = corpus.vertices
    +
    +  private def termVertices = corpus.vertices.filter(t => t._1 >= 0)
    +
    +  private def docVertices = corpus.vertices.filter(t => t._1 < 0)
    +
    +  private def checkpoint(): Unit = {
    +    if (innerIter % 10 == 0 && sc.getCheckpointDir.isDefined) {
    +      val edges = corpus.edges.map(t => t)
    +      edges.checkpoint()
    +      val newCorpus: Graph[VD, ED] = Graph.fromEdges(edges, null,
    +        storageLevel, storageLevel)
    +      corpus = updateCounter(newCorpus, numTopics).cache()
    +    }
    +  }
    +
    +  private def gibbsSampling(): Unit = {
    +    val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter,
    +      sumTerms, numTerms, numTopics, alpha, beta)
    +
    +    val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter,
    +      sumTerms, innerIter + seed, numTerms, numTopics, alpha, beta)
    +    corpusSampleTopics.edges.setName(s"edges-$innerIter").cache().count()
    +    Option(cachedEdges).foreach(_.unpersist())
    +    cachedEdges = corpusSampleTopics.edges
    +
    +    corpus = updateCounter(corpusSampleTopics, numTopics)
    +    corpus.vertices.setName(s"vertices-$innerIter").cache()
    +    globalTopicCounter = collectGlobalCounter(corpus, numTopics)
    +    assert(brzSum(globalTopicCounter) == sumTerms)
    +    Option(cachedVertices).foreach(_.unpersist())
    +    cachedVertices = corpus.vertices
    +
    +    checkpoint()
    +    innerIter += 1
    +  }
    +
    +  def saveTopicModel(burnInIter: Int): TopicModel = {
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    for (iter <- 1 to burnInIter) {
    +      logInfo("Save TopicModel (Iteration %d/%d)".format(iter, burnInIter))
    +      gibbsSampling()
    +      updateTopicModel(termVertices, topicModel)
    +    }
    +    topicModel.gtc :/= burnInIter.toDouble
    +    topicModel.ttc.foreach(_ :/= burnInIter.toDouble)
    +    topicModel
    +  }
    +
    +  def runGibbsSampling(iterations: Int): Unit = {
    +    for (iter <- 1 to iterations) {
    +      logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, iterations))
    +      gibbsSampling()
    +    }
    +  }
    +
    +  @Experimental
    +  def mergeDuplicateTopic(threshold: Double = 0.95D): Map[Int, Int] = {
    +    val rows = termVertices.map(t => t._2.counter).map { bsv =>
    +      val length = bsv.length
    +      val used = bsv.used
    +      val index = bsv.index.slice(0, used)
    +      val data = bsv.data.slice(0, used).map(_.toDouble)
    +      new SSV(length, index, data).asInstanceOf[SV]
    +    }
    +    val simMatrix = new RowMatrix(rows).columnSimilarities()
    +    val minMap = simMatrix.entries.filter { case MatrixEntry(row, column, sim) =>
    +      sim > threshold && row != column
    +    }.map { case MatrixEntry(row, column, sim) =>
    +      (column.toInt, row.toInt)
    +    }.groupByKey().map { case (topic, simTopics) =>
    +      (topic, simTopics.min)
    +    }.collect().toMap
    +    if (minMap.size > 0) {
    +      corpus = corpus.mapEdges(edges => {
    +        edges.attr.map { topic =>
    +          minMap.get(topic).getOrElse(topic)
    +        }
    +      })
    +      corpus = updateCounter(corpus, numTopics)
    +    }
    +    minMap
    +  }
    +
    +  def perplexity(): Double = {
    +    val totalTopicCounter = this.globalTopicCounter
    +    val numTopics = this.numTopics
    +    val numTerms = this.numTerms
    +    val alpha = this.alpha
    +    val beta = this.beta
    +
    +    val newCounts = corpus.mapReduceTriplets[Int](triplet => {
    +      val size = triplet.attr.size
    +      val docId = triplet.dstId
    +      val wordId = triplet.srcId
    +      Iterator((docId, size), (wordId, size))
    +    }, (a, b) => a + b)
    +    val (termProb, totalNum) = corpus.outerJoinVertices(newCounts) {
    +      (_, f, n) =>
    +        (f.counter, n.get)
    +    }.mapTriplets {
    +      triplet =>
    +        val (termCounter, _) = triplet.srcAttr
    +        val (docTopicCounter, docTopicCount) = triplet.dstAttr
    +        var probWord = 0D
    +        val size = triplet.attr.size
    +        (0 until numTopics).foreach {
    +          topic =>
    +            val phi = (termCounter(topic) + beta) / (totalTopicCounter(topic) + numTerms * beta)
    +            val theta = (docTopicCounter(topic) + alpha) / (docTopicCount + alpha * numTopics)
    +            probWord += phi * theta
    +        }
    +        (Math.log(probWord * size) * size, size)
    +    }.edges.map(t => t.attr).reduce {
    +      (lhs, rhs) =>
    +        (lhs._1 + rhs._1, lhs._2 + rhs._2)
    +    }
    +    math.exp(-1 * termProb / totalNum)
    +  }
    +}
    +
    +
    +object TopicModeling {
    +
    +  private[mllib] type DocId = VertexId
    +  private[mllib] type WordId = VertexId
    +  private[mllib] type Count = Int
    +  private[mllib] type ED = Array[Count]
    +
    +  private[mllib] case class VD(counter: BSV[Count], dist: BSV[Double], dist1: BSV[Double])
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnIn: Int = 5,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    require(totalIter > burnIn, "totalIter is less than burnIn")
    +    require(totalIter > 0, "totalIter is less than 0")
    +    require(burnIn > 0, "burnIn is less than 0")
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    topicModeling.runGibbsSampling(totalIter - burnIn)
    +    topicModeling.saveTopicModel(burnIn)
    +  }
    +
    +  def incrementalTrain(docs: RDD[(DocId, SSV)],
    +    computedModel: TopicModel,
    +    totalIter: Int = 150,
    +    burnIn: Int = 5): TopicModel = {
    +    require(totalIter > burnIn, "totalIter is less than burnIn")
    +    require(totalIter > 0, "totalIter is less than 0")
    +    require(burnIn > 0, "burnIn is less than 0")
    +    val numTopics = computedModel.ttc.size
    +    val alpha = computedModel.alpha
    +    val beta = computedModel.beta
    +
    +    val broadcastModel = docs.context.broadcast(computedModel)
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta,
    +      computedModel = broadcastModel)
    +    broadcastModel.unpersist()
    +    topicModeling.runGibbsSampling(totalIter - burnIn)
    +    topicModeling.saveTopicModel(burnIn)
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BDV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    val newVD = graph.vertices.filter(_._1 >= 0).map { v =>
    +      val vertexId = v._1
    +      val termTopicCounter = v._2.counter
    +      termTopicCounter.compact()
    +      val length = termTopicCounter.length
    +      val used = termTopicCounter.used
    +      val index = termTopicCounter.index
    +      val data = termTopicCounter.data
    +      val w = new Array[Double](used)
    +      val w1 = new Array[Double](used)
    +
    +      var wi = 0D
    +      var i = 0
    +
    +      while (i < used) {
    +        val topic = index(i)
    +        val count = data(i)
    +        var adjustment = 0D
    +        val alphaAS = alpha
    +
    +        w(i) = count * ((totalTopicCounter(topic) * (alpha * numTopics)) +
    +          (alpha * numTopics) * (adjustment + alphaAS) +
    +          adjustment * (sumTerms - 1 + (alphaAS * numTopics))) /
    +          (totalTopicCounter(topic) + (numTerms * beta)) /
    --- End diff --
    
    There is a small bug.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57500265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21127/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.
Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18196214
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    +     * var topicThisTerm = BDV.zeros[Double](numTopics)
    +     * while (i < numTopics) {
    +     * val adjustment = if (i == currentTopic) -1 else 0
    +     * val ratio = (globalTopicCounter(i) + alpha) / (sumTerms + adjustment + (alpha * numTopics))
    +     * val asPrior = ratio * (alpha * numTopics)
    +     * topicThisTerm(i) = (termTopicCounter(i) + adjustment + beta) /
    +     * (globalTopicCounter(i) + adjustment + (numTerms * beta)) *
    +     * (docTopicCounter(i) + adjustment + asPrior) /
    +     * (bsum(docTopicCounter) + adjustment + alpha * numTopics)
    +     *
    +     * }
    +     *
    --- End diff --
    
    If you want to know more about the code above, you can refer to the following formula.
    First, the original sampling formula is: <img src="http://chart.googleapis.com/chart?cht=tx&chl=\Large x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}" style="border:none;">
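
Transcribing the commented-out pseudocode quoted in the diff above into math notation may make it easier to follow. The symbols are assumptions introduced here, not notation from the PR: K is `numTopics`, V is `numTerms`, N is `sumTerms`, n_k is `globalTopicCounter(k)`, n_{wk} is `termTopicCounter(k)`, n_{dk} is `docTopicCounter(k)`, n_d is `bsum(docTopicCounter)`, and \delta_k is `adjustment` (-1 when k is the token's current topic, 0 otherwise).

```latex
r_k = \frac{n_k + \alpha}{N + \delta_k + \alpha K}
\qquad
\text{asPrior}_k = \alpha K \, r_k

\text{topicThisTerm}(k) =
  \frac{n_{wk} + \delta_k + \beta}{n_k + \delta_k + V\beta}
  \cdot
  \frac{n_{dk} + \delta_k + \text{asPrior}_k}{n_d + \delta_k + \alpha K}
```

This is only a restatement of the pseudocode in the comment block; it does not describe the factored `d`, `w`, `t` terms that the surrounding functions precompute.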
             


[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-62821726
  
    @debasish83 Yep, that's one of the reasons I wanted this renamed something more specific.  @witgo Thanks for making that change.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56496236
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20699/consoleFull) for   PR 2388 at commit [`61ed81f`](https://github.com/apache/spark/commit/61ed81f26565505c031d86c9e8fdc6d65dfa8ebd).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56982173
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20860/consoleFull) for   PR 2388 at commit [`d6b4afb`](https://github.com/apache/spark/commit/d6b4afb3b6e494fd5a483b32c646ce8af98037c2).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56082465
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20541/consoleFull) for   PR 2388 at commit [`0dd8ad0`](https://github.com/apache/spark/commit/0dd8ad02459adbc6f72dbd7ffd774eaa7ad714fe).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],`



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58735791
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21614/consoleFull) for   PR 2388 at commit [`daf0787`](https://github.com/apache/spark/commit/daf07871fabaefb798c7c3f8dc9121eeee1246af).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56072999
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20541/consoleFull) for   PR 2388 at commit [`0dd8ad0`](https://github.com/apache/spark/commit/0dd8ad02459adbc6f72dbd7ffd774eaa7ad714fe).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56472988
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20686/consoleFull) for   PR 2388 at commit [`bf84e7b`](https://github.com/apache/spark/commit/bf84e7b87306dbe453077727be4a94fec40da417).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56991115
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20860/consoleFull) for   PR 2388 at commit [`d6b4afb`](https://github.com/apache/spark/commit/d6b4afb3b6e494fd5a483b32c646ce8af98037c2).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55617366
  
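    A brief explanation:
    
    * Graph structure
      
    	Vertices are terms (the source vertex) and documents (the target vertex). Each edge holds the topics assigned to a term's occurrences in a document (an array of topic IDs).
    
    *  Training process
    
    	1.  Initialization: build `RDD[Edge[ED]]` from each document's sparse term vector. The edge attributes (stored as arrays) are initialized from a uniform distribution.
    
    	2.  From the edge attributes (topic arrays), build the vertex attributes (document-topic or term-topic counts, stored as sparse vectors) and the corpus topic counts (stored as a vector).
    
    	3.  Run Gibbs sampling using the vertex attributes (document-topic and term-topic counts) and the corpus topic counts, and use the sampled results as the new edge attributes.
    
    	4.  Repeat steps 2 and 3 a suitable number of times.
    
    	5.  Initialize the `TopicModel` class from the term vertices' attributes (term-topic counts) and the corpus topic counts.
    
    *   Inference process
    
    	1. Initialize the document-topic distribution to a uniform distribution.
    
    	2. Run Gibbs sampling with the `TopicModel` class (term-topic counts, corpus topic counts, and document-topic counts) to obtain a new document-topic distribution.
    
    	3. Repeat step 2 `totalIter` times and output the average over the last `burnInIter` iterations.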
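
The training steps above can be sketched on a single machine roughly as follows. This is a minimal illustration, not the PR's implementation: it uses plain Scala collections instead of GraphX, assumes symmetric priors `alpha` and `beta`, and every name in it (`GibbsSketch`, `Edge`, `Counters`, `initEdges`, `updateCounters`, `sampleTopics`, `train`) is hypothetical. As in the description, all tokens are resampled against the counters from the previous sweep, and the counters are rebuilt afterwards.

```scala
import scala.collection.mutable
import scala.util.Random

object GibbsSketch {
  // One edge per (term, document) pair; `topics` holds one topic id per occurrence.
  final case class Edge(termId: Int, docId: Int, topics: Array[Int])

  // Term-topic counts, document-topic counts and corpus-wide topic counts.
  final case class Counters(
      termTopic: Map[Int, Array[Int]],
      docTopic: Map[Int, Array[Int]],
      global: Array[Int])

  // Step 1: build edges from (docId, termId -> count) and draw topics uniformly.
  def initEdges(docs: Seq[(Int, Map[Int, Int])], numTopics: Int, rand: Random): Seq[Edge] =
    for {
      (docId, termCounts) <- docs
      (termId, count) <- termCounts.toSeq
    } yield Edge(termId, docId, Array.fill(count)(rand.nextInt(numTopics)))

  // Step 2: rebuild all counters from the topic arrays stored on the edges.
  def updateCounters(edges: Seq[Edge], numTopics: Int): Counters = {
    val termTopic = mutable.Map.empty[Int, Array[Int]]
    val docTopic = mutable.Map.empty[Int, Array[Int]]
    val global = new Array[Int](numTopics)
    for (e <- edges; t <- e.topics) {
      val tt = termTopic.getOrElseUpdate(e.termId, new Array[Int](numTopics))
      val dt = docTopic.getOrElseUpdate(e.docId, new Array[Int](numTopics))
      tt(t) += 1
      dt(t) += 1
      global(t) += 1
    }
    Counters(termTopic.toMap, docTopic.toMap, global)
  }

  // Step 3: resample every token's topic from the counters of the previous sweep.
  def sampleTopics(edges: Seq[Edge], c: Counters, numTerms: Int, numTopics: Int,
      alpha: Double, beta: Double, rand: Random): Seq[Edge] =
    edges.map { e =>
      val tt = c.termTopic(e.termId)
      val dt = c.docTopic(e.docId)
      val newTopics = e.topics.map { old =>
        val cumulative = new Array[Double](numTopics)
        var sum = 0.0
        var k = 0
        while (k < numTopics) {
          val adj = if (k == old) 1 else 0 // leave out the token being resampled
          sum += (tt(k) - adj + beta) / (c.global(k) - adj + numTerms * beta) *
            (dt(k) - adj + alpha)
          cumulative(k) = sum
          k += 1
        }
        // Roulette-wheel draw over the cumulative weights.
        val u = rand.nextDouble() * sum
        val idx = cumulative.indexWhere(_ > u)
        if (idx >= 0) idx else numTopics - 1
      }
      Edge(e.termId, e.docId, newTopics)
    }

  // Step 4: alternate steps 3 and 2; the final counters are what step 5 would
  // hand to a topic model.
  def train(docs: Seq[(Int, Map[Int, Int])], numTerms: Int, numTopics: Int,
      iterations: Int, alpha: Double, beta: Double, seed: Long): Counters = {
    val rand = new Random(seed)
    var edges = initEdges(docs, numTopics, rand)
    var counters = updateCounters(edges, numTopics)
    for (_ <- 1 to iterations) {
      edges = sampleTopics(edges, counters, numTerms, numTopics, alpha, beta, rand)
      counters = updateCounters(edges, numTopics)
    }
    counters
  }
}
```

Calling `GibbsSketch.train` on a small corpus such as `Seq((0, Map(0 -> 2, 1 -> 1)), (1, Map(1 -> 3, 2 -> 1)))` returns counters whose totals equal the number of tokens, which is the same invariant the quoted diff checks with `assert(brzSum(globalTopicCounter) == sumTerms)`.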



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56991123
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20860/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57626066
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21199/consoleFull) for   PR 2388 at commit [`dbac77e`](https://github.com/apache/spark/commit/dbac77ed4c52173a57c7d5aa9d73d4f1489e7f9b).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `



[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58324547
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21455/consoleFull) for   PR 2388 at commit [`ca8e6f2`](https://github.com/apache/spark/commit/ca8e6f296a2f7ed674dd3a5cde49d4301d3d6d14).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56504711
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20700/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by witgo <gi...@git.apache.org>.
Github user witgo closed the pull request at:

    https://github.com/apache/spark/pull/2388


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56270391
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20610/consoleFull) for   PR 2388 at commit [`d407854`](https://github.com/apache/spark/commit/d407854fa8cdaeb6bc1c00d283d6990b8d27cade).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],`





[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-69236818
  
    @witgo  I’m submitting a simple PR for LDA which uses EM for learning.  I believe it would be good to support other learning methods such as Gibbs sampling (as in your PR), with the user selecting the learning method via an LDA parameter.  If you have feedback on my PR, especially the public API, please do let me know.  Thanks very much!
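
To make the idea of selecting the learning method through a parameter concrete, here is a minimal sketch in plain Scala. Every name in it (`LearningMethod`, `LdaParams`, `LdaDispatch`) is hypothetical and chosen only for illustration; it is not the public API of either PR, and no training is performed.

```scala
// Hypothetical names throughout; this is not the API of either PR.
sealed trait LearningMethod
case object EM extends LearningMethod
case object GibbsSampling extends LearningMethod

// The learning method is just one more field of the parameter object.
final case class LdaParams(
    numTopics: Int,
    maxIterations: Int = 20,
    method: LearningMethod = EM)

object LdaDispatch {
  /** Describes which routine would run; real training is out of scope here. */
  def describe(params: LdaParams): String = params.method match {
    case EM            => s"EM with ${params.numTopics} topics"
    case GibbsSampling => s"Gibbs sampling with ${params.numTopics} topics"
  }
}
```

Modelling the choice as a sealed trait lets the compiler check that every learning method is handled when dispatching; a string-valued parameter would be an equally plausible design.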





[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2388




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58736906
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21614/consoleFull) for   PR 2388 at commit [`daf0787`](https://github.com/apache/spark/commit/daf07871fabaefb798c7c3f8dc9121eeee1246af).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.
Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18197041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    +     * var topicThisTerm = BDV.zeros[Double](numTopics)
    +     * while (i < numTopics) {
    +     * val adjustment = if (i == currentTopic) -1 else 0
    +     * val ratio = (globalTopicCounter(i) + adjustment + alpha) / (sumTerms - 1 + (alpha * numTopics))
    +     * val asPrior = ratio * (alpha * numTopics)
    +     * topicThisTerm(i) = (termTopicCounter(i) + adjustment + beta) /
    +     * (globalTopicCounter(i) + adjustment + (numTerms * beta)) *
    +     * (docTopicCounter(i) + adjustment + asPrior) /
    +     * (bsum(docTopicCounter) - 1 + alpha * numTopics)
    +     *
    +     * }
    +     *
    --- End diff --
    
    To learn more about the code above, refer to the following formulas.
    First, the original sampling formula is:
    $$P(z^{(d)}_{n}|W, Z_{\backslash d,n}, \alpha u, \beta u)\propto P(w^{(d)}_{n}|z^{(d)}_{n},W_{\backslash d,n}, Z_{\backslash d,n}, \beta u) P(z^{(d)}_{n}|Z_{\backslash d,n}, \alpha u)$$     (1)
    Second, with the asymmetric Dirichlet priors, the second factor of formula (1) can be written as follows:
    $$P(z^{(d)}_{N_{d+1}}=t|Z, \alpha, \alpha^{'}u)=\int dm P(z^{(d)}_{N_{d+1}}=t|Z, \alpha m)P(m|Z, \alpha^{'}u)$$    (2)
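
The commented pseudocode quoted in the diff above can be turned into a small runnable sketch. The version below is an illustration only: it uses plain arrays instead of Breeze vectors, computes the dense per-topic weights directly rather than the cumulative `d`, `w`, `t` vectors the PR precomputes, and the names `AsymmetricPriorSketch`, `topicWeights`, and `sample` are hypothetical. The document length is taken as the sum of the document-topic counts, mirroring `bsum(docTopicCounter)` in the pseudocode.

```scala
object AsymmetricPriorSketch {
  /** Unnormalised weight of each topic for one token, following the commented pseudocode. */
  def topicWeights(
      termTopic: Array[Int],   // occurrences of this term per topic
      docTopic: Array[Int],    // tokens of this document per topic
      globalTopic: Array[Int], // tokens in the whole corpus per topic
      currentTopic: Int,       // topic currently assigned to the token
      sumTerms: Long,          // total number of tokens in the corpus
      numTerms: Int,           // vocabulary size
      alpha: Double,
      beta: Double): Array[Double] = {
    val numTopics = globalTopic.length
    val docLen = docTopic.sum
    Array.tabulate(numTopics) { i =>
      // Remove the token being resampled from the counts of its current topic.
      val adjustment = if (i == currentTopic) -1 else 0
      val ratio = (globalTopic(i) + adjustment + alpha) / (sumTerms - 1 + alpha * numTopics)
      val asPrior = ratio * (alpha * numTopics)
      (termTopic(i) + adjustment + beta) / (globalTopic(i) + adjustment + numTerms * beta) *
        (docTopic(i) + adjustment + asPrior) / (docLen - 1 + alpha * numTopics)
    }
  }

  /** Roulette-wheel draw: the first index whose cumulative weight exceeds the target. */
  def sample(weights: Array[Double], rand: java.util.Random): Int = {
    val target = rand.nextDouble() * weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(target < _)
    // Fall back to the last topic if rounding leaves the target uncovered.
    if (idx >= 0) idx else weights.length - 1
  }
}
```

A new topic for a token would then be drawn with `AsymmetricPriorSketch.sample(AsymmetricPriorSketch.topicWeights(...), rand)`. The dense form costs time linear in the number of topics per token, which is presumably why the PR works with precomputed cumulative vectors instead.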




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55621228
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20340/consoleFull) for   PR 2388 at commit [`3738e74`](https://github.com/apache/spark/commit/3738e7455a8ac6c8afb3daea8f438255383cf386).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SV)],`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57043652
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20906/consoleFull) for   PR 2388 at commit [`ebb86a0`](https://github.com/apache/spark/commit/ebb86a01774b00005a180a2289a8417276157403).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57589906
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21179/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56653270
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20747/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57440236
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21101/




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18740868
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/TopicModeling.scala ---
    @@ -0,0 +1,674 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.util.Random
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => brzSum}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.graphx._
    +import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +
    +import TopicModeling._
    +
    +class TopicModeling private[mllib](
    +  @transient var corpus: Graph[VD, ED],
    +  val numTopics: Int,
    +  val numTerms: Int,
    +  val alpha: Double,
    +  val beta: Double,
    +  @transient val storageLevel: StorageLevel)
    +  extends Serializable with Logging {
    +
    +  def this(docs: RDD[(TopicModeling.DocId, SSV)],
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double,
    +    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
    +    computedModel: Broadcast[TopicModel] = null) {
    +    this(initializeCorpus(docs, numTopics, storageLevel, computedModel),
    +      numTopics, docs.first()._2.size, alpha, beta, storageLevel)
    +  }
    +
    +
    +  /**
    +   * The number of documents in the corpus
    +   */
    +  val numDocs = docVertices.count()
    +
    +  /**
    +   * The number of terms in the corpus
    +   */
    +  private val sumTerms = corpus.edges.map(e => e.attr.size.toDouble).sum().toLong
    +
    +  /**
    +   * The total counts for each topic
    +   */
    +  @transient private var globalTopicCounter: BV[Count] = collectGlobalCounter(corpus, numTopics)
    +  assert(brzSum(globalTopicCounter) == sumTerms)
    +  @transient private val sc = corpus.vertices.context
    +  @transient private val seed = new Random().nextInt()
    +  @transient private var innerIter = 1
    +  @transient private var cachedEdges: EdgeRDD[ED, VD] = null
    +  @transient private var cachedVertices: VertexRDD[VD] = null
    +
    +  private def termVertices = corpus.vertices.filter(t => t._1 >= 0)
    +
    +  private def docVertices = corpus.vertices.filter(t => t._1 < 0)
    +
    +  private def gibbsSampling(cachedEdges: EdgeRDD[ED, VD],
    +    cachedVertices: VertexRDD[VD]): (EdgeRDD[ED, VD], VertexRDD[VD]) = {
    +
    +    val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter,
    +      sumTerms, numTerms, numTopics, alpha, beta)
    +
    +    val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter,
    +      sumTerms, innerIter + seed, numTerms, numTopics, alpha, beta)
    +    corpusSampleTopics.edges.setName(s"edges-$innerIter").cache().count()
    +    Option(cachedEdges).foreach(_.unpersist())
    +    val edges = corpusSampleTopics.edges
    +
    +    corpus = updateCounter(corpusSampleTopics, numTopics)
    +    corpus.vertices.setName(s"vertices-$innerIter").cache()
    +    globalTopicCounter = collectGlobalCounter(corpus, numTopics)
    +    assert(brzSum(globalTopicCounter) == sumTerms)
    +    Option(cachedVertices).foreach(_.unpersist())
    +    val vertices = corpus.vertices
    +
    +    if (innerIter % 10 == 0 && sc.getCheckpointDir.isDefined) {
    --- End diff --
    
    This is only a temporary solution; see the related PR #2631.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56299008
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20620/consoleFull) for   PR 2388 at commit [`f775916`](https://github.com/apache/spark/commit/f77591669634c1e2f6c64296fe26309c8740006b).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57260810
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21015/consoleFull) for   PR 2388 at commit [`298c720`](https://github.com/apache/spark/commit/298c7207119002552ec929e3a4d8a32c747ab07e).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57594202
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21187/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55586887
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20335/consoleFull) for   PR 2388 at commit [`dc7ef13`](https://github.com/apache/spark/commit/dc7ef13c9b5b58cb7b0e12f586432e3140644b10).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55595533
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20335/consoleFull) for   PR 2388 at commit [`dc7ef13`](https://github.com/apache/spark/commit/dc7ef13c9b5b58cb7b0e12f586432e3140644b10).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val tokens: RDD[(TopicModeling.WordId, TopicModeling.DocId)],`





[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57200134
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20979/consoleFull) for   PR 2388 at commit [`13d2996`](https://github.com/apache/spark/commit/13d29968b6f732ba25aa5c2c5fdde8cf5eda86f1).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56271458
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20611/consoleFull) for   PR 2388 at commit [`14903b1`](https://github.com/apache/spark/commit/14903b1b905026e29a58d13dd0da8a1dc4fb6d25).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.
Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18195829
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    +     * var topicThisTerm = BDV.zeros[Double](numTopics)
    +     * while (i < numTopics) {
    +     * val adjustment = if (i == currentTopic) -1 else 0
    --- End diff --
    
    `adjustment` is -1 when `i` is the current topic and 0 otherwise: the token being resampled must be subtracted from the corresponding counts.
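
A minimal sketch of the adjustment described in the comment above, assuming the textbook collapsed Gibbs conditional with symmetric priors rather than the asymmetric-prior formulas used in this patch. The names `GibbsAdjustmentSketch`, `topicWeights`, and `sample` are hypothetical and do not appear in the PR.

```scala
import scala.util.Random

object GibbsAdjustmentSketch {

  /** Unnormalized topic weights for one token (symmetric priors assumed). */
  def topicWeights(
      docTopic: Array[Int],   // tokens of this document assigned to each topic
      termTopic: Array[Int],  // occurrences of this term assigned to each topic
      topicTotal: Array[Int], // tokens in the whole corpus assigned to each topic
      currentTopic: Int,
      numTerms: Int,
      alpha: Double,
      beta: Double): Array[Double] = {
    Array.tabulate(docTopic.length) { k =>
      // The token being resampled is still counted under its current topic,
      // so take it out of all three counts for that topic only.
      val adjustment = if (k == currentTopic) -1 else 0
      (docTopic(k) + adjustment + alpha) *
        (termTopic(k) + adjustment + beta) /
        (topicTotal(k) + adjustment + numTerms * beta)
    }
  }

  /** Roulette-wheel draw: returns k with probability weights(k) / weights.sum. */
  def sample(weights: Array[Double], rand: Random): Int = {
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val u = rand.nextDouble() * cumulative.last
    val idx = cumulative.indexWhere(_ > u)
    // Fall back to the last topic if rounding leaves no bucket above u.
    if (idx >= 0) idx else weights.length - 1
  }
}
```

With a single token in the corpus, currently assigned to topic 0, the adjustment removes that token from the counts, so both topics end up with the same weight. Without the adjustment, the token would bias the draw toward its own current assignment.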




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56503579
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20699/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57616946
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21199/consoleFull) for   PR 2388 at commit [`dbac77e`](https://github.com/apache/spark/commit/dbac77ed4c52173a57c7d5aa9d73d4f1489e7f9b).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-55531748
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20308/consoleFull) for   PR 2388 at commit [`9860fd1`](https://github.com/apache/spark/commit/9860fd1f8dc969f905f1b3d1509214a817789a86).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58331009
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21455/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58736908
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21614/




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58862844
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21683/consoleFull) for   PR 2388 at commit [`1e2485c`](https://github.com/apache/spark/commit/1e2485c05c77dbca4332b9af616c27c45f2f5e32).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `
      * `class StreamingContext(object):`
      * `class DStream(object):`
      * `class TransformedDStream(DStream):`
      * `class TransformFunction(object):`
      * `class TransformFunctionSerializer(object):`





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58746407
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21642/




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58735730
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57433219
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21100/consoleFull) for   PR 2388 at commit [`e00f5a6`](https://github.com/apache/spark/commit/e00f5a6352a2bbc7d8c5cdd87079f1d55a0b910a).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57260729
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21015/consoleFull) for   PR 2388 at commit [`298c720`](https://github.com/apache/spark/commit/298c7207119002552ec929e3a4d8a32c747ab07e).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57491687
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21127/consoleFull) for   PR 2388 at commit [`accf8bd`](https://github.com/apache/spark/commit/accf8bdff0fd65526c436c22d4e69fd3c8927c89).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.
Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18195761
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    --- End diff --
    
    `adjustment` handles the case where this is the token's current topic: 1 must then be subtracted from the corresponding terms.
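
The role of `adjustment` can be illustrated with a minimal sketch. It uses the textbook collapsed Gibbs weight for LDA with symmetric priors, not the asymmetric-prior decomposition in this patch, and it assumes the counters still include the token being resampled. The object name `AdjustmentSketch` and the helpers `topicWeight` and `sample` are hypothetical and do not appear in the PR.

```scala
object AdjustmentSketch {
  /** Unnormalized weight of `topic` for one token (symmetric priors assumed). */
  def topicWeight(
      docTopic: Array[Int],   // topic counts for the token's document
      termTopic: Array[Int],  // topic counts for the token's term
      totalTopic: Array[Int], // topic counts over the whole corpus
      numTerms: Int,
      alpha: Double,
      beta: Double,
      topic: Int,
      currentTopic: Int): Double = {
    // The counters still include the token being resampled, so take it
    // back out of the one topic it is currently assigned to.
    val adjustment = if (topic == currentTopic) -1.0 else 0.0
    (docTopic(topic) + adjustment + alpha) *
      (termTopic(topic) + adjustment + beta) /
      (totalTopic(topic) + adjustment + numTerms * beta)
  }

  /** Roulette-wheel draw: `u` in [0, 1) picks a topic in proportion to its weight. */
  def sample(weights: Array[Double], u: Double): Int = {
    val target = u * weights.sum
    var acc = 0.0
    var k = 0
    // Walk the cumulative sum until it passes the target.
    while (k < weights.length - 1 && acc + weights(k) <= target) {
      acc += weights(k)
      k += 1
    }
    k
  }
}
```

In the patch, the `adjustment = 0` and `adjustment = -1` values appear to be precomputed for every topic, and only their difference (`d1`, `w1`, `t1`) is passed to the sampler for the old topic. The sketch recomputes the weight directly, which is simpler to read but slower.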


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56475771
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20686/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57433326
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21100/


[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58856594
  
    [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21683/consoleFull) for PR 2388 at commit [`1e2485c`](https://github.com/apache/spark/commit/1e2485c05c77dbca4332b9af616c27c45f2f5e32).
     * This patch merges cleanly.


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57433323
  
    [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21100/consoleFull) for PR 2388 at commit [`e00f5a6`](https://github.com/apache/spark/commit/e00f5a6352a2bbc7d8c5cdd87079f1d55a0b910a).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57590495
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21180/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57626073
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21199/


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57500247
  
    [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21127/consoleFull) for PR 2388 at commit [`accf8bd`](https://github.com/apache/spark/commit/accf8bdff0fd65526c436c22d4e69fd3c8927c89).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `



[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58744928
  
    [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21642/consoleFull) for PR 2388 at commit [`b0734b8`](https://github.com/apache/spark/commit/b0734b86ab95774aec79af55d9de48b363fe243b).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on GraphX

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-62077644
  
    [Test build #512 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/512/consoleFull) for PR 2388 at commit [`fe40445`](https://github.com/apache/spark/commit/fe404451e1d7cf15a336160732a5d2a94c99a877).
     * This patch merges cleanly.

