
Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/09/14 18:57:06 UTC

[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB]LDA based on Graphx

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/2388

    [WIP][SPARK-1405][MLLIB]LDA based on Graphx

    cc @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark graphx_lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2388.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2388
    
----
commit 9860fd1f8dc969f905f1b3d1509214a817789a86
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-09-14T16:55:15Z

    LDA based on Graphx

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by witgo <gi...@git.apache.org>.

GitHub user witgo reopened a pull request:

    https://github.com/apache/spark/pull/2388

    [WIP][SPARK-1405][MLLIB] topic modeling on Graphx

    This PR relies on #2631
    
    - [X] Topic de-duplication
    - [X] Support 100000 topics
    - [X] Asymmetric Dirichlet priors
    - [ ] Add the documentation
    - [X] Add infer interface
    - [X] Add unit tests
    - [X] Add the performance test 
    - [X] Optimizing the infer interface performance
    - [ ] Verifying the correctness of the algorithm
    
    
    The performance test:
    
    `2000` topics:
    
    Item | value
    ------------ | -------------
    The cluster resource | 36 executors (36 cores, 216 GB memory)
    The corpus size | 253064 documents, 29696335 words
    The number of iterations | `105`
    The number of distinct terms | 75496
    The number of topics |  `2000`
    alpha | 0.01
    beta | 0.01
    The running time |  37.1 minutes
    
    `10000` topics:
    
    Item | value
    ------------ | -------------
    The cluster resource | 36 executors (36 cores, 216 GB memory)
    The corpus size | 253064 documents, 29696335 words
    The number of iterations | `105`
    The number of distinct terms | 75496
    The number of topics |  `10000`
    alpha | 0.01
    beta | 0.01
    The running time |  49 minutes
    
    
    `100000` topics:
    
    Item | value
    ------------ | -------------
    The cluster resource | 36 executors (36 cores, 216 GB memory)
    The corpus size | 253064 documents, 29696335 words
    The number of iterations | `105`
    The number of distinct terms | 75496
    The number of topics |  `100000`
    alpha | 0.1
    beta | 0.01
    The running time |  268.9 minutes
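
To compare the three runs above on one scale, the running times can be converted into an average sampling rate. This is only a rough sketch: it assumes each iteration visits every word of the corpus exactly once and ignores start-up and shuffle overhead, and the names `BenchmarkThroughput` and `tokensPerSecond` are made up for illustration. Note also that the `100000`-topic run used a different alpha, so the runs are not strictly like-for-like.

```scala
object BenchmarkThroughput {
  // Corpus size and iteration count copied from the tables above.
  val tokens: Long = 29696335L
  val iterations: Int = 105

  // (number of topics, running time in minutes), copied from the tables above.
  val runs: Seq[(Int, Double)] = Seq((2000, 37.1), (10000, 49.0), (100000, 268.9))

  /** Average tokens processed per second, assuming one visit per token per iteration. */
  def tokensPerSecond(minutes: Double): Double =
    tokens.toDouble * iterations / (minutes * 60.0)

  def main(args: Array[String]): Unit =
    runs.foreach { case (numTopics, minutes) =>
      println(f"$numTopics%6d topics: ${tokensPerSecond(minutes)}%.0f tokens/s")
    }
}
```

Under these assumptions the rate falls as the number of topics grows, which is the trend the running times already suggest.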
    
    conf/spark-defaults.conf:
    ```
    spark.akka.frameSize   20
    spark.executor.instances 36
    spark.rdd.compress true
    spark.executor.memory   6g
    spark.default.parallelism  72
    spark.broadcast.blockSize  8192
    spark.storage.memoryFraction 0.4
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.apache.spark.mllib.feature.TopicModelingKryoRegistrator
    ```
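
For anyone reproducing the benchmark from code rather than from `conf/spark-defaults.conf`, the same settings could be applied programmatically. This is a minimal sketch, assuming Spark is on the classpath; the wrapper object `BenchmarkConf` and its `build` method are hypothetical names, while the keys and values are the ones listed above.

```scala
import org.apache.spark.SparkConf

object BenchmarkConf {
  // Same keys and values as the spark-defaults.conf listing above.
  val settings: Seq[(String, String)] = Seq(
    "spark.akka.frameSize" -> "20",
    "spark.executor.instances" -> "36",
    "spark.rdd.compress" -> "true",
    "spark.executor.memory" -> "6g",
    "spark.default.parallelism" -> "72",
    "spark.broadcast.blockSize" -> "8192",
    "spark.storage.memoryFraction" -> "0.4",
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator" ->
      "org.apache.spark.mllib.feature.TopicModelingKryoRegistrator"
  )

  // loadDefaults = false keeps the result independent of JVM system properties.
  def build(): SparkConf = new SparkConf(false).setAll(settings)
}
```

The registrator class only exists on this PR's branch, so that entry has to be dropped when running against a stock Spark build.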

----
commit ca8e6f296a2f7ed674dd3a5cde49d4301d3d6d14
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-10-08T08:10:12Z

    topic modeling on Graphx

----



[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

Posted by witgo <gi...@git.apache.org>.

Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-72069846
  
    Here is a faster sample branch (work in progress):
    https://github.com/witgo/spark/tree/lda_MH



[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58330999
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21455/consoleFull) for   PR 2388 at commit [`ca8e6f2`](https://github.com/apache/spark/commit/ca8e6f296a2f7ed674dd3a5cde49d4301d3d6d14).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`
      * `  class SparkIMain(`




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56302557
  
    **[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20620/consoleFull)** after     a configured wait of `120m`.



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57588021
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21174/consoleFull) for   PR 2388 at commit [`99945ce`](https://github.com/apache/spark/commit/99945ce52e7559728191226fbc21a2a592591ceb).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57588026
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21174/



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57440229
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21101/consoleFull) for   PR 2388 at commit [`84f51e3`](https://github.com/apache/spark/commit/84f51e3857f6ffae0584100f53ac7e68767ba060).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `




[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by huifeidemaer <gi...@git.apache.org>.

Github user huifeidemaer commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2388#discussion_r18195844
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/TopicModeling.scala ---
    @@ -0,0 +1,818 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.clustering
    +
    +import java.util.Random
    +
    +import breeze.collection.mutable.SparseArray
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, sum => bsum}
    +import com.esotericsoftware.kryo.{Kryo, KryoException}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.Logging
    +import org.apache.spark.graphx._
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.serializer.KryoRegistrator
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.rdd.RDD
    +
    +object TopicModeling {
    +
    +  type DocId = VertexId
    +  type WordId = VertexId
    +  type Count = Int
    +  type VD = (BV[Count], Option[(BV[Double], BV[Double])])
    +  type ED = Array[Count]
    +
    +  def train(docs: RDD[(DocId, SSV)],
    +    numTopics: Int = 2048,
    +    totalIter: Int = 150,
    +    burnInIter: Int = 135,
    +    alpha: Double = 0.1,
    +    beta: Double = 0.01): TopicModel = {
    +    val topicModeling = new TopicModeling(docs, numTopics, alpha, beta)
    +    val numTerms = topicModeling.numTerms
    +    val topicModel = TopicModel(numTopics, numTerms, alpha, beta)
    +    topicModeling.runGibbsSampling(topicModel, totalIter, burnInIter)
    +    topicModel
    +  }
    +
    +  private[mllib] def merge(a: BV[Count], b: BV[Count]): BV[Count] = {
    +    assert(a.size == b.size)
    +    a :+ b
    +  }
    +
    +  private[mllib] def update(a: BV[Count], t: Int, inc: Int): BV[Count] = {
    +    a(t) += inc
    +    a
    +  }
    +
    +  private[mllib] def zeros(numTopics: Int, isDense: Boolean = false): BV[Count] = {
    +    if (isDense) {
    +      BDV.zeros(numTopics)
    +    }
    +    else {
    +      BSV.zeros(numTopics)
    +    }
    +  }
    +
    +  private[mllib] def collectTermTopicDist(graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): Graph[VD, ED] = {
    +    graph.mapVertices[VD]((vertexId, counter) => {
    +      if (vertexId >= 0) {
    +        val termTopicCounter = counter._1
    +        val w = BSV.zeros[Double](numTopics)
    +        val w1 = BSV.zeros[Double](numTopics)
    +        var wi = 0D
    +
    +        termTopicCounter.activeIterator.foreach { case (i, v) =>
    +          var adjustment = 0D
    +          w(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta))
    +
    +          adjustment = -1D
    +          w1(i) = v * ((totalTopicCounter(i) * (alpha * numTopics)) +
    +            (alpha * numTopics) * (adjustment + alpha) +
    +            adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +            (totalTopicCounter(i) + (numTerms * beta)) - w(i)
    +
    +          wi = w(i) + wi
    +          w(i) = wi
    +        }
    +
    +        w(numTopics - 1) = wi
    +        (termTopicCounter, Some(w, w1))
    +      }
    +      else {
    +        counter
    +      }
    +    })
    +  }
    +
    +  @inline private[mllib] def collectDocTopicDist(
    +    totalTopicCounter: BV[Count],
    +    termTopicCounter: BV[Count],
    +    docTopicCounter: BV[Count],
    +    d: BDV[Double],
    +    d1: BDV[Double],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var di = 0D
    +    docTopicCounter.activeIterator.foreach { case (i, v) =>
    +
    +      var adjustment = 0D
    +      d(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta)
    +
    +      adjustment = -1D
    +      d1(i) = v * (termTopicCounter(i) * (sumTerms - 1 + alpha * numTopics) +
    +        (adjustment + beta) * (sumTerms - 1 + alpha * numTopics)) /
    +        (totalTopicCounter(i) + adjustment + numTerms * beta) - d(i)
    +
    +      di = d(i) + di
    +      d(i) = di
    +    }
    +
    +    d(numTopics - 1) = di
    +
    +    (d, d1)
    +  }
    +
    +  private[mllib] def collectGlobalTopicDist(totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double): (BV[Double], BV[Double]) = {
    +    assert(totalTopicCounter.size == numTopics)
    +    var i = 0
    +    val t = BDV.zeros[Double](numTopics)
    +    val t1 = BDV.zeros[Double](numTopics)
    +    var ti = 0D
    +
    +    while (i < numTopics) {
    +      var adjustment = 0D
    +      t(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta))
    +
    +      adjustment = -1D
    +      t1(i) = (adjustment + beta) * (totalTopicCounter(i) * (alpha * numTopics) +
    +        alpha * numTopics * (adjustment + alpha) +
    +        adjustment * (sumTerms - 1 + (alpha * numTopics))) /
    +        (totalTopicCounter(i) + (adjustment + numTerms * beta)) - t(i)
    +
    +      ti = t(i) + ti
    +      t(i) = ti
    +
    +      i += 1
    +    }
    +    (t, t1)
    +  }
    +
    +  private[mllib] def sampleTopics(
    +    graph: Graph[VD, ED],
    +    totalTopicCounter: BV[Count],
    +    sumTerms: Long,
    +    innerIter: Long,
    +    numTerms: Int,
    +    numTopics: Int,
    +    alpha: Double,
    +    beta: Double
    +  ): Graph[VD, ED] = {
    +    val parts = graph.edges.partitions.size
    +    val (t, t1) = TopicModeling.collectGlobalTopicDist(totalTopicCounter, sumTerms, numTerms,
    +      numTopics, alpha, beta)
    +    val sampleTopics = (gen: java.util.Random, d: BDV[Double], d1: BDV[Double],
    +    triplet: EdgeTriplet[VD, ED]) => {
    +      assert(triplet.srcId >= 0)
    +      val (termCounter, Some((w, w1))) = triplet.srcAttr
    +      val (docTopicCounter, _) = triplet.dstAttr
    +      TopicModeling.collectDocTopicDist(totalTopicCounter, termCounter,
    +        docTopicCounter, d, d1, sumTerms, numTerms, numTopics, alpha, beta)
    +
    +      val topics = triplet.attr
    +      var i = 0
    +      while (i < topics.length) {
    +        val oldTopic = topics(i)
    +        val newTopic = TopicModeling.multinomialDistSampler(gen, d, w, t, d1(oldTopic),
    +          w1(oldTopic), t1(oldTopic), oldTopic)
    +        topics(i) = newTopic
    +        i += 1
    +      }
    +      topics
    +    }
    +
    +    graph.mapTriplets {
    +      (pid, iter) =>
    +        val gen = new java.util.Random(parts * pid + innerIter)
    +        val d = BDV.zeros[Double](numTopics)
    +        val d1 = BDV.zeros[Double](numTopics)
    +        iter.map {
    +          token =>
    +            sampleTopics(gen, d, d1, token)
    +        }
    +    }
    +  }
    +
    +  private[mllib] def updateCounter(graph: Graph[VD, ED], numTopics: Int): Graph[VD, ED] = {
    +    val newCounter = graph.mapReduceTriplets[BV[Int]](e => {
    +      val docId = e.dstId
    +      val wordId = e.srcId
    +      val newTopics = e.attr
    +      val vector = zeros(numTopics)
    +      var i = 0
    +      while (i < newTopics.length) {
    +        val newTopic = newTopics(i)
    +        vector(newTopic) += 1
    +        i += 1
    +      }
    +      Iterator((docId, vector), (wordId, vector))
    +
    +    }, merge)
    +    graph.joinVertices(newCounter)((_, _, n) => (n, None))
    +  }
    +
    +  private[mllib] def collectGlobalCounter(graph: Graph[VD, ED],
    +    numTopics: Int): BV[Count] = {
    +    graph.vertices.filter(t => t._1 >= 0).map(_._2._1)
    +      .aggregate(zeros(numTopics, isDense = true))(merge, merge)
    +  }
    +
    +  /**
    +   * A multinomial distribution sampler, using roulette method to sample an Int back.
    +   */
    +  @inline private[mllib] def multinomialDistSampler(rand: Random, d: BV[Double], w: BV[Double],
    +    t: BV[Double], d1: Double, w1: Double, t1: Double, currentTopic: Int): Int = {
    +    /**
    +     * Asymmetric Dirichlet Priors you can refer to the paper:
    +     * "Rethinking LDA: Why Priors Matter", available at
    +     * [[http://people.ee.duke.edu/~lcarin/Eric3.5.2010.pdf]]
    +     *
    +     * var topicThisTerm = BDV.zeros[Double](numTopics)
    +     * while (i < numTopics) {
    --- End diff --
    
    `adjustment` indicates that, if this is the current topic, you need to subtract 1 from the corresponding terms.
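
The role of `adjustment` can be illustrated with a much smaller sketch. This is not the formula from the diff: it assumes the textbook collapsed Gibbs conditional with symmetric priors, (docCount + alpha) * (termCount + beta) / (topicTotal + numTerms * beta), and the names `GibbsAdjustment` and `topicWeights` are hypothetical. The point it shows is that the token being resampled is removed from all three counts of its current topic, and from no other topic.

```scala
object GibbsAdjustment {
  /** Unnormalized sampling weight of every topic for one token (symmetric priors). */
  def topicWeights(
      docTopic: Array[Int],   // topic counts for the token's document
      termTopic: Array[Int],  // topic counts for the token's term
      totalTopic: Array[Int], // corpus-wide topic counts
      currentTopic: Int,      // topic currently assigned to the token
      numTerms: Int,
      alpha: Double,
      beta: Double): Array[Double] =
    Array.tabulate(docTopic.length) { k =>
      // Exclude the token itself: subtract 1 only for its current topic.
      val adjustment = if (k == currentTopic) -1 else 0
      (docTopic(k) + adjustment + alpha) *
        (termTopic(k) + adjustment + beta) /
        (totalTopic(k) + adjustment + numTerms * beta)
    }
}
```

Precomputing the weights with `adjustment = 0` and storing the difference for `adjustment = -1` separately, as the diff does with `d`/`d1`, `w`/`w1` and `t`/`t1`, avoids recomputing every topic's weight for each token.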



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57260813
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21015/



[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on GraphX

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-62077399

@witgo Thanks for the PR! This looks like a very featureful implementation, but I think it will require some refactoring to fit in well with future development. I'll give some high-level comments for now, and can perhaps do a lower-level pass later on.

**APIs**

I suspect we'll have other types of topic modeling in the future, not just LDA. It would be great to think ahead for that. The simplest way is probably to rename everything as "LDA", not "topic modeling," and to minimize the public API. (Other topic models we might want later are LSA, PLSA, HDP, CTM, etc.)

This should probably go under "clustering" instead of "feature."

**Code organization**

Some of the code is more general than LDA and could go elsewhere in MLlib. E.g., some of the sampling methods could go in stat/. The same applies to minMaxIndexSearch, minMaxValueSearch, etc. (or can those be replaced using existing generic methods in Scala or Java?).
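
As one possible illustration of the "existing generic methods" idea, a lookup in a cumulative-weight array can be written with `java.util.Arrays.binarySearch`. This is a sketch under an assumption: the search helpers are not shown in this thread, so it is only a guess that they locate a position in a sorted cumulative array. `CumulativeSearch` and `sampleIndex` are hypothetical names.

```scala
import java.util.Arrays

object CumulativeSearch {
  /**
   * Index of the first entry of the non-decreasing array `cdf` that is >= u,
   * clamped to the last index when u exceeds every entry.
   */
  def sampleIndex(cdf: Array[Double], u: Double): Int = {
    val pos = Arrays.binarySearch(cdf, u)
    // A negative result encodes the insertion point as -(insertionPoint) - 1.
    val idx = if (pos >= 0) pos else -(pos + 1)
    math.min(idx, cdf.length - 1)
  }
}
```

When `cdf` contains repeated values (topics with zero weight) and `u` is exactly equal to one of them, `binarySearch` does not specify which matching index is returned, so a sampler that must avoid zero-weight topics would need an extra step.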

**Documentation and code clarity**

The main obstacle to reviewing this is the lack of documentation and the difficulty of understanding what each value and method does. For documentation, it will be helpful to see comments for all classes and methods, and also inline comments explaining code where needed. For code clarity, using more descriptive variable and method names will help a lot.

**Other thoughts**

It would be nice to remove some experimental items (such as mergeDuplicateTopic) for now.

Thanks again!


[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-56504704
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20700/consoleFull) for   PR 2388 at commit [`dfc83fe`](https://github.com/apache/spark/commit/dfc83feb1546dfc3ed1be615a28ebef60e145cb5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`




[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-58862848
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21683/



[GitHub] spark pull request: [WIP][SPARK-1405][MLLIB] topic modeling on Gra...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2388#issuecomment-57210710
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20979/consoleFull) for   PR 2388 at commit [`13d2996`](https://github.com/apache/spark/commit/13d29968b6f732ba25aa5c2c5fdde8cf5eda86f1).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class TopicModeling(@transient val docs: RDD[(TopicModeling.DocId, SSV)],`
      * `class TopicModelingKryoRegistrator extends KryoRegistrator `


