Posted to reviews@spark.apache.org by EntilZha <gi...@git.apache.org> on 2015/02/27 07:00:14 UTC

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

GitHub user EntilZha opened a pull request:

    https://github.com/apache/spark/pull/4807

    [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA algorithms (EM+Gibbs)

    JIRA: https://issues.apache.org/jira/browse/SPARK-5556
    
    As discussed in that issue, it would be great to have multiple LDA algorithm options, principally EM (implemented already in #4047) and Gibbs.
    
    ## Goals of PR:
    1. Refactor LDA to allow multiple algorithm options (done)
    2. Refactor Gibbs code here to this interface (mostly done): https://github.com/EntilZha/spark/tree/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling
    3. Run the same performance tests used for the EM PR for comparison (todo; initial smaller tests have been run)
    
    At the moment, I am looking for feedback on the refactoring while working on putting the Gibbs code in.
    
    ## Summary of Changes:
    These traits were created to encapsulate everything about the implementation while interfacing with the entry point ```LDA.run``` and ```DistributedLDAModel```.
    ```scala
    private[clustering] trait LearningState {
        def next(): LearningState
        def topicsMatrix: Matrix
        def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
        def logLikelihood: Double
        def logPrior: Double
        def topicDistributions: RDD[(Long, Vector)]
        def globalTopicTotals: LDA.TopicCounts
        def k: Int
        def vocabSize: Int
        def docConcentration: Double
        def topicConcentration: Double
        def deleteAllCheckpoints(): Unit
      }
    
      trait LearningStateInitializer {
        def initialState(
          docs: RDD[(Long, Vector)],
          k: Int,
          docConcentration: Double,
          topicConcentration: Double,
          randomSeed: Long,
          checkpointInterval: Int): LearningState
      }
    ```
    
    The entirety of an LDA implementation can be captured by an object and a class which extend these traits. Specifically, ```LearningStateInitializer``` provides the method that returns the ```LearningState```, which maintains state.
    
    Lastly, the algorithm can be set via an enum which is pattern matched to instantiate the correct implementation. My thought is that the default algorithm should be whichever performs better.
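    A minimal sketch of that dispatch (the initializer objects and `initializerFor` are illustrative stand-ins, not this PR's code):
    ```scala
    // Hypothetical sketch: an Enumeration of algorithms, pattern matched to
    // select a matching LearningStateInitializer stand-in.
    object LearningAlgorithms extends Enumeration {
      type Algorithm = Value
      val Gibbs, EM = Value
    }

    trait LearningStateInitializer { def name: String }
    object EMStateInitializer extends LearningStateInitializer { val name = "em" }
    object GibbsStateInitializer extends LearningStateInitializer { val name = "gibbs" }

    def initializerFor(algorithm: LearningAlgorithms.Algorithm): LearningStateInitializer =
      algorithm match {
        case LearningAlgorithms.EM    => EMStateInitializer
        case LearningAlgorithms.Gibbs => GibbsStateInitializer
      }
    ```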
    
    ## Gibbs Implementation
    Old design doc is here:
    Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
    Implements FastLDA from here: http://www.ics.uci.edu/~newman/pubs/fastlda.pdf
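    For context, the per-token update of collapsed Gibbs sampling for LDA (the standard conditional from the literature, not this PR's GraphX/FastLDA implementation) can be sketched as:
    ```scala
    import scala.util.Random

    // Standard collapsed Gibbs conditional for one token: p(z = t | rest) is
    // proportional to (n_{d,t} + alpha) * (n_{w,t} + beta) / (n_t + V * beta),
    // where all counts exclude the token being resampled.
    def sampleTopic(
        docTopicCounts: Array[Int],   // n_{d,t}: tokens in this doc assigned to topic t
        termTopicCounts: Array[Int],  // n_{w,t}: occurrences of this term assigned to t
        topicTotals: Array[Int],      // n_t: total tokens assigned to topic t
        vocabSize: Int,
        alpha: Double,
        beta: Double,
        rng: Random): Int = {
      val k = docTopicCounts.length
      val weights = Array.tabulate(k) { t =>
        (docTopicCounts(t) + alpha) *
          (termTopicCounts(t) + beta) / (topicTotals(t) + vocabSize * beta)
      }
      // Draw from the unnormalized discrete distribution.
      var u = rng.nextDouble() * weights.sum
      var t = 0
      while (t < k - 1 && u > weights(t)) { u -= weights(t); t += 1 }
      t
    }
    ```
    FastLDA's contribution is to avoid computing all `k` weights for every token by bounding the normalizer, but the distribution being sampled is the same.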
    
    ### Specific Points for Feedback
    1. Naming is hard, and I'm not sure if the traits are named appropriately.
    2. Similarly, I am reasonably familiar with the Scala type system, but perhaps there are some ninja tricks I don't know that would be helpful.
    3. General interface/cleanliness.
    4. Should the LearningState and related traits go within LDA? I think so, but thoughts?
    5. Anything else; I'm also learning here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/EntilZha/spark LDA-pull-request

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4807.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4807
    
----
commit 4bd110385f36927990a980085d727f12fc90fd53
Author: Pedro Rodriguez <sk...@gmail.com>
Date:   2015-02-26T07:39:47Z

    updates to LDA and LDAModel to separate implementation from interface for LDA algorithms

commit 34d58532b278f95aaab9c250f4a2ed64ea959811
Author: Pedro Rodriguez <sk...@gmail.com>
Date:   2015-02-27T05:35:51Z

    refactored tests to make tests succeed

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-93856990
  
    @EntilZha  @hhbyyh  Ping.  Please let me know if there are items I can help with!
    
    Also, I thought of one more issue, which is that we'll have to make sure that the trait Optimizer API is Java-friendly.  I think it can be, but we'll have to verify.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76346072
  
    add to whitelist




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26186996
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    This is not necessary, right? Should be removed from the `LearningState` class.





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76346401
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28051/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26189615
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    How about  ```def topicsMatrix: Matrix``` =>  ``` def  termTopicDistributions: RDD[(Long, Vector)]```?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26187620
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    Why is it not necessary? The LDASuite, which covers the Distributed/Local models, calls it. How they are created is up to the specific implementation of LDA. Could you be more specific about why it's not necessary?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-78486179
  
     @jkbradley @mengxr Do you have time to take a look at this?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-97214197
  
    Commenting here and then on the ticket. If there is interest in using the Gibbs implementation I wrote for the next release, building on the interface/refactor from that PR, I am open to that.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26188892
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    The reason for a separate method is twofold. First, although you could calculate it from `topicsMatrix` in theory, `topicsMatrix` could be very large (too large to fit in driver memory, as the docs warn). `describeTopics` is intended to provide an interface for the implementation to return a topics matrix bounded to only the top `maxTermsPerTopic` terms per topic. That makes it less likely to run the driver out of memory and keeps the computation of the top terms distributed.
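    The idea can be sketched with plain collections standing in for the distributed data (a hypothetical helper, not the PR's code):
    ```scala
    // Sketch: reduce a (term -> per-topic weight) table to the top
    // maxTermsPerTopic terms for each topic, mirroring describeTopics's
    // Array[(Array[Int], Array[Double])] return shape. A Seq stands in for
    // the RDD; in Spark the per-topic top-k would be computed distributed,
    // so only k * maxTermsPerTopic entries ever reach the driver.
    def describeTopicsSketch(
        termTopicWeights: Seq[(Int, Array[Double])],
        k: Int,
        maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] =
      Array.tabulate(k) { topic =>
        val top = termTopicWeights
          .map { case (term, weights) => (term, weights(topic)) }
          .sortBy(-_._2)
          .take(maxTermsPerTopic)
        (top.map(_._1).toArray, top.map(_._2).toArray)
      }
    ```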




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4807




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-96876898
  
    @EntilZha  I think this PR doesn't need to be updated now that [https://github.com/apache/spark/pull/5661] has been merged (for JIRA [https://issues.apache.org/jira/browse/SPARK-7090]).  Thank you though for this initial PR and discussion!  Could you please close this PR?  It will still be great if we can get Gibbs sampling in for the next release cycle.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-97232365
  
    Definitely interest; let's coordinate on the JIRA and a new PR, especially with @witgo 




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76346087
  
    ok to test




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-89117599
  
    @jkbradley Thanks for the proposal; it looks reasonable.
    
    Just two minor things that are not clear:
    1. Should different algorithms have different entry points in LDA, like runGibbs, runOnline, and runEM? I kind of like that.
    2. Online LDA has several algorithm-specific arguments. What's the recommended place to put them and their getters/setters: in LDA or in the optimizer?
    
    I'm good with the other parts. Thanks.





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-88653790
  
    I apologize for the lack of response!  I'm going to try to make a pass ASAP




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76354546
  
      [Test build #28052 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28052/consoleFull) for   PR 4807 at commit [`e895f38`](https://github.com/apache/spark/commit/e895f38bf0a6c40265a169f0e7a7b5a0454bcbaf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26189764
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    Probably @jkbradley can weigh in here. Both changes seem reasonable; the Matrix could then be computed from the RDD. If there is agreement, I can make the change on the PR.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-78090060
  
    As is, it can be merged (as far as the refactoring work goes). I had actually considered having separate PRs for the refactor and Gibbs anyway.
    On Mar 10, 2015 9:22 AM, "Guoqiang Li" <no...@github.com> wrote:
    
    > @EntilZha <https://github.com/EntilZha> @mengxr
    > <https://github.com/mengxr> This branch can be merged into master?
    > I want to merge the PR into LightLDA
    > <https://github.com/witgo/spark/tree/LightLDA> and lda_Gibbs
    > <https://github.com/witgo/spark/tree/lda_Gibbs>.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/4807#issuecomment-78089533>.
    >





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76354552
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28052/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-94041899
  
    @hhbyyh  I agree with points 1-4.
    
    One clarification:
    {quote}
    A few questions:
    Based on previous discussion, users can specify algorithms by
    
    1. passing String parameter to LDA.run. (hard to specify parameter)
    2. or through setOptimizer of LDA. (a little tricky to set default Optimizer)
    {quote}
    Not quite.  This is what I had in mind for LDA optimizer parameter setting:
    ```
    def setOptimizer(value: Optimizer): LDA = ???  // set via Optimizer, constructed beforehand
    def setOptimizer(value: String): LDA = ???  // set with optimizer name ("em"), using default optimizer parameters
    ```
    I did not intend for us to pass extra parameters to the run() method.
    
    I'll respond to the Online LDA-specific items in the other PR: [https://github.com/apache/spark/pull/4419]
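    A hedged sketch of how those two overloads could fit together (the class and optimizer names here are illustrative, not merged API):
    ```scala
    // Illustrative sketch of the two setter overloads: one taking a
    // pre-constructed Optimizer, one taking a name and using defaults.
    trait Optimizer { def name: String }
    class EMOptimizer extends Optimizer { val name = "em" }
    class GibbsOptimizer extends Optimizer { val name = "gibbs" }

    class LDASketch {
      private var optimizer: Optimizer = new EMOptimizer  // hypothetical default

      def setOptimizer(value: Optimizer): LDASketch = { this.optimizer = value; this }

      // String form selects a named optimizer with default parameters.
      def setOptimizer(value: String): LDASketch = value.toLowerCase match {
        case "em"    => setOptimizer(new EMOptimizer)
        case "gibbs" => setOptimizer(new GibbsOptimizer)
        case other   => throw new IllegalArgumentException(s"Unknown optimizer: $other")
      }

      def getOptimizer: Optimizer = optimizer
    }
    ```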





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26188380
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    --- End diff --
    
    Different implementations can be hidden by `topicsMatrix`, right?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-93962573
  
    Hi @jkbradley .
    I've finished the correctness test against Blei's implementation, more details are in  [#4419](https://github.com/apache/spark/pull/4419).
    
    I tried to refactor LDA as proposed and found it will be a big change (unit tests and examples included).
    First, let's confirm the changes:
    1. All Optimizers will be public, as users would be able to specify parameters.
    2. The return value of `LDA.run` will become LDAModel (now it's DistributedLDAModel).
    3. Users can first create an optimizer and then pass it to `LDA.setOptimizer`.
    4. All the parameters specific to one algorithm should go into the optimizer.
    
    A few questions:
    Based on previous discussion, users can specify the algorithm by
      1. passing a String parameter to `LDA.run` (which makes it hard to specify algorithm-specific parameters), or
      2. through `setOptimizer` of LDA.
    I think it's better to leave only one place for determining which algorithm to use, since otherwise it can be confusing and conflicting.
    
    Another question is about the existing parameters in LDA:
    Except for K, all other parameters (Alpha, Beta, Maxiteration, seed, checkPointInterval) are unused or have different default values for Online LDA, and for now I didn't use them in OnlineLDA.
    
    Actually, I find LDA and OnlineLDA share very few things, and it's kind of difficult to merge them together. Maybe for OnlineLDA, I can merge only an optimizer and some example code. (Later I'll probably provide an interface for streams.)
    


[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-78189159
  
    @EntilZha  thx.
    @mengxr what do you think?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76346339
  
      [Test build #28051 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28051/consoleFull) for   PR 4807 at commit [`34d5853`](https://github.com/apache/spark/commit/34d58532b278f95aaab9c250f4a2ed64ea959811).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26187624
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    +    def logLikelihood: Double
    +    def logPrior: Double
    +    def topicDistributions: RDD[(Long, Vector)]
    +    def globalTopicTotals: LDA.TopicCounts
    --- End diff --
    
    Ditto comment from above.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-89189465
  
    Just double checking: your suggestion would be to rebase from master, implement those general changes, and then commit/push the modified branch?
    
    The primary reason I did it this way was to refactor/abstract along the method boundaries that exist right now, but as you noted it does mix model and algorithm. I like your approach of extending the abstract class with a trait. I haven't taken much time to work on it, but I could do that over the next couple of days. I also plan on being at Databricks on Wednesday for the training if you want to chat then.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26188379
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    +    def logLikelihood: Double
    +    def logPrior: Double
    +    def topicDistributions: RDD[(Long, Vector)]
    +    def globalTopicTotals: LDA.TopicCounts
    --- End diff --
    
    Different implementations can be hidden by `topicDistributions`.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-89964339
  
    Thanks for the reply. The ideas look good to me. I'll go ahead with the correctness verification.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-88657523
  
    Sounds good. I think it's reasonable that this PR includes only the refactoring, not Gibbs. Then we can evaluate LightLDA vs. FastLDA and choose which one makes sense. If the changes look good, I will go ahead and make the changes proposed by @witgo




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-88991763
  
    @EntilZha I made a quick pass to get a sense of the structure.  My main comment is that these changes seem to mix the concepts of algorithm and model together.  I think that's why @witgo was confused about putting describeTopics within the learning state; that confused me a lot too.  You are right that we need some changes to the code to support other algorithms, but I'd like to keep the separation between algorithms and models.  Here's what I'd propose:
    
    Models:
    * abstract class LDAModel
    * class LocalLDAModel extends LDAModel
      * *This should be the de facto local LDAModel.  If algorithms store extra info with a local model, they can extend this class.*
    * trait WithTrainingResults
  * *This trait indicates that a model stores results for the whole training corpus.  This is of course important to store, since recomputing it will not necessarily give the same results.  But it will also be useful to allow users to throw out this extra info if they don't want to store it.*
    * class EMDistributedLDAModel extends LDAModel with WithTrainingResults
      * This (and other extensions of LDAModel) can be converted to a LocalLDAModel.
    * class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults
    * *Notes*
      * I'm not listing a DistributedLDAModel trait since I don't think it would share anything across different algorithms' models.
    
    For algorithms, I like keeping everything in class LDA.  That can make calls out to Optimizers (1 for each algorithm), and the Optimizers can return an instance of LDAModel.  I'd eliminate LearningState and instead put everything in Optimizer, including initialization.  I don't see a need for multiple classes since they belong under the same concept.
    
    I'm sorry I took so long to respond, but I'll stay up to date from here on out. I hadn't realized quite so much was blocking on this PR.
    
    Also, I know I'm suggesting significant changes to the PR, but it should actually require fewer changes to master and still allow multiple algorithms.
    
    CC: @hhbyyh since you have investment in this via [https://github.com/apache/spark/pull/4419].  I believe OnlineLDA could fit under these abstractions.
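
    The hierarchy proposed above could be sketched roughly as follows. This is a hypothetical illustration only: the class names come from the comment, but the members and the stubbed-out bodies are assumptions (Spark types such as `Matrix` and `RDD` are omitted so the sketch stays self-contained):

    ```scala
    // Sketch of the proposed model hierarchy; members are illustrative stubs.
    abstract class LDAModel {
      def k: Int
      def vocabSize: Int
    }

    // The de facto local model; algorithms storing extra local info can extend it.
    class LocalLDAModel(val k: Int, val vocabSize: Int) extends LDAModel

    // Marker for models that keep results computed on the training corpus.
    trait WithTrainingResults {
      def logLikelihood: Double
      // Throwing out the training results yields a plain local model.
      def toLocal: LocalLDAModel
    }

    class EMDistributedLDAModel(val k: Int, val vocabSize: Int, val logLikelihood: Double)
      extends LDAModel with WithTrainingResults {
      def toLocal: LocalLDAModel = new LocalLDAModel(k, vocabSize)
    }
    ```

    Under this sketch, each distributed model can be converted to a `LocalLDAModel` via `toLocal`, matching the note that extensions of LDAModel should be convertible to the local form.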




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by EntilZha <gi...@git.apache.org>.
Github user EntilZha commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26189097
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    +    def logLikelihood: Double
    +    def logPrior: Double
    +    def topicDistributions: RDD[(Long, Vector)]
    +    def globalTopicTotals: LDA.TopicCounts
    --- End diff --
    
    I think I agree here. It would require some code/helper method on DistributedLDAModel since that is the only place that calls `globalTopicTotals`. I can make the change tonight/tomorrow.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-78089533
  
    @EntilZha @mengxr  Can this branch be merged into master?
    I want to merge this PR into [LightLDA](https://github.com/witgo/spark/tree/LightLDA) and [lda_Gibbs](https://github.com/witgo/spark/tree/lda_Gibbs).




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76348368
  
      [Test build #28052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28052/consoleFull) for   PR 4807 at commit [`e895f38`](https://github.com/apache/spark/commit/e895f38bf0a6c40265a169f0e7a7b5a0454bcbaf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76346397
  
      [Test build #28051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28051/consoleFull) for   PR 4807 at commit [`34d5853`](https://github.com/apache/spark/commit/34d58532b278f95aaab9c250f4a2ed64ea959811).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  trait LearningStateInitializer `
      * `    class EMLearningState(optimizer: EMOptimizer) extends LearningState `





[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-76343507
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-89374472
  
    Here's a proposal.  Let me know what you think!
    
    @hhbyyh 
    
    > 1. Should different algorithms have different entrance in LDA, like runGibbs, runOnline, runEM? I kinda like it as the separation looks simple and clear.
    
    Multiple run methods do make that separation clearer, but they also force beginner users (who don't know what these algorithms are) to choose an algorithm before they can try LDA.  I'd prefer to keep a single run() method and specify the algorithm as a String parameter.
    
    One con of a single run() method is that users will get back an LDAModel which they will need to cast to a more specific type (if they want to use the specialization's extra functionality).  I think we could eliminate this issue later on by opening up each algorithm as its own Estimator (so that LDA would become a meta-Estimator, if you will).
    
    > 2. Online LDA have several specific arguments. What's the recommended place to put them and their getter/setter, in LDA or optimizer ?
    
    That is an issue, for sure.  I'd propose:
    ```
    trait Optimizer // no public API
    class EMOptimizer extends Optimizer {
      // public API: getters/setters for EM-specific parameters
      // private[mllib] API: methods for learning
    }
    class LDA {
      def setOptimizer(optimizer: String) // takes "EM" / "Gibbs" / "online"
      def setOptimizer(optimizer: Optimizer) // takes Optimizer instance which user can configure beforehand
      def getOptimizer: Optimizer
    }
    ```
    For users, Optimizer classes simply store algorithm-specific parameters.  Users can use the default Optimizer, or they can specify the optimizer via String (with default algorithm parameters) or via Optimizer (with configured algorithm parameters).
    
    @EntilZha It might be easiest to revert to master (to make diffs easier), but you can decide.  That would be great if you have time to work on it in the next couple of days, thanks.  I'll be out of town (but online) Wednesday unfortunately, but I hope it goes well!
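
    The two-overload API above could be fleshed out roughly like this. Everything beyond the names in the proposal (`Optimizer`, `EMOptimizer`, `setOptimizer`, `getOptimizer`) is an illustrative assumption, not Spark's actual implementation; the parameter on `EMOptimizer` is a stand-in for the algorithm-specific getters/setters:

    ```scala
    // Sketch of the proposed optimizer-selection API; bodies are stubs.
    trait Optimizer
    class EMOptimizer extends Optimizer {
      private var maxIterations = 20
      def setMaxIterations(n: Int): this.type = { maxIterations = n; this }
      def getMaxIterations: Int = maxIterations
    }
    class GibbsOptimizer extends Optimizer
    class OnlineOptimizer extends Optimizer

    class LDA {
      private var optimizer: Optimizer = new EMOptimizer
      // Pick an algorithm by name, with its default parameters...
      def setOptimizer(name: String): this.type = {
        optimizer = name.toLowerCase match {
          case "em"     => new EMOptimizer
          case "gibbs"  => new GibbsOptimizer
          case "online" => new OnlineOptimizer
          case other    => throw new IllegalArgumentException(s"Unknown optimizer: $other")
        }
        this
      }
      // ...or pass a pre-configured Optimizer instance.
      def setOptimizer(opt: Optimizer): this.type = { optimizer = opt; this }
      def getOptimizer: Optimizer = optimizer
    }
    ```

    For example, a user who wants non-default parameters would configure the optimizer first: `new LDA().setOptimizer(new EMOptimizer().setMaxIterations(50))`, while a beginner could just call `new LDA().setOptimizer("em")`.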




[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4807#discussion_r26187041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
    @@ -311,165 +319,319 @@ private[clustering] object LDA {
     
       private[clustering] type TokenCount = Double
     
    -  /** Term vertex IDs are {-1, -2, ..., -vocabSize} */
    -  private[clustering] def term2index(term: Int): Long = -(1 + term.toLong)
     
    -  private[clustering] def index2term(termIndex: Long): Int = -(1 + termIndex).toInt
     
    -  private[clustering] def isDocumentVertex(v: (VertexId, _)): Boolean = v._1 >= 0
    +  object LearningAlgorithms extends Enumeration {
    +    type Algorithm = Value
    +    val Gibbs, EM = Value
    +  }
     
    -  private[clustering] def isTermVertex(v: (VertexId, _)): Boolean = v._1 < 0
    +  private[clustering] trait LearningState {
    +    def next(): LearningState
    +    def topicsMatrix: Matrix
    +    def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
    +    def logLikelihood: Double
    +    def logPrior: Double
    +    def topicDistributions: RDD[(Long, Vector)]
    +    def globalTopicTotals: LDA.TopicCounts
    --- End diff --
    
    This is not necessary here.

