Posted to reviews@spark.apache.org by akopich <gi...@git.apache.org> on 2017/10/24 13:53:45 UTC

[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

GitHub user akopich opened a pull request:

    https://github.com/apache/spark/pull/19565

    [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter out empty documents beforehand

    ## What changes were proposed in this pull request?
    
    Empty documents are filtered out in the `initialize` method and are never included in mini-batches. `batchSize` and `nonEmptyDocsN` now refer to the same quantity.
    
    ## How was this patch tested?
    
    Existing unit-tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/akopich/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19565
    
----
commit 721f235934f26e75172d39f0398365606616267f
Author: Valeriy Avanesov <av...@wias-berlin.de>
Date:   2017-10-24T14:08:39Z

    [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter out empty documents beforehand
    
    ## What changes were proposed in this pull request?
    
    Empty documents are filtered out in the `initialize` method and are never included in mini-batches. `batchSize` and `nonEmptyDocsN` now refer to the same quantity.
    
    ## How was this patch tested?
    
    Existing unit-tests.

----


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit [`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh, in the case of "filter before sample", the overhead is negligible in a local test.
    
    Regarding "sample before filter", you are right: strictly speaking, `miniBatchFraction` should be adjusted, which is why I prefer "filter before sample".
    
    Also note that the "sample before filter" version is logically equivalent to the current upstream/master.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @WeichenXu123, in the case of a large dataset this "adjustment" would have an infinitesimal effect. (IMO, no adjustment is needed -- the expected number of non-empty docs is the same regardless of the order of filter and sample, and equals `docs.size * miniBatchFraction * fractionOfNonEmptyDocs`.)
    
    So I believe we all agree that sampling should go before filtering. I'll send a commit shortly.
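    To spell that expectation out (notation mine: N = `docs.size`, p = `miniBatchFraction`, q = `fractionOfNonEmptyDocs`):
    
    ```
    \mathbb{E}\bigl[\#\{\text{non-empty docs in batch}\}\bigr]
      = \underbrace{(N q)\, p}_{\text{filter, then sample}}
      = \underbrace{(N p)\, q}_{\text{sample, then filter}}
      = N p q
    ```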


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83040/
    Test PASSed.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146812501
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
    --- End diff --
    
    Actually, the block from `mapPartition` will be simplified, since it will no longer need to process collections of documents. 


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83015/
    Test PASSed.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146569516
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
     
         expElogbetaBc.destroy(false)
     
    -    if (nonEmptyDocsN == 0) {
    +    if (batchSize == 0) {
           logWarning("No non-empty documents were submitted in the batch.")
           // Therefore, there is no need to update any of the model parameters
           return this
         }
     
         val batchResult = statsSum *:* expElogbeta.t
    -    // Note that this is an optimization to avoid batch.count
    -    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
         updateLambda(batchResult, batchSize)
     
    -    logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
    -    logphatOption.foreach(updateAlpha(_, nonEmptyDocsN))
    +    logphatOption.foreach(_ /= batchSize.toDouble)
    +    logphatOption.foreach(updateAlpha(_, batchSize))
     
         this
       }
     
       /**
        * Update lambda based on the batch submitted. batchSize can be different for each iteration.
        */
    -  private def updateLambda(stat: BDM[Double], batchSize: Int): Unit = {
    +  private def updateLambda(stat: BDM[Double], batchSize: Double): Unit = {
    --- End diff --
    
    Don't think this should change here


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit [`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit [`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh who shall we ping? 


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146572407
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
     
         expElogbetaBc.destroy(false)
     
    -    if (nonEmptyDocsN == 0) {
    +    if (batchSize == 0) {
           logWarning("No non-empty documents were submitted in the batch.")
           // Therefore, there is no need to update any of the model parameters
           return this
         }
     
         val batchResult = statsSum *:* expElogbeta.t
    -    // Note that this is an optimization to avoid batch.count
    -    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
         updateLambda(batchResult, batchSize)
     
    -    logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
    -    logphatOption.foreach(updateAlpha(_, nonEmptyDocsN))
    +    logphatOption.foreach(_ /= batchSize.toDouble)
    +    logphatOption.foreach(updateAlpha(_, batchSize))
     
         this
       }
     
       /**
        * Update lambda based on the batch submitted. batchSize can be different for each iteration.
        */
    -  private def updateLambda(stat: BDM[Double], batchSize: Int): Unit = {
    +  private def updateLambda(stat: BDM[Double], batchSize: Double): Unit = {
    --- End diff --
    
    I did this just to be consistent with `updateAlpha`. Is that a bad idea?


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147229232
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    So technically, you suggest moving the check
    
    ```
    if (batchSize == 0) {
      logWarning("No non-empty documents were submitted in the batch.")
      // Therefore, there is no need to update any of the model parameters
      return this
    }
    ```
    
    to `next()`? Sounds reasonable.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Consider the following scenario. Let `docs` be an RDD containing 1000 empty documents and 1000 non-empty documents, and let `miniBatchFraction = 0.05`.
    
    Assume we use `filter(...).sample(...)`. Then the resulting RDD will have around `50` elements.
    
    If we use `sample(...).filter(...)` instead, `sample` returns around `100` elements, and the number of elements in the RDD returned by `filter` is (asymptotically) normally distributed. The expectation is `50` again, though.
    Am I missing something?
    
    However, for larger samples this shouldn't make any difference.
    
    On the purpose of the issue: there were two different variables, `batchSize` and `nonEmptyDocsN`, which could not be used interchangeably. The purpose is to submit a batch containing no empty docs, which makes the two variables refer to the same value.
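    A minimal local sketch of this scenario (illustrative only; the object name is mine, and documents are stood in for by their token counts):
    
    ```
    import org.apache.spark.sql.SparkSession
    
    object SampleFilterOrder {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
        val sc = spark.sparkContext
        // 1000 empty docs (0 tokens) and 1000 non-empty docs (5 tokens each).
        val docs = sc.parallelize(Seq.fill(1000)(0) ++ Seq.fill(1000)(5))
        val fraction = 0.05
    
        // filter(...).sample(...): samples from the 1000 non-empty docs, ~50 remain.
        val a = docs.filter(_ > 0).sample(withReplacement = false, fraction).count()
        // sample(...).filter(...): ~100 docs sampled, of which ~50 are non-empty.
        val b = docs.sample(withReplacement = false, fraction).filter(_ > 0).count()
    
        println(s"filter-then-sample: $a, sample-then-filter: $b") // both ~50 in expectation
        spark.stop()
      }
    }
    ```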


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh, is there a cluster I can use for this? 


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148437931
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    I think isEmpty() is optimized to avoid a full materialization.
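    For reference, `RDD.isEmpty()` is implemented in Spark roughly as follows (paraphrased from the Spark source, so treat it as a sketch):
    
    ```
    // Paraphrased from org.apache.spark.rdd.RDD (Spark 2.x); shown as a method of RDD[T].
    def isEmpty(): Boolean =
      // No partitions, or a take(1) that returns nothing, means empty;
      // take(1) scans partitions lazily and stops at the first element found.
      partitions.length == 0 || take(1).length == 0
    ```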


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I agree, they're the same. You said at https://github.com/apache/spark/pull/19565#issuecomment-339638791 that they weren't.
    
    But if you're saying the code already filters out empty docs further upstream anyway, then there is no change in logic, just where the filtering happens. Or did I misunderstand that part?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit [`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95).


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh OK, but it returns almost the same number of elements. Anyway, the variance is going to be much smaller than in the "sample before filter" case.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @WeichenXu123 
    @jkbradley said that pings on GitHub don't work for him...


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147021004
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
         submitMiniBatch(batch)
       }
     
       /**
        * Submit a subset (like 1%, decide by the miniBatchFraction) of the corpus to the Online LDA
        * model, and it will update the topic distribution adaptively for the terms appearing in the
        * subset.
    +   * The methods assumes no empty documents are submitted.
    --- End diff --
    
    Maybe add a require
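    A sketch of what such a `require` could look like (hypothetical helper, not the PR's code; note that the check itself triggers a Spark job, so it is not free):
    
    ```
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    
    // Fails fast if the batch violates submitMiniBatch's assumption.
    def requireNoEmptyDocs(batch: RDD[(Long, Vector)]): Unit = {
      require(batch.filter(_._2.numNonzeros == 0).isEmpty(),
        "submitMiniBatch expects a batch with no empty documents")
    }
    ```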


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148507781
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    -      .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
    -        elementWiseSum, elementWiseSum
    -      )
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) =
    +      batch.treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))({
    +        case (acc, (_, termCounts)) =>
    +          val stat = BDM.zeros[Double](k, vocabSize)
    --- End diff --
    
    Actually, we can fix this w/o falling back to `mapPartition`.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I wonder if we should add cache() for lda training data, even not for this feature. 
    
    @srowen Not sure where we're on caching the training data or not for different algorithms. Appreciate your advice.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146810442
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
    --- End diff --
    
    hmm... I think it only needs the code block in `mapPartition` to be moved into the aggregate's `seqOp` function. Will that make the code more complicated?
    I think it can be fixed in this PR; it is only a simple modification and does not affect the logic.
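    A toy sketch of that refactoring (types and names mine; the real `seqOp` would fold the Breeze statistics for each document rather than simple counts):
    
    ```
    import org.apache.spark.rdd.RDD
    
    // One treeAggregate replaces mapPartitions-then-aggregate: seqOp folds a single
    // document into the partition-local accumulator, combOp merges accumulators.
    def countDocsAndTokens(batch: RDD[(Long, Array[Int])]): (Long, Long) =
      batch.treeAggregate((0L, 0L))(
        seqOp = { case ((docs, tokens), (_, termCounts)) =>
          (docs + 1, tokens + termCounts.sum)
        },
        combOp = { case ((d1, t1), (d2, t2)) => (d1 + d2, t1 + t2) }
      )
    ```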


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Regarding caching: I think that can be ignored for purposes of this change. All this does is add a filter, and it doesn't cause an RDD to computed more than it was before.
    
    The only question is whether the filtering is worth it; does it filter out enough that it makes later work faster?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I am sure that caching can be avoided here. Hence, it should not be used.
    
    @srowen, maybe I'm missing something, but I'm afraid that currently the lineage for a single mini-batch submission looks like this:
    `docs.filter(nonEmpty).sample(minibatchFraction).treeAggregate(...)`.
    
    And I'm afraid that the `filter` will be performed all over again for each mini-batch. But if we have something like
    `docs.sample(minibatchFraction).filter(nonEmpty).treeAggregate(...)`,
    this is avoided, and no caching is needed.
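    A toy sketch of that recomputation concern (names mine; `count()` stands in for the real `treeAggregate`):
    
    ```
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    
    // filter is lazy, so without cache() every job on a sampled batch
    // re-scans docs and re-applies the predicate.
    def runEpochs(docs: RDD[(Long, Vector)], fraction: Double, epochs: Int): Unit = {
      val filtered = docs.filter(_._2.numNonzeros > 0) // no cache(): recomputed per job
      (0 until epochs).foreach { i =>
        val batch = filtered.sample(withReplacement = false, fraction, seed = i.toLong)
        batch.count() // each action re-runs the filter over the full docs RDD
      }
    }
    ```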


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148517729
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    I've added the test. Thank you. 


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Filtering after sampling makes more sense. Though sampling isn't deterministic, it doesn't change the probability that any particular sample is produced.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit [`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147208156
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
         submitMiniBatch(batch)
       }
     
       /**
        * Submit a subset (like 1%, decide by the miniBatchFraction) of the corpus to the Online LDA
        * model, and it will update the topic distribution adaptively for the terms appearing in the
        * subset.
    +   * The methods assumes no empty documents are submitted.
    --- End diff --
    
    Thank you. I'll add it.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @WeichenXu123, yes, there is indeed a difference in logic. Eventually it boils down to the semantics of `miniBatchFraction`. If it is the fraction of non-empty documents being sampled, the version with `filter` going first is correct. If it's the fraction of all documents (empty and non-empty) being sampled, then the version with `sample` going first is correct. To me the first version seems more reasonable (who cares about empty docs anyway?). @srowen, if I get it right, you would prefer the second option. Why?
    
    @WeichenXu123, I agree with you: filtering introduces minimal overhead.
    
    @srowen, regarding performance... I don't actually think it makes any difference unless the complexity of `sample` depends on the size of the parent RDD. In all subsequent computations, empty documents can be handled efficiently.
    



---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146571987
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           docs: RDD[(Long, Vector)],
           lda: LDA): OnlineLDAOptimizer = {
         this.k = lda.getK
    -    this.corpusSize = docs.count()
    +    this.docs = docs.filter(_._2.numNonzeros > 0) // filter out empty documents
    +    this.corpusSize = this.docs.count()
         this.vocabSize = docs.first()._2.size
    --- End diff --
    
    `docs` is assumed to be non-empty, while `this.docs` may be empty. In this case `first()` fails. 
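    A small illustration of the failure mode (hypothetical helper, not the PR's code):
    
    ```
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    
    // vocabSize is just the vector dimension, identical for every document,
    // so it can safely be read from the unfiltered RDD, which is assumed non-empty.
    def vocabSizeOf(allDocs: RDD[(Long, Vector)]): Int = {
      // allDocs.filter(_._2.numNonzeros > 0).first() would throw
      // UnsupportedOperationException("empty collection") if every doc were empty;
      // allDocs.first() cannot.
      allDocs.first()._2.size
    }
    ```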


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146799989
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
    --- End diff --
    
    Not a big deal, but the current code uses `mapPartition` (doing the aggregation inside it) and then a `treeAggregate`; the two steps can be merged into one `treeAggregate`, right? That would make the code clearer.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    OK, I agree with this change. @jkbradley, can you take a look?


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146569259
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           docs: RDD[(Long, Vector)],
           lda: LDA): OnlineLDAOptimizer = {
         this.k = lda.getK
    -    this.corpusSize = docs.count()
    +    this.docs = docs.filter(_._2.numNonzeros > 0) // filter out empty documents
    +    this.corpusSize = this.docs.count()
         this.vocabSize = docs.first()._2.size
    --- End diff --
    
    Needs to be `this.docs` too


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148506477
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    -      .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
    -        elementWiseSum, elementWiseSum
    -      )
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) =
    +      batch.treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))({
    +        case (acc, (_, termCounts)) =>
    +          val stat = BDM.zeros[Double](k, vocabSize)
    --- End diff --
    
    Thanks, that's a good point. It feels like this will definitely lead to a higher GC load. @WeichenXu123, what do you think?
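    A sketch of the in-place alternative (illustrative only, names mine; the real `seqOp` computes the LDA E-step statistics rather than raw counts):
    
    ```
    import breeze.linalg.{DenseMatrix => BDM}
    
    // Instead of allocating a fresh k x vocabSize matrix per document and summing
    // the temporaries, the seqOp can write into the accumulator matrix in place.
    def foldDoc(acc: (BDM[Double], Long), doc: Array[(Int, Double)], k: Int): (BDM[Double], Long) = {
      val (stat, n) = acc
      doc.foreach { case (termId, count) =>
        var topic = 0
        while (topic < k) {
          stat(topic, termId) += count // in-place update, no per-doc temporary matrix
          topic += 1
        }
      }
      (stat, n + 1)
    }
    ```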


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83330/
    Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    And the empty docs were not explicitly filtered out before; they were just ignored in `submitMiniBatch`.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I'm curious about the performance comparison: if "filter before sample" triggers a filter over the whole dataset for each `submitMiniBatch`, then there'll be some performance impact.
    
    And if "filter before sample" is used, IMO `miniBatchFraction` should be adjusted proportionally.



---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit [`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f).


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83301/
    Test FAILed.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146882424
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
     
         expElogbetaBc.destroy(false)
     
    -    if (nonEmptyDocsN == 0) {
    +    if (batchSize == 0) {
           logWarning("No non-empty documents were submitted in the batch.")
           // Therefore, there is no need to update any of the model parameters
           return this
         }
     
         val batchResult = statsSum *:* expElogbeta.t
    -    // Note that this is an optimization to avoid batch.count
    -    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
         updateLambda(batchResult, batchSize)
     
    -    logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
    -    logphatOption.foreach(updateAlpha(_, nonEmptyDocsN))
    +    logphatOption.foreach(_ /= batchSize.toDouble)
    +    logphatOption.foreach(updateAlpha(_, batchSize))
     
         this
       }
     
       /**
        * Update lambda based on the batch submitted. batchSize can be different for each iteration.
        */
    -  private def updateLambda(stat: BDM[Double], batchSize: Int): Unit = {
    +  private def updateLambda(stat: BDM[Double], batchSize: Double): Unit = {
    --- End diff --
    
    I guess it isn't. If we wanted this kind of consistency, `batchSize` would rather have to be `Double` all the way. So I've changed the type to `Long`. Thanks for the comment.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Yes, I think that when the dataset is large enough, using the same `miniBatchFraction`, the resulting RDD sizes of "filter before sample" and "filter after sample" will be asymptotically equal, no matter how many empty elements are in the dataset. (Correct me if I am wrong; I am a little confused about "miniBatchFraction should be adjusted proportionally" -- if it is adjusted, then isn't the asymptotic equality broken?) If so, does it really affect the algorithm?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83058/
    Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @akopich If you want to cache the input dataset, create a JIRA to discuss it first; it's another issue, I think. This JIRA is also related to the input caching issue: https://issues.apache.org/jira/browse/SPARK-19422


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Now I feel that filtering empty docs out in `initialize` is not a good idea, because the filter will be re-executed every time `sample` in `next` gets called. Right?
    
    Alternatively, we can cache `this.docs`, but that's a waste of space...


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Let me know if I missed anything, but I don't quite follow the part
    
    > all the batches have the same length
    
    IMO,
    `docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction, randomGenerator.nextLong())` does not return the same number of documents across multiple calls.
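    For reference, under Bernoulli sampling the batch size over a corpus of N documents with fraction p is binomially distributed, so it genuinely varies from call to call (a fact about sampling, not PR code):
    
    ```
    \#\text{batch} \sim \operatorname{Binomial}(N, p), \qquad
    \mathbb{E}[\#\text{batch}] = N p, \qquad
    \operatorname{Var}[\#\text{batch}] = N p (1 - p)
    ```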


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    ping @WeichenXu123, @srowen, @hhbyyh
    Further comments?


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147208062
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    The method assumes no empty documents are submitted. But it's perfectly fine to submit an empty RDD. 
    
    And I need to fix a typo. 


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @akopich I'm actually leaning towards "filter after sample".
    
    1. So we don't need to change `miniBatchFraction` in
    `docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
            randomGenerator.nextLong())`.
    
    2. Minor: I'm not sure setting `corpusSize` to the number of non-empty docs is always good. I'd prefer to keep `docs` and `corpusSize` referring to the entire dataset.
    
    



---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83045/
    Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit [`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453).


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146804206
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
    --- End diff --
    
    Yes, that's right. But on the other hand, we've currently got a rather simple `treeAggregate` so it barely contributes to the complexity of the code. 
    
    Anyway, is it OK to fix this in this issue? 


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh 
    
    Yes, this way we don't change the semantics of `miniBatchFraction`. But is the way it is defined now actually correct? As I mentioned above, in `upstream/master` the number of non-empty documents in the mini-batch is asymptotically normally distributed. Hence, the size of the RDD fed to `treeAggregate` differs from one batch to another, while in this PR (filtering before sampling) all the batches have the same length.
    
    But then again, for large datasets this should make no difference. If nobody thinks this disparity of batch sizes is an issue, I won't object to sampling before filtering.
    
    @WeichenXu123, I believe you were advocating for filter before sample. Do you still have that preference?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit [`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a).


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147020853
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    We still need this, right?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit [`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df).


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147021726
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    I believe it's redundant now. In any case, `submitMiniBatch` counts the documents.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @hhbyyh So I guess I should just roll the refactoring back, right?


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147226715
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    I would prefer to leave the check here:
    1. Performance-wise, sending an empty RDD through `submitMiniBatch` does not make much sense.
    2. It keeps `submitMiniBatch` simpler if no empty RDD is ever sent, especially when we're filtering empty documents.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83088/
    Test PASSed.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147230366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    But performance-wise... Currently we have a single pass. If we move the check to `next`, then `isEmpty`, being an action, will trigger computation of `batch`, and `batch` will be computed once again when `treeAggregate` is called in `submitMiniBatch`.
    
    Right?
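    For illustration, a hedged sketch (not the PR's code) of how caching could avoid the double computation if the check stayed in `next`:
    
    ```scala
    val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
      randomGenerator.nextLong()).cache()  // let isEmpty() and treeAggregate share one computation
    val updated = if (batch.isEmpty()) this else submitMiniBatch(batch)
    batch.unpersist()
    updated
    ```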


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit [`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851).


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I'm saying they are not the same, but for larger datasets this should not matter.
    
    There is a change in logic: the hack
    `val batchSize = (miniBatchFraction * corpusSize).ceil.toInt`
    is not used anymore. The function `updateLambda` now uses the real number of non-empty docs.
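    In code, the change amounts to this (simplified from the diff quoted later in the thread):
    
    ```scala
    // Before: the batch size was estimated from the sampling fraction,
    // so empty documents were effectively counted:
    //   val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
    // After: the exact number of non-empty documents is accumulated as the
    // third element of the treeAggregate result and passed on directly:
    updateLambda(batchResult, batchSize)
    ```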


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Okay... any idea why tests failed? It says 
    ```ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?```



---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Or (and I think it would be the most efficient approach) we can just put the emptiness check for a document into the `seqOp` of `treeAggregate`. However, that doesn't quite look like "filtering out beforehand", so would this be OK?
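    Something like this hedged sketch (`accumulateStats` is a hypothetical stand-in for the per-document update; `BDM`/`BDV` are the file's breeze aliases):
    
    ```scala
    val seqOp = (acc: (BDM[Double], Option[BDV[Double]], Long), doc: (Long, Vector)) => {
      if (doc._2.numNonzeros == 0) {
        acc  // empty document: contributes nothing and is not counted
      } else {
        val (stat, logphat, nonEmptyDocsN) = accumulateStats(acc, doc)  // hypothetical helper
        (stat, logphat, nonEmptyDocsN + 1L)
      }
    }
    ```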
    



---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Yes, compared with the current code it does change the sampling probabilities.
    But according to the comment from @jkbradley in #18924, "in order to make **corpusSize**, batchSize, and nonEmptyDocsN all refer to the same filtered corpus", I take @jkbradley's intent to be filtering out empty docs before sampling, so the relationship should be:
    `batchSize = miniBatchFraction * corpusSize` and `batchSize == nonEmptyDocsN`


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148438759
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    If you still want to remove the line, would you please add a unit test to ensure `submitMiniBatch` can handle an empty RDD? Thanks.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @WeichenXu123, @hhbyyh, looking forward to your opinion.


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    I assume not-selecting a record in a sample is cheaper than just about any other operation, including filtering on a predicate. All else being equal, I'd rather sample, then evaluate the predicate on only the sample.
    
    The two versions produce the same result though: either way, every x% sample of non-empty docs appears with equal probability.
    
    That said, I doubt one is meaningfully faster than the other overall. If performance isn't the motive, and there's no functional change, what's the purpose here?


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit [`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146820166
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
    --- End diff --
    
    It looks better now. Thank you for the suggestion. 


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r148438581
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +481,46 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    -      .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
    -        elementWiseSum, elementWiseSum
    -      )
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) =
    +      batch.treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))({
    +        case (acc, (_, termCounts)) =>
    +          val stat = BDM.zeros[Double](k, vocabSize)
    --- End diff --
    
    This is a per-record operation. Will it consume more memory than `mapPartitions`, especially when k and vocabSize are large?
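    For comparison, a hedged sketch of the `mapPartitions` alternative, where the k x vocabSize buffer is allocated once per partition rather than once per record:
    
    ```scala
    val stats = batch.mapPartitions { docsInPartition =>
      val stat = BDM.zeros[Double](k, vocabSize)  // one buffer for the whole partition
      var count = 0L
      docsInPartition.foreach { case (_, termCounts) =>
        // accumulate this document's sufficient statistics into `stat` (details omitted)
        count += 1L
      }
      Iterator((stat, count))
    }
    ```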


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    It's probably better to wait for a committer's opinion.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146569644
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           (u._1, u._2, u._3 + v._3)
         }
     
    -    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], nonEmptyDocsN: Long) = stats
    +    val (statsSum: BDM[Double], logphatOption: Option[BDV[Double]], batchSize: Long) = stats
           .treeAggregate((BDM.zeros[Double](k, vocabSize), logphatPartOptionBase(), 0L))(
             elementWiseSum, elementWiseSum
           )
     
         expElogbetaBc.destroy(false)
     
    -    if (nonEmptyDocsN == 0) {
    +    if (batchSize == 0) {
           logWarning("No non-empty documents were submitted in the batch.")
           // Therefore, there is no need to update any of the model parameters
           return this
         }
     
         val batchResult = statsSum *:* expElogbeta.t
    -    // Note that this is an optimization to avoid batch.count
    -    val batchSize = (miniBatchFraction * corpusSize).ceil.toInt
         updateLambda(batchResult, batchSize)
     
    -    logphatOption.foreach(_ /= nonEmptyDocsN.toDouble)
    -    logphatOption.foreach(updateAlpha(_, nonEmptyDocsN))
    +    logphatOption.foreach(_ /= batchSize.toDouble)
    +    logphatOption.foreach(updateAlpha(_, batchSize))
     
         this
       }
     
       /**
        * Update lambda based on the batch submitted. batchSize can be different for each iteration.
        */
    -  private def updateLambda(stat: BDM[Double], batchSize: Int): Unit = {
    +  private def updateLambda(stat: BDM[Double], batchSize: Double): Unit = {
         // weight of the mini-batch.
         val weight = rho()
     
         // Update lambda based on documents.
         lambda := (1 - weight) * lambda +
    -      weight * (stat * (corpusSize.toDouble / batchSize.toDouble) + eta)
    +      weight * (stat * (corpusSize.toDouble / batchSize) + eta)
    --- End diff --
    
    This isn't necessary to achieve floating-point division.
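    A quick illustration: Scala already promotes the `Int` operand when the other operand is a `Double`.
    
    ```scala
    val corpusSize = 100L
    val batchSize = 8  // Int
    corpusSize.toDouble / batchSize            // 12.5 -- already floating-point division
    corpusSize.toDouble / batchSize.toDouble   // identical result
    ```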


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    @akopich IMO the filter won't cost too much; don't worry about the performance. (Or you can run a test to make sure.)


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    **[Test build #83045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit [`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r147207042
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       override private[clustering] def next(): OnlineLDAOptimizer = {
         val batch = docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
           randomGenerator.nextLong())
    -    if (batch.isEmpty()) return this
    --- End diff --
    
    If that's the case, you should remove the comment below:
    "The methods assumes no empty documents are submitted."


---



[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

Posted by akopich <gi...@git.apache.org>.
Github user akopich commented on the issue:

    https://github.com/apache/spark/pull/19565
  
    Ping @hhbyyh, @WeichenXu123, @srowen.
    
    It seems the discussion is stuck. Does anybody think the general approach implemented in this PR should be changed? Currently it is filtering before sampling, with no caching.


---
