You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ludatabricks <gi...@git.apache.org> on 2018/04/27 17:03:02 UTC
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
GitHub user ludatabricks opened a pull request:
https://github.com/apache/spark/pull/21183
[SPARK-22210][ML] Add seed for LDA variationalTopicInference
## What changes were proposed in this pull request?
- Add seed parameter for variationalTopicInference
- Add seed for calling variationalTopicInference in submitMiniBatch
- Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution.
## How was this patch tested?
Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed.
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ludatabricks/spark-1 SPARK-22210
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21183.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21183
----
commit e647e85e21c63546611c50e45bc57d232b0cbe83
Author: Lu WANG <lu...@...>
Date: 2018-04-27T16:46:45Z
Add seed for LDA variationalTopicInference
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r188081089
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
None
}
- val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
+ val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex
--- End diff --
fix scala style:
```
val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex {
(index, docs) =>
val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)
val stat = BDM.zeros[Double](k, vocabSize)
val logphatPartOption = logphatPartOptionBase()
var nonEmptyDocCount: Long = 0L
nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
nonEmptyDocCount += 1
val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference(
termCounts, expElogbetaBc.value, alpha, gammaShape, k, seed + index)
stat(::, ids) := stat(::, ids) + sstats
logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad))
}
Iterator((stat, logphatPartOption, nonEmptyDocCount))
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89936/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r187217859
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -622,11 +623,11 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] {
val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
val matrixConverted = MLUtils.convertMatrixColumnsToML(vectorConverted, "topicsMatrix")
val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
- topicConcentration: Double, gammaShape: Double) =
+ topicConcentration: Double, gammaShape: Double, seed: Long) =
--- End diff --
This will break backwards compatibility of ML persistence (when users try to load LDAModels saved using past versions of Spark). Could you please test this manually by saving a LocalLDAModel using Spark 2.3 and loading it with a build of your PR? You can fix this by checking for the Spark version (in the `metadata`) and loading the seed for Spark >= 2.4.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90429/testReport)** for PR 21183 at commit [`6c2a8aa`](https://github.com/apache/spark/commit/6c2a8aadb51d53055c0c7d2d7c17aeb1f049baa2).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90611 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90611/testReport)** for PR 21183 at commit [`7ee0ebf`](https://github.com/apache/spark/commit/7ee0ebf028e41719514c0588d378cb515aea744a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/21183
Thanks for checking this manually. Since the test sometimes fails, then let's leave it.
LGTM
Merging with master
Thanks @ludatabricks and @mengxr !
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #89936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89936/testReport)** for PR 21183 at commit [`e647e85`](https://github.com/apache/spark/commit/e647e85e21c63546611c50e45bc57d232b0cbe83).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90429/testReport)** for PR 21183 at commit [`6c2a8aa`](https://github.com/apache/spark/commit/6c2a8aadb51d53055c0c7d2d7c17aeb1f049baa2).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90528/testReport)** for PR 21183 at commit [`a846937`](https://github.com/apache/spark/commit/a8469374b00c0466d480367992799ca77a7afa06).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r185049467
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -605,14 +609,16 @@ private[clustering] object OnlineLDAOptimizer {
expElogbeta: BDM[Double],
alpha: breeze.linalg.Vector[Double],
gammaShape: Double,
- k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+ k: Int,
+ seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: DenseVector => ((0 until v.size).toList, v.values)
case v: SparseVector => (v.indices.toList, v.values)
}
// Initialize the variational distribution q(theta|gamma) for the mini-batch
+ val randBasis = new RandBasis(new org.apache.commons.math3.random.MersenneTwister(seed))
val gammad: BDV[Double] =
- new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k) // K
+ new Gamma(gammaShape, 1.0 / gammaShape)(randBasis).samplesVector(k) // K
--- End diff --
nit: Note that the original spacing with the comment was intentional to match lines below.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90528/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #89936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89936/testReport)** for PR 21183 at commit [`e647e85`](https://github.com/apache/spark/commit/e647e85e21c63546611c50e45bc57d232b0cbe83).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r187216371
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -252,6 +252,15 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
val lda = new LDA()
testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings,
LDASuite.allParamSettings, checkModelData)
+
+ def checkModelDataWithDataset(model: LDAModel, model2: LDAModel,
--- End diff --
style: Please fix this to match other multi-line method headers.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90697/testReport)** for PR 21183 at commit [`4740645`](https://github.com/apache/spark/commit/4740645eb6d5985cd559bd9d4fbb871fe3fe3c7e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r188779768
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -253,6 +253,14 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
val lda = new LDA()
testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings,
LDASuite.allParamSettings, checkModelData)
+
+ def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, dataset: Dataset[_]): Unit = {
--- End diff --
* We don't need this inline method.
* Leave a comment to explain why we are doing this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21183#discussion_r184753197
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
None
}
- val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
+ val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndexInternal
--- End diff --
Let's not use mapPartitionsWithIndexInternal; I don't think closure cleaning is expensive enough for us to worry about here. Use mapPartitionsWithIndex instead.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90697/testReport)** for PR 21183 at commit [`4740645`](https://github.com/apache/spark/commit/4740645eb6d5985cd559bd9d4fbb871fe3fe3c7e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by ludatabricks <gi...@git.apache.org>.
Github user ludatabricks commented on the issue:
https://github.com/apache/spark/pull/21183
I tested to load the old saving models from Spark 2.3. It is ok to load it from this.
For the tests in LDASuite, I do see failing sometimes without this fix. It will not always happen. I can remove it if you think it is not necessary.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90528/testReport)** for PR 21183 at commit [`a846937`](https://github.com/apache/spark/commit/a8469374b00c0466d480367992799ca77a7afa06).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90611/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/21183
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21183
**[Test build #90611 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90611/testReport)** for PR 21183 at commit [`7ee0ebf`](https://github.com/apache/spark/commit/7ee0ebf028e41719514c0588d378cb515aea744a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90429/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21183
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90697/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org