Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2018/06/04 23:09:00 UTC
[jira] [Resolved] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly
[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-24300.
-----------------------------------
Resolution: Fixed
Fix Version/s: 2.4.0
Issue resolved by pull request 21492
[https://github.com/apache/spark/pull/21492]
> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> ----------------------------------------------------------------
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Xiangrui Meng
> Assignee: Lu Wang
> Priority: Minor
> Fix For: 2.4.0
>
>
> [https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>
> generateLDAData uses the same RNG in all partitions to generate random data. This causes either duplicate rows in cluster mode or nondeterministic behavior in local mode:
> {code:scala}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition, seeded deterministically, so the generated data is safe and reproducible.
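>
> A minimal sketch of the per-partition approach (the code below is illustrative, not the exact patch; the actual change is in pull request 21492):
> {code:scala}
> // Build one RNG per partition, seeded from a base seed plus the
> // partition index, so every partition produces a distinct,
> // reproducible stream instead of copies of one driver-side RNG.
> val seed = 10L
> val data = sc.parallelize(1 to 10, 4).mapPartitionsWithIndex { (pid, iter) =>
>   val rng = new java.util.Random(seed + pid)
>   iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
> }
> data.collect().mkString("\n")
> {code}
> Because the RNG is constructed inside the task rather than captured in the driver's closure, no shared or duplicated generator state is shipped to the executors.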
>
> cc: [~lu.DB] [~josephkb]