Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2017/08/16 03:34:00 UTC
[jira] [Created] (SPARK-21742) BisectingKMeans generate different results with/without caching
zhengruifeng created SPARK-21742:
------------------------------------
Summary: BisectingKMeans generate different results with/without caching
Key: SPARK-21742
URL: https://issues.apache.org/jira/browse/SPARK-21742
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng
I found that {{BisectingKMeans}} generates different models depending on whether the input is cached.
Using the same dataset as in {{BisectingKMeansSuite}}, we find that caching the input changes the number of cluster centers from 2 to 3.
This looks like a potential bug.
{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.linalg._
import scala.util.Random
case class TestRow(features: org.apache.spark.ml.linalg.Vector)
val rows = 10
val dim = 1000
val seed = 42
val random = new Random(seed)
val nnz = random.nextInt(dim)
val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
val sparseDataset = spark.createDataFrame(rdd)
val k = 5
val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
// Fit on the uncached input.
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
// res0: Int = 2
// Cache the same input and fit again with the same estimator.
sparseDataset.persist()
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
// res2: Int = 3
{code}
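The reproduction above only compares the number of centers. To also compare the centers themselves, a helper along the following lines might be useful. This is a minimal sketch: {{CenterComparison}}, {{squaredDistance}}, {{sameCenters}} and the tolerance {{tol}} are hypothetical names introduced here for illustration and are not part of Spark. The helper works on plain arrays and ignores the order in which the centers are listed.

```scala
// Hypothetical helper (not part of Spark): compares two sets of cluster
// centers given as plain arrays, ignoring the order of the centers.
object CenterComparison {

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // True when both sets have the same size and every center in `left` can be
  // paired with a distinct center in `right` at Euclidean distance <= tol.
  // The pairing is greedy, which is adequate when tol is much smaller than
  // the spacing between centers.
  def sameCenters(
      left: Array[Array[Double]],
      right: Array[Array[Double]],
      tol: Double = 1e-9): Boolean = {
    left.length == right.length && {
      val unused = scala.collection.mutable.ArrayBuffer(right: _*)
      left.forall { c =>
        val idx = unused.indexWhere(r => squaredDistance(c, r) <= tol * tol)
        if (idx >= 0) { unused.remove(idx); true } else false
      }
    }
  }
}
```

In the shell session above, the two fitted models would first need distinct names (say {{modelUncached}} and {{modelCached}}, both hypothetical); the call would then be {{CenterComparison.sameCenters(modelUncached.clusterCenters.map(_.toArray), modelCached.clusterCenters.map(_.toArray))}}. For the results reported here it returns false, because the two models have 2 and 3 centers respectively.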
[~imatiach]