Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/08/16 06:28:00 UTC

[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

    [ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128393#comment-16128393 ] 

Sean Owen commented on SPARK-21742:
-----------------------------------

Is that a bug? Isn't the algorithm stochastic and dependent on the data order anyway, which could vary when the input varies? Neither result is wrong.
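
One way to separate the two effects might be to build the vectors on the driver, so that the input is identical on every pass, and then compare the uncached and cached fits by cost as well as by number of centers. This is only a sketch against the 2.x {{spark.ml}} API: the names {{data}}, {{uncached}} and {{cached}} are made up for illustration, and the parameters are copied from the report below.
{code}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import scala.util.Random

// Reuses the active session when run from spark-shell.
val spark = SparkSession.builder().getOrCreate()

// Generate the rows on the driver so that recomputing the DataFrame
// cannot change the input (illustrative setup, not the reporter's).
val random = new Random(42)
val dim = 1000
val nnz = random.nextInt(dim)
val vectors: Seq[Vector] = Seq.fill(10) {
  val indices = random.shuffle((0 until dim).toList).take(nnz).sorted.toArray
  Vectors.sparse(dim, indices, Array.fill(nnz)(random.nextDouble()))
}
val data = spark.createDataFrame(vectors.map(Tuple1.apply)).toDF("features")

val bkm = new BisectingKMeans()
  .setK(5)
  .setMinDivisibleClusterSize(4)
  .setMaxIter(4)
  .setSeed(123)

// Fit once before and once after persisting the same DataFrame.
val uncached = bkm.fit(data)
data.persist()
val cached = bkm.fit(data)

// Compare both the number of centers and the clustering cost.
println(s"centers: ${uncached.clusterCenters.length} vs ${cached.clusterCenters.length}")
println(s"cost: ${uncached.computeCost(data)} vs ${cached.computeCost(data)}")

data.unpersist()
{code}
If the two fits still differ on input that is fixed this way, the difference would have to come from somewhere other than the data itself varying between passes.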

> BisectingKMeans generate different models with/without caching
> --------------------------------------------------------------
>
>                 Key: SPARK-21742
>                 URL: https://issues.apache.org/jira/browse/SPARK-21742
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} generates different models depending on whether the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can see that if we cache the input, the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
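> // Fit on the uncached input.
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> // res0: Int = 2
> // Cache the input and fit again with the same estimator.
> sparseDataset.persist()
> val cachedModel = bkm.fit(sparseDataset)
> cachedModel.clusterCenters.length
> // res2: Int = 3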
> {code}
> [~imatiach] 


