You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2017/08/16 03:34:00 UTC

[jira] [Updated] (SPARK-21742) BisectingKMeans generate different models with/without caching

     [ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-21742:
---------------------------------
    Summary: BisectingKMeans generate different models with/without caching  (was: BisectingKMeans generate different results with/without caching)

> BisectingKMeans generate different models with/without caching
> --------------------------------------------------------------
>
>                 Key: SPARK-21742
>                 URL: https://issues.apache.org/jira/browse/SPARK-21742
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models if the input is cached or not.
> Using the same dataset in {{BisectingKMeansSuite}}, we can found that if we cache the input, then the number of centers will change from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org