Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2017/08/16 03:34:00 UTC

[jira] [Created] (SPARK-21742) BisectingKMeans generate different results with/without caching

zhengruifeng created SPARK-21742:
------------------------------------

             Summary: BisectingKMeans generate different results with/without caching
                 Key: SPARK-21742
                 URL: https://issues.apache.org/jira/browse/SPARK-21742
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: zhengruifeng


I found that {{BisectingKMeans}} generates different models depending on whether the input is cached.
Using the same dataset as in {{BisectingKMeansSuite}}, we can see that caching the input changes the number of cluster centers from 2 to 3.

This looks like a potential bug.
{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.linalg._
import scala.util.Random
case class TestRow(features: org.apache.spark.ml.linalg.Vector)

val rows = 10
val dim = 1000
val seed = 42

val random = new Random(seed)
val nnz = random.nextInt(dim)
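// Each row is a sparse vector with nnz randomly placed, randomly valued entries.
val rdd = sc.parallelize(1 to rows).map { i =>
  val indices = random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray
  val values = Array.fill(nnz)(random.nextDouble())
  TestRow(Vectors.sparse(dim, indices, values))
}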
val sparseDataset = spark.createDataFrame(rdd)

val k = 5
val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
// res0: Int = 2

// Fit again with the same estimator and seed, this time on the cached input.
sparseDataset.persist()
val cachedModel = bkm.fit(sparseDataset)
cachedModel.clusterCenters.length
// res2: Int = 3
{code}
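
One way to rule out the data generator as a source of variation is to build the rows on the driver instead of inside an RDD closure. The following is only a minimal sketch in plain Scala, under the assumption that rows in the same style as the reproducer are good enough; it will not necessarily produce the same values as the snippet above. {{genSparseRows}} is a hypothetical helper, not part of Spark or {{BisectingKMeansSuite}}.

```scala
import scala.util.Random

// Hypothetical helper (not part of Spark or the original report): builds
// sparse rows on the driver as (size, sorted indices, values) triples.
def genSparseRows(rows: Int, dim: Int, seed: Long): Seq[(Int, Array[Int], Array[Double])] = {
  val random = new Random(seed)
  val nnz = random.nextInt(dim) // number of non-zero entries, shared by all rows
  (1 to rows).map { _ =>
    val indices = random.shuffle((0 until dim).toList).take(nnz).sorted.toArray
    val values = Array.fill(nnz)(random.nextDouble())
    (dim, indices, values)
  }
}

// Generating twice with the same seed should give identical rows.
val a = genSparseRows(10, 1000, 42L)
val b = genSparseRows(10, 1000, 42L)
val identical = a.zip(b).forall { case ((_, i1, v1), (_, i2, v2)) =>
  i1.sameElements(i2) && v1.sameElements(v2)
}
println(s"identical across two generations: $identical")
```

In spark-shell, each triple could then be turned into a vector with {{Vectors.sparse(size, indices, values)}}, wrapped in {{TestRow}}, and passed to {{spark.createDataFrame}} before fitting with and without {{persist()}}. If the center counts still differ on such fixed input, the difference is unlikely to come from the generator.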

[~imatiach] 


