Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/09/03 08:25:20 UTC

[jira] [Assigned] (SPARK-17389) Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs'

     [ https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17389:
------------------------------------

    Assignee: Apache Spark  (was: Sean Owen)

> Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs'
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-17389
>                 URL: https://issues.apache.org/jira/browse/SPARK-17389
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Sean Owen
>            Assignee: Apache Spark
>
> As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time is spent in kmeans|| initialization. For example, in this simple example...
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
> data.count()
> new KMeans().setK(1000).setMaxIterations(5).run(data)
> {code}
> Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init therefore takes about as long as 24 iterations (roughly a typical run length), meaning about half the total time is spent just picking cluster centers. This seems excessive.
> There are two ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This results in about a 13% init speedup, from 5:54 to 5:09 in this example.
> By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 is plenty. (See Figure 5.2; I believe the default of 5 was taken from Table 5 but it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 1:41 -- much improved over 5:54.
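> For reference, the number of passes can already be set explicitly through the public API today, via the existing {{setInitializationSteps}} setter on mllib's {{KMeans}}. This is the same example as above, just with 2 init steps instead of the default 5:
> {code}
> import org.apache.spark.mllib.random.RandomRDDs
> import org.apache.spark.mllib.clustering.KMeans
> val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
> data.count()
> // 2 passes of k-means|| instead of the current default of 5
> new KMeans().setK(1000).setInitializationSteps(2).setMaxIterations(5).run(data)
> {code}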
> Lastly, a small thing, but the code will perform a local k-means++ step to reduce the number of centers to k even when there are already at most k centers. This can be short-circuited.
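> A minimal sketch of that short-circuit, assuming the candidate centers and their weights have already been collected locally (names here are illustrative, not necessarily the actual variables in {{initKMeansParallel}}):
> {code}
> // Skip the local k-means++ reduction entirely when k-means|| already
> // produced at most k distinct candidate centers.
> if (centers.length <= k) {
>   centers
> } else {
>   LocalKMeans.kMeansPlusPlus(seed, centers, weights, k, 30)
> }
> {code}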
> It's also probably a good idea to deprecate the no-op getRuns/setRuns methods, which do nothing as of 2.0.0.
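> The deprecation itself would just be the usual Scala annotation on the no-op setter, e.g. (a sketch; the exact message and version string are up for review):
> {code}
> @deprecated("This has no effect; runs is always 1.", "2.1.0")
> def setRuns(runs: Int): this.type = this
> {code}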
> CC [~mengxr] or perhaps [~dbtsai] for a look. PR coming shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org