Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/09/04 11:48:20 UTC

[jira] [Updated] (SPARK-17389) KMeans speedup with better choice of k-means|| init steps = 2

     [ https://issues.apache.org/jira/browse/SPARK-17389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17389:
------------------------------
       Priority: Minor  (was: Major)
    Description: 
As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time is spent in kmeans|| initialization. For example, with this simple setup:

{code}
// Assumes a spark-shell session, where sc (the SparkContext) is already defined.
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.clustering.KMeans

// 1,000,000 random 64-dimensional points, cached and materialized up front
val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
data.count()

// k = 1000; almost all of the runtime goes to k-means|| initialization
new KMeans().setK(1000).setMaxIterations(5).run(data)
{code}

Init takes 5:54, and iterations take about 0:15 each, on my laptop. That means init costs roughly as much as 24 iterations (about 354 s / 15 s per iteration), which is a typical run length, so half the time goes just to picking the initial cluster centers. This seems excessive.

There are two ways to speed this up significantly. First, the implementation still carries an old "runs" parameter that is now always 1. It used to allow several clusterings to be computed at once. The code can be simplified significantly now that runs = 1 always. That cleanup is already covered by SPARK-11560, but even a simple refactoring yields about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about, though.

By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 passes are plenty, certainly when l = 2k as in our implementation. (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6, but the paper isn't suggesting 5 is an optimal value.) Defaulting to 2 brings init down to 1:41 -- much improved over 5:54.
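Note that callers don't have to wait for the default to change: the existing setInitializationSteps setter already exposes this. A minimal sketch, reusing the data RDD from the example above:

{code}
import org.apache.spark.mllib.clustering.KMeans

// Same data as above; explicitly request 2 k-means|| passes rather than the
// current default of 5.
val model = new KMeans()
  .setK(1000)
  .setMaxIterations(5)
  .setInitializationSteps(2)  // the value this issue proposes as the new default
  .run(data)
{code}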

Lastly, a small thing, but the code will run a local k-means++ step to reduce the number of centers to k even when there are already <= k candidate centers. This can be short-circuited. However, that is really the topic of SPARK-3261, because it can also cause fewer than k clusters to be returned where that would actually be correct.
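
For illustration only, here is a rough sketch of that short-circuit. It is not the actual MLlib initialization code; the candidate centers and the local k-means++ step are passed in as placeholders:

{code}
import org.apache.spark.mllib.linalg.Vector

// Hedged sketch of the short-circuit idea only, not the real MLlib internals.
// "candidates" stands in for the centers gathered by the k-means|| passes, and
// "localKMeansPlusPlus" for the existing local reduction step.
def finalizeCenters(candidates: Array[Vector], k: Int,
    localKMeansPlusPlus: (Array[Vector], Int) => Array[Vector]): Array[Vector] = {
  val distinct = candidates.distinct
  if (distinct.length <= k) distinct      // already <= k centers: nothing to reduce
  else localKMeansPlusPlus(distinct, k)   // otherwise cut the candidates down to k
}
{code}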

  was:
As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time is spent in kmeans|| initialization. For example, with this simple setup:

{code}
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.clustering.KMeans

val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
data.count()

new KMeans().setK(1000).setMaxIterations(5).run(data)
{code}

Init takes 5:54, and iterations take about 0:15 each, on my laptop. That means init costs roughly as much as 24 iterations (about 354 s / 15 s per iteration), which is a typical run length, so half the time goes just to picking the initial cluster centers. This seems excessive.

There are two ways to speed this up significantly. First, the implementation still carries an old "runs" parameter that is now always 1. It used to allow several clusterings to be computed at once. The code can be simplified significantly now that runs = 1 always. This results in about a 13% init speedup, from 5:54 to 5:09 in this example.

By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 passes are plenty. (See Figure 5.2; I believe the default of 5 was taken from Table 5, but the paper isn't suggesting 5 is an optimal value.) Defaulting to 2 brings init down to 1:41 -- much improved over 5:54.

Lastly, a small thing, but the code will run a local k-means++ step to reduce the number of centers to k even when there are already <= k candidate centers. This can be short-circuited.

It's also probably a good idea to deprecate the no-op getRuns/setRuns methods, which do nothing as of 2.0.0.

CC [~mengxr] or perhaps [~dbtsai] for a look. PR coming shortly.

        Summary: KMeans speedup with better choice of k-means|| init steps = 2  (was: Significant KMeans speedup with better choice of init steps, optimizing to remove 'runs')

> KMeans speedup with better choice of k-means|| init steps = 2
> -------------------------------------------------------------
>
>                 Key: SPARK-17389
>                 URL: https://issues.apache.org/jira/browse/SPARK-17389
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Minor
>


