Posted to issues@spark.apache.org by "Abraham Zhan (JIRA)" <ji...@apache.org> on 2016/06/12 09:18:21 UTC
[jira] [Updated] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abraham Zhan updated SPARK-15346:
---------------------------------
Target Version/s: 2.0.0
> Reduce duplicate computation in picking initial points in LocalKMeans
> ---------------------------------------------------------------------
>
> Key: SPARK-15346
> URL: https://issues.apache.org/jira/browse/SPARK-15346
> Project: Spark
> Issue Type: Improvement
> Environment: Ubuntu 14.04
> Reporter: Abraham Zhan
> Assignee: Abraham Zhan
> Priority: Minor
> Labels: performance
> Fix For: 2.0.0
>
>
> h2.Main Issue
> I found that for KMans|| in mllib, when dataset is in large scale, after initial KMeans|| finishes and before Lloyd's iteration begins, the program will stuck for a long time without terminal. After testing I see it's stucked with LocalKMeans. And there is a to be improved feature in LocalKMeans.scala in Mllib. After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old centers as below
> {code:scala}
> val costArray = points.map { point =>
>   KMeans.fastSquaredDistance(point, centers(0))
> }
> {code}
> Instead, we can cache the distance from each point to its closest center so far, compare it against the distance from that point to the newly picked center, and keep the smaller value.
> h2.Test
> Download [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
> I have attached "LocalKMeans.zip", which contains the code "LocalKMeans2.scala" and the dataset "bigKMeansMedia".
> LocalKMeans2.scala contains both the original method, KMeansPlusPlus, and a modified version, KMeansPlusPlusModify (best fit with spark.mllib-1.6.0).
> I added tests and a main function so that anyone can run the file directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans2.scala, or just put them in the same directory.
> Modify the path in line 34 (loadAndRun()) to the path where you stored the data file bigKMeansMedia, which is also provided in the attachment.
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which are the cluster count k and the iteration count, respectively.
> The console will then print the running time and SE of the two versions of KMeans++.
> h2.Test Results
> This dataset was generated from a KMeans|| experiment in Spark: I added an internal function to output and save the result of the KMeans|| initialization.
> The first line of the file, in the format "%d:%d:%d:%d", gives "the seed : feature num : iteration num (in the original KMeans||) : points num" of the data.
> On my machine the experiment results are as below:
> !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
> The x-axis is the cluster count k and the y-axis is the time in seconds.
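The caching scheme described in the issue can be sketched as a standalone K-means++ seeding routine. This is an illustrative sketch, not the mllib implementation: `squaredDistance`, `KMeansPlusPlusSketch`, and `chooseInitialCenters` are hypothetical names, and `squaredDistance` stands in for mllib's `KMeans.fastSquaredDistance`.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  // Hypothetical stand-in for KMeans.fastSquaredDistance in spark.mllib.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => val d = x - y; d * d }.sum

  // Pick k initial centers. costArray(i) always holds the squared distance
  // from points(i) to its closest center chosen so far, so adding a center
  // costs one pass over the points instead of recomputing distances to
  // every previously chosen center.
  def chooseInitialCenters(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    val rand = new Random(seed)
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length))
    val costArray = points.map(squaredDistance(_, centers(0)))

    for (i <- 1 until k) {
      // Sample the next center with probability proportional to its cached cost.
      var r = rand.nextDouble() * costArray.sum
      var j = 0
      while (j < points.length - 1 && r > costArray(j)) { r -= costArray(j); j += 1 }
      centers(i) = points(j)
      // Incremental update: compare each cached cost with the distance to the
      // new center only, keeping the minimum.
      for (p <- points.indices) {
        costArray(p) = math.min(costArray(p), squaredDistance(points(p), centers(i)))
      }
    }
    centers
  }
}
```

With this cache, each of the k seeding rounds is a single O(n) pass, instead of the O(n * i) work of recomputing distances to all i centers chosen so far.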
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org