Posted to issues@spark.apache.org by "Derrick Burns (JIRA)" <ji...@apache.org> on 2014/09/15 20:23:33 UTC

[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib

    [ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134246#comment-14134246 ] 

Derrick Burns commented on SPARK-2308:
--------------------------------------

I have implemented MiniBatch KMeans in Spark.  

We do not need a special iterator type or random access to get the advantages of MiniBatch because the gain comes primarily from decreasing the number of distance calculations, and not from decreasing the number of points that are touched.  

MiniBatch works well when the number of points is dramatically larger than the number of clusters.  In that case, any sample of points will affect a large number of clusters, leading to faster convergence.  MiniBatch is less useful when the number of desired clusters is large.  
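
To make the cost argument concrete, here is a minimal sketch of one mini-batch round in plain Python. It is not the Spark implementation mentioned above. The function names, the per-center running-mean update, and the use of `random.sample` in place of an RDD sample are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def minibatch_round(points, centers, counts, batch_size, rng):
    """Run one mini-batch round in place and return the number of
    distance calculations it performed (hypothetical helper)."""
    k = len(centers)
    batch = rng.sample(points, batch_size)
    # Assign every sampled point against the centers as they stood at the
    # start of the round. Only the batch incurs distance calculations.
    nearest = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in batch]
    # Move each chosen center toward its point with a per-center step size
    # of 1 / (points seen so far), i.e. a running mean (assumed update rule).
    for p, j in zip(batch, nearest):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return batch_size * k
```

Each round costs batch_size * k distance calculations instead of n * k, which is where the saving comes from. The sample can be drawn by any means, so no special iterator is required.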

A better approach is to track which clusters are dirty and which points are assigned to which clusters. Using this information, one can eliminate more and more distance calculations per round.  This leads to shorter and shorter rounds, and consequently faster convergence. 
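
The following is a rough sketch of that idea in plain Python, under stated assumptions. A cluster is treated as dirty when its membership changed in the previous round, so its center moved. A point in a clean cluster is compared only against dirty centers, and a point in a dirty cluster is fully re-evaluated. The names and this particular pruning rule are illustrative and are not taken from the implementation described above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans_dirty_tracking(points, init_centers, max_rounds=100):
    """Lloyd-style k-means that skips distances to unchanged centers.
    Returns (centers, assignments, distance calculations per round)."""
    k = len(init_centers)
    centers = [list(c) for c in init_centers]
    assign = [-1] * len(points)            # current cluster of each point
    best = [float("inf")] * len(points)    # cached distance to that cluster
    dirty = set(range(k))                  # every center is new in round one
    calcs_per_round = []
    for _ in range(max_rounds):
        if not dirty:
            break                          # no center moved: converged
        calcs = 0
        changed = set()
        for i, p in enumerate(points):
            if assign[i] < 0 or assign[i] in dirty:
                # Own center moved (or none yet): re-check every center.
                candidates, cur, cur_d = range(k), -1, float("inf")
            else:
                # Own center and all clean centers are unchanged, so only
                # a dirty center can now be closer than the cached one.
                candidates, cur, cur_d = dirty, assign[i], best[i]
            for j in candidates:
                d = sq_dist(p, centers[j])
                calcs += 1
                if d < cur_d:
                    cur, cur_d = j, d
            if cur != assign[i]:
                changed.add(cur)
                if assign[i] >= 0:
                    changed.add(assign[i])
                assign[i] = cur
            best[i] = cur_d
        calcs_per_round.append(calcs)
        # Recompute only the centers whose membership changed.
        for j in changed:
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centers[j] = mean(members)
        dirty = changed
    return centers, assign, calcs_per_round
```

In this sketch the full re-check for points in dirty clusters is deliberately crude. Even so, once most clusters stop changing, the per-round distance count drops well below n * k.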


> Add KMeans MiniBatch clustering algorithm to MLlib
> --------------------------------------------------
>
>                 Key: SPARK-2308
>                 URL: https://issues.apache.org/jira/browse/SPARK-2308
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>         Attachments: many_small_centers.pdf, uneven_centers.pdf
>
>
> Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy).  The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib.
> I suggest adding KMeans Mini-batch as an alternative.
> I'd like this to be assigned to me.


