You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mahout.apache.org by co...@apache.org on 2008/02/03 16:15:00 UTC

[CONF] Apache Lucene Mahout: k-Means (page edited)

k-Means (MAHOUT) edited by Isabel Drost
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/k-Means
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75159&originalVersion=2&revisedVersion=3






Content:
---------------------------------------------------------------------

h1. kMeans

k-Means is a rather simple but well known algorithms for grouping objects, clustering. Again all objects need to be represented as a set of numerical features. In addition the user has to specify the number of groups (referred to as _k_) he wishes to identify.
Each object can be thought of as being represented by some feature vector in an _n_ dimensional space, _n_ being the number of all features used to describe the objects to cluster. The algorithm than randomly chooses _k_ points in that vector space, these point serve as the initial centers of the clusters. Afterwards all objects are each assigned to center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task.
After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges. The algorithm can be proven to converge after a finite number of iterations.
Several tweaks concerning distance measure, initial center choice and computation of new average centers have been explored, as well as the estimation of the number of clusters _k_. Yet the main principle always remains the same.



h2. Strategy for parallelization

Some ideas can be found in [Cluster computing and MapReduce|http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html] lecture video series \[by Google(r)\]; k-Mean clustering is discussed in [lecture #4|http://www.youtube.com/watch?v=1ZDybXl212Q]. Slides can be found [here|http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt].

Interestingly, Hadoop based implementation using [Canopy-clustering|http://en.wikipedia.org/wiki/Canopy_clustering_algorithm] seems to be here: [http://code.google.com/p/canopy-clustering/] (GPL 3 licence)

h2. Design of implementation

---------------------------------------------------------------------
CONFLUENCE INFORMATION
This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences
   http://cwiki.apache.org/confluence/users/viewnotifications.action

If you think it was sent incorrectly contact one of the administrators
   http://cwiki.apache.org/confluence/administrators.action

If you want more information on Confluence, or have a bug to report see
   http://www.atlassian.com/software/confluence