You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mahout.apache.org by co...@apache.org on 2009/05/11 09:40:00 UTC

[CONF] Apache Lucene Mahout: Fuzzy K-Means (page edited)

Fuzzy K-Means (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+K-Means
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=95315&originalVersion=7&revisedVersion=8






Content:
---------------------------------------------------------------------

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of [K-Means|http://cwiki.apache.org/MAHOUT/k-means.html], the popular simple clustering technique. While K-Means discovers hard clusters (a point belong to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters where a particular point can belong to more than one cluster with certain probability.

h4. Algorithm

Like K-Means, Fuzzy K-Means works on those objects which can be represented in n-dimensional vector space and a distance measure is defined.
The algorithm is similar to k-means.
* Initialize k clusters
* Until converged
** Compute the probability of a point belong to a cluster for every <point,cluster> pair
** Recompute the cluster centers using above probability membership values of points to clusters

h4. Design Implementation

The design is similar to K-Means present in Mahout. It accepts an input file containing vector points. User can either provide the cluster centers as input or can allow canopy algorithm to run and create initial clusters.

Similar to K-Means, the program doesn't modify the input directories. And for every iteration, the cluster output is stored in a directory cluster-N. The code has set number of reduce tasks equal to number of map tasks. So, those many part-0
\\
\* files are created in clusterN directory. The code uses driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to&nbsp; KMeansDriver. It iterates over input points and cluster points for specified number of iterations or until it is converged.During every iteration i, a new cluster-i directory is created which contains the modified cluster centers obtained during FuzzyKMeans iteration. This will be feeded as input clusters in the next iteration.&nbsp; Once Fuzzy KMeans is run for specified number of iterations or until it is converged, a map task is run to output "the point and the cluster membership to each cluster" pair as final output to a directory named "points".

FuzzyKMeansMapper - reads the input cluster during its configure() method, then&nbsp; computes cluster membership probability of a point to each cluster.Cluster membership is inversely propotional to the distance. Distance is computed using&nbsp; user supplied distance measure. Output key is encoded cluster. Output values is. <probabilityValue>:<inputPoint>

FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums of the cluster membership probability times input vectors for each cluster. Output key is: encoded cluster. Output value is "<sum of cluster membership values in the partial sum>, <partial sum vector summing all such points>".

FuzzyKMeansReducer - Multiple reducers receives certain keys and all values associated with those keys. The reducer sums the values to produce a new centroid for the cluster which is output. Output key is: encoded cluster identifier (e.g. "C14". Output value is: formatted cluster (e.g. "C14 - \[c1, c2, ..., cn, \]). The reducer encodes unconverged clusters with a 'Cn' cluster Id and converged clusters with 'Vn' clusterId.

h4. References&nbsp;

* [http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering]

---------------------------------------------------------------------
CONFLUENCE INFORMATION
This message is automatically generated by Confluence

Unsubscribe or edit your notifications preferences
   http://cwiki.apache.org/confluence/users/viewnotifications.action

If you think it was sent incorrectly contact one of the administrators
   http://cwiki.apache.org/confluence/administrators.action

If you want more information on Confluence, or have a bug to report see
   http://www.atlassian.com/software/confluence