Posted to commits@mahout.apache.org by co...@apache.org on 2010/10/10 20:16:00 UTC

[CONF] Apache Mahout > Mean Shift Clustering

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Mean Shift Clustering (https://cwiki.apache.org/confluence/display/MAHOUT/Mean+Shift+Clustering)

Change Comment:
---------------------------------------------------------------------
Updated text to reflect 0.4 implementation

Edited by Jeff Eastman:
---------------------------------------------------------------------
"Mean Shift: A Robust Approach to Feature Space Analysis" (http://www.caip.rutgers.edu/riul/research/papers/pdf/mnshft.pdf) introduces the geneology of the mean shift custering procedure which dates back to work in pattern recognition in 1975. The paper contains a detailed derivation and several examples of the use of mean shift for image smooting and segmentation. "Mean Shift Clustering" (http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf) presents an overview of the algorithm with a summary of the derivation. An attractive feature of mean shift clustering is that it does not require a-priori knowledge of the number of clusters (as required in k-means) and it will produce arbitrarily-shaped clusters that depend upon the topology of the data (unlike canopy).

The algorithm begins with a set of data points and creates a fixed-radius window for each. It then iterates over the windows, calculating for each a mean shift vector that points in the direction of the maximum increase in the local density function. Each window is migrated to a new position along its vector and the iteration resumes. The iterations complete when every window has reached a local maximum of the density function and its mean shift vector has become negligible.
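The core iteration above can be sketched in a few lines of Java. This is an illustrative, self-contained toy on 1-d points with a flat kernel, not Mahout's implementation; the class name MeanShiftSketch and the constants RADIUS and DELTA are invented for the example.

```java
import java.util.Arrays;

public class MeanShiftSketch {
    static final double RADIUS = 1.0;   // fixed window radius
    static final double DELTA = 1e-4;   // termination threshold on the shift

    // Shift one window center to the mean of all points within RADIUS of it.
    static double shift(double center, double[] points) {
        double sum = 0.0;
        int count = 0;
        for (double p : points) {
            if (Math.abs(p - center) <= RADIUS) {
                sum += p;
                count++;
            }
        }
        return count == 0 ? center : sum / count;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 0.1, 0.2, 5.0, 5.1, 5.2};
        // one window per data point
        double[] centers = Arrays.copyOf(points, points.length);
        boolean moved = true;
        while (moved) {
            moved = false;
            for (int i = 0; i < centers.length; i++) {
                double next = shift(centers[i], points);
                if (Math.abs(next - centers[i]) > DELTA) {
                    moved = true;
                }
                centers[i] = next;
            }
        }
        // the windows converge onto the two local density maxima
        System.out.println(Arrays.toString(centers));
    }
}
```

Note that convergence is detected exactly as described above: iteration stops once no window's shift exceeds the termination delta.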

h2. Reference Implementation

The implementation introduced by MAHOUT-15 uses modified Canopy Clustering canopies to represent the mean shift windows. 
* It uses the canopy's T1 distance threshold as the radius of the window, and the canopy's T2 threshold to decide when two canopies have converged and will thereby follow the same path. 
* Each canopy contains one or more bound points, which are represented by their (internally specified) integer ids. 
* The algorithm is initialized with a canopy containing each input point. 
* During each iteration, every canopy calculates its mean shift vector by summing the canopy centers within its T1 threshold. This value is normalized and the resulting centroid becomes the new canopy center.
** The centers are weighted in proportion to their numbers of bound points (weighted pair-group centroid)
* If any canopies are within their T2 thresholds they are merged and their respective bound points are accumulated. 
* The iterations complete when each canopy's mean shift vector has a magnitude less than a given termination delta. 
* Upon termination, the remaining canopies contain sets of points which are the members of their cluster.
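The weighted-centroid and merge rules above can be sketched as follows. This is a 1-d illustration under assumed names (CanopyMergeSketch, Canopy, shiftedCenter, mergeClose), not Mahout's MeanShiftCanopy code.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopyMergeSketch {
    static final double T1 = 3.0;  // influence radius
    static final double T2 = 0.5;  // merge radius

    static class Canopy {
        double center;     // 1-d center for simplicity
        int boundPoints;   // number of points this canopy represents
        Canopy(double c, int n) { center = c; boundPoints = n; }
    }

    // New center: bound-point-weighted mean of all canopy centers within T1
    // (the weighted pair-group centroid described above).
    static double shiftedCenter(Canopy c, List<Canopy> all) {
        double weightedSum = 0.0;
        int totalWeight = 0;
        for (Canopy other : all) {
            if (Math.abs(other.center - c.center) <= T1) {
                weightedSum += other.center * other.boundPoints;
                totalWeight += other.boundPoints;
            }
        }
        return weightedSum / totalWeight;
    }

    // Merge canopies whose centers are within T2, accumulating bound points.
    static List<Canopy> mergeClose(List<Canopy> canopies) {
        List<Canopy> merged = new ArrayList<>();
        for (Canopy c : canopies) {
            Canopy near = null;
            for (Canopy m : merged) {
                if (Math.abs(m.center - c.center) <= T2) { near = m; break; }
            }
            if (near == null) {
                merged.add(new Canopy(c.center, c.boundPoints));
            } else {
                near.boundPoints += c.boundPoints;  // accumulate bound points
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Canopy> canopies = new ArrayList<>();
        canopies.add(new Canopy(0.0, 1));
        canopies.add(new Canopy(0.3, 1));  // within T2 of the first: merges
        canopies.add(new Canopy(10.0, 1));
        List<Canopy> merged = mergeClose(canopies);
        System.out.println(merged.size() + " canopies after merging");
    }
}
```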

h2. Map/Reduce Implementation

* Each mapper receives a subset of the canopies at each iteration. It compares each canopy with every canopy it has already seen, performing the T1 and T2 distance tests using an arbitrary user-supplied DistanceMeasure. The mapper merges canopies within the T2 distance, moves each canopy to its new centroid position and outputs the canopy to the reducer with a constant key.
* A single reducer coalesces all the canopies from the mappers by performing another clustering iteration on them.
* A driver class manages the iterations, terminating when either the maximum number of iterations or the convergence criterion is reached. 
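The driver's control flow can be illustrated with a small loop. The names DriverLoopSketch, iterate and run are invented, and the "shift" here is a stand-in for one map/reduce pass; only the stop-when-converged-or-max-iterations logic mirrors the description above.

```java
public class DriverLoopSketch {
    // Stand-in for one map/reduce clustering pass over all windows.
    // Returns true if any center moved (i.e. not yet converged).
    static boolean iterate(double[] centers) {
        boolean moved = false;
        for (int i = 0; i < centers.length; i++) {
            double next = Math.round(centers[i]); // toy shift: snap to integers
            if (Math.abs(next - centers[i]) > 1e-6) moved = true;
            centers[i] = next;
        }
        return moved;
    }

    // Iterate until convergence or until maxIterations passes have run;
    // returns the number of iterations actually performed.
    static int run(double[] centers, int maxIterations) {
        int iteration = 0;
        boolean converged = false;
        while (!converged && iteration < maxIterations) {
            converged = !iterate(centers);
            iteration++;
        }
        return iteration;
    }

    public static void main(String[] args) {
        double[] centers = {0.4, 2.6};
        System.out.println("converged after " + run(centers, 10) + " iterations");
    }
}
```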


h2. Running Mean Shift Clustering

The Mean Shift clustering algorithm may be run using a command-line invocation on MeanShiftCanopyDriver.main or by making a Java call to MeanShiftCanopyDriver.runJob(). Both require several arguments:

# input: a file path string to a directory containing the input data set as SequenceFile(WritableComparable, VectorWritable). The sequence file _key_ is not used.
# output: a file path string to an empty directory which is used for all output from the algorithm.
# measure: the fully-qualified class name of an instance of DistanceMeasure which will be used for the clustering.
# t1: the T1 threshold is used to determine if clusters are close enough to influence each other's next mean calculation.
# t2: the T2 threshold is used to determine when two clusters are close enough to merge.
# convergence: a double value used to determine if the algorithm has converged (no cluster has moved more than this value in the last iteration).
# max-iterations: the maximum number of iterations to run, independent of the convergence delta specified.
# inputIsClusters: a boolean indicating, if true, that the input directory already contains MeanShiftCanopies and no further initialization is needed. If false (the default) input VectorWritables are used to form the initial canopies and these will be written to the clusters-0 directory.
# runClustering: a boolean indicating, if true, that the clustering step is to be executed after clusters have been determined.
# runSequential: a boolean indicating, if true, that the clustering is to be done using the sequential reference implementation in memory.

After running the algorithm, the output directory will contain:
# clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) produced by the algorithm for each iteration. The Text _key_ is a cluster identifier string.
# clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is the clusterId. The WeightedVectorWritable _value_ is a bean containing a double _weight_ and a VectorWritable _vector_ where the weight indicates the probability that the vector is a member of the cluster. As Mean Shift only produces a single clustering for each point, the weights are all == 1.

h1. Examples

The following images illustrate Mean Shift clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See the README file at [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt|http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt] for details on running similar examples.

The points are generated as follows:

* 500 samples m=\[1.0, 1.0\] sd=3.0
* 300 samples m=\[1.0, 0.0\] sd=0.5
* 300 samples m=\[0.0, 2.0\] sd=0.1

In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed. 

!SampleData.png!

In the second image, the three resulting clusters are shown superimposed upon the sample data. Each cluster renders in a different color, and the T1 and T2 radii are superimposed upon the final cluster centers determined by the algorithm. Mean Shift does an excellent job of clustering this data, though by design the cluster membership is unique and the clusters do not overlap. 

!MeanShift.png!

The third image shows the results of running Mean Shift on a different data set (see [Dirichlet Process Clustering] for details) which is generated using asymmetrical standard deviations. Mean Shift does an excellent job of clustering this data set too.

!2dMeanShift.png!

