Posted to commits@mahout.apache.org by sq...@apache.org on 2014/05/02 22:00:30 UTC

svn commit: r1592024 - /mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext

Author: squinn
Date: Fri May  2 20:00:30 2014
New Revision: 1592024

URL: http://svn.apache.org/r1592024
Log:
Draft of spectral clustering documentation update for Mahout.

Modified:
    mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext?rev=1592024&r1=1592023&r2=1592024&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/clustering/spectral-clustering.mdtext Fri May  2 20:00:30 2014
@@ -1,265 +1,77 @@
 Title: Spectral Clustering
 
-# Spectral Clustering
+# Spectral Clustering Overview
 
-Spectral clustering, a more powerful and specialized algorithm (compared to
-K-means), derives its name from spectral analysis of a graph, which is how
-the data are represented. Each object to be clustered can initially be
-represented as an _n_\-dimensional numeric vector, but the difference with
-this algorithm is that there must also be some method for performing a
-comparison between each object and expressing this comparison as a scalar.
-
-This _n_ by _n_ comparison of all objects with all others forms the
-_affinity_ matrix, which can be intuitively thought of as a rough
-representation of an underlying undirected, weighted, and fully-connected
-graph whose edges express the relative relationships, or affinities,
-between each pair of objects in the original data. This affinity matrix
-forms the basis from which the two spectral clustering algorithms operate.
-
-The equation by which the affinities are calculated can vary depending on
-the user's circumstances; typically, the equation takes the form of:
-
-exp( _-d^2_ / _c_ )
-
-where _d_ is the Euclidean distance between a pair of points, and _c_ is a
-scaling factor. _c_ is often calculated relative to a _k_\-neighborhood of
-closest points to the current point; all other affinities are set to 0
-outside of the neighborhood. Again, this formula can vary depending on the
-situation (e.g. a fully-connected graph would ignore the _k_\-neighborhood
-and calculate affinities for all pairs of points).
-
-[Full overview on spectral clustering](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
-
-<a name="SpectralClustering-K-MeansSpectralClustering"></a>
-## K-Means Spectral Clustering
-
-<a name="SpectralClustering-Overview"></a>
-### Overview
-
-This consists of a few basic steps of generalized spectral clustering,
-followed by standard k-means clustering over the intermediate results.
-Again, this process begins with an affinity matrix *A* \- whether or not it
-is fully-connected depends on the user's need.
-
-*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
-with a diagonal matrix whose entries consist of the sums of the rows of
-*A*. The sums are modified to be the inverse square root of their original
-values. The final operation looks something like:
-
-L = D^{-1/2} A D^{-1/2}
-
-*L* has some properties that are of interest to us; most importantly, while
-it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
-decomposed into its constituent eigenvectors and corresponding eigenvalues
-(though the latter will not be needed for future calculations); the matrix
-of eigenvectors, *U*, is what we are now interested in.
-
-Assuming *U* is a column matrix (the eigenvectors comprise the columns),
-then we will now use the _rows_ of *U* as proxy data for the original data
-points. We will run each row through standard K-means clustering, and the
-label that each proxy point receives will be transparently assigned to the
-corresponding original data point, resulting in the final clustering
-assignments.
-
-[Full overview on k-means spectral clustering](http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/)
-
-<a name="SpectralClustering-Implementation"></a>
-### Implementation
-
-The Mahout implementation consists of a single driver -
-SpectralKMeansDriver - calling upon several common utilities. The driver
-performs the operations in sequential fashion: reading in and constructing
-the affinity matrix, building the diagonal matrix, building the
-pseudo-Laplacian and decomposing it, and clustering the components.
-
-The affinity matrix input is the most important part of this algorithm. It
-consists of text files which follow a specific format: that of a weighted,
-undirected graph. In order to represent a graph in text files, each line of
-a text file represents a single directional edge between two nodes. There
-are three comma-separated values on the line. The first number indicates
-the source node, the second is the destination node, and the third is the
-weight. For example:
-
-0, 1, 2.5
-
-would indicate the directional edge from node 0 to node 1 has a weight of
-2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix
-is symmetric, hence there should be a corresponding line in the text file
-of: 1, 0, 2.5.* Also, each node should be an integer value.
-
-*note*: per MAHOUT-978 don't name your input file with leading '_' (or
-'.').
-
-M/R jobs written for SpectralKMeans:
-* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
-* MatrixDiagonalizeJob (constructs the diagonal matrix)
-* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
-* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
-
-M/R jobs already in Mahout that were used:
-* DistributedRowMatrix.transpose()
-* DistributedLanczosSolver
-* EigenVerfierJob
-* KMeansDriver
-
-<a name="SpectralClustering-EigencutsSpectralClustering"></a>
-## Eigencuts Spectral Clustering
-
-<a name="SpectralClustering-Overview"></a>
-### Overview
-
-Intuitively, Eigencuts can be thought of as part 2 of the K-means
-algorithm, in that it performs the same initial steps up until the k-means
-clustering. The algorithm uses the same affinity matrix *A*, constructs the
-same diagonal matrix *D*, performs the same multiplication to create the
-pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain
-the eigenvectors and eigenvalues. But here is where the two algorithms
-differentiate.
-
-For each eigenvector, we wish to determine how stable its flow of
-probability is within the underlying graph of the original data.
-Intuitively, this is intrinsically related to the min-cut, max-flow problem
-of finding bottlenecks: if we perturb the flow rate on a specific edge, and
-overall the flow is stable, then we can conclude that this edge was not a
-bottleneck. If, however, perturbing an edge significantly alters the
-overall flow, we know this edge's eigenflow is very unstable and is a
-bottleneck.
-
-We have an [explicit form](http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png)
- of this "sensitivity" calculation ([full post here, under "computing
-sensitivities"|http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/]).
-The next step is called "non-maximal suppression", which effectively means
-we will ignore any of the calculated sensitivities for which there exists
-another more negative sensitivity in the same neighborhood as the current
-one, effectively "suppressing" it.
-
-This non-maximal suppression then plays a role in the final
-affinity-cutting step, where we "cut" the affinity (set to 0) between any
-two nodes (effectively destroying the edge between them) for which the
-sensitivity calculated at that edge passes some threshold, _and_ for which
-the sensitivity was _not_ suppressed in the previous step.
-
-Once the cutting has been completed, the process loops upon itself
-(starting with the recalculation of *D* using the modified *A*) until no
-new cuts in *A* are made in the final step.
-
-[Full overview on Eigencuts spectral clustering](http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/)
-
-<a name="SpectralClustering-Implementation"></a>
-### Implementation
-
-Since the first half of Eigencuts uses the same calculations as Spectral
-K-means, it uses the same common M/R tasks, both those specific to spectral
-clustering, as well as those general to Mahout. Unlike SpectralKMeans,
-however, there are no DistributedRowMatrix-specific operations performed,
-and hence there is no need for the data type at all; Mahout Vectors are
-used heavily instead.
-
-Once the initial affinity matrix is constructed, there is a loop within the
-EigencutsDriver over the calculation of *D*, the creation of *L* and its
-eigen-decomposition, the calculation of the sensitivities, and the actual
-affinity cuts, such that the loop terminates only when no cuts are made to
-the affinity matrix *A*. The final matrix *A* will then be representative
-of a graph structure whose data points exist in intra-connected clusters.
-
-M/R tasks specific to Eigencuts:
-* EigencutsSensitivityJob (calculates the perturbation effects on edge
-weights)
-* EigencutsAffinityCutsJob (sets edge weights to 0)
-
-M/R tasks within spectral clustering:
-* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
-* MatrixDiagonalizeJob (constructs the diagonal matrix)
-* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
-
-M/R tasks general to Mahout:
-* DistributedLanczosSolver
-* EigenVerifierJob
-
-<a name="SpectralClustering-Quickstart"></a>
-## Quickstart
-
-As noted before, the data for both these algorithms - the affinity matrix -
-is required to be symmetric. As of the first release, there is no built-in
-mechanism in Mahout for generating the affinities from raw data, as the
-formula this follows varies depending on the user's need, so it is left as
-an exercise to the user to generate the affinities prior to using these
-algorithms.
-
-The affinity input should follow the standard format for textual
-representation of a graph, namely:
-
-node_i, node_j, value_ij
-
-For example, the following 3x3 affinity matrix:
-
-<table>
-<tr><td> 0.0 </td><td> 0.8 </td><td> 0.5 </td></tr>
-<tr><td> 0.8 </td><td> 0.0 </td><td> 0.9 </td></tr>
-<tr><td> 0.5 </td><td> 0.9 </td><td> 0.0 </td></tr>
-</table>
-
-would have the following textual input format:
-
-0, 0, 0
-0, 1, 0.8
-0, 2, 0.5
-1, 0, 0.8
-1, 1, 0
-1, 2, 0.9
-2, 0, 0.5
-2, 1, 0.9
-2, 2, 0
-
-Then simply invoke Spectral K-means or Eigencuts, using the input directory
-and setting the other parameters as necessary (e.g. "k" for K-means, "beta"
-for Eigencuts, etc).
-
-Format *rules*: node count begins at zero, is continuous. avoid whitespace,
-comments etc.
-
-<a name="SpectralClustering-Examples"></a>
-## Examples
-
-*NOTE: Am still waiting for Carnegie Mellon/Univ. of Pittsburgh approval
-for official data set*
-
-For these algorithms, it is useful to have a viable example, so I have
-created a small but effective synthetic data set to show how these
-algorithms operate. The raw data was generated synthetically and [can be viewed here](http://dl.dropbox.com/u/1377610/rawdata.csv)
-. It consists of 450 two-dimensional points drawn from 3 separate gaussian
-distributions their own means and standard deviations ([here is an image of
-the plotted
-points|http://spectrallyclustered.files.wordpress.com/2010/07/clusters.png].
-This same data in affinity matrix form looks like this: [view the affinity data set|http://dl.dropbox.com/u/1377610/affinity.csv]
-
-In order to run the program, then, for spectral k-means:
-
-bin/mahout spectralkmeans \-i /path/to/directory/with/affinity/matrix \-o
-/output/path \-k 3 \-d 450
-
-and for eigencuts:
-
-bin/mahout eigencuts \-i /path/to/directory/with/affinity/matrix \-o
-/output/path \-b 2 \-d 450
-
-In both cases, the "-d" flag refers to the dimensionality of the affinity
-matrix; since there are 450 points, this value will be 450. In spectral
-k-means, the "-k" is the number of clusters to estimate. In Eigencuts, the
-"-b" refers to an initial estimate on the half-life of the flow of
-probability in the graph; higher estimates correspond to fewer clusters.
-
-Spectral k-means will eventually yield the clustering results, and there
-should only be a single mistake: point 54. This is because that particular
-point is in between the two left-hand clusters in the image, and so has
-high affinities with both clusters.
-
-[Full overview of example here](http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/)
-
-<a name="SpectralClustering-Resources"></a>
-### Resources
-
-* [http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
-* [http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf](http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf)
-* [http://en.wikipedia.org/wiki/Laplacian_matrix](http://en.wikipedia.org/wiki/Laplacian_matrix)
-* [http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering](http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering)
+Spectral clustering, as its name implies, makes use of the spectrum (or eigenvalues) of the similarity matrix of the data. It examines the _connectedness_ of the data, whereas other clustering algorithms such as k-means use the _compactness_ to assign clusters. Consequently, in situations where k-means performs well, spectral clustering will also perform well. Additionally, there are situations in which k-means will underperform (e.g. concentric circles), but spectral clustering will be able to segment the underlying clusters. Spectral clustering is also very useful for image segmentation.
+
+At its simplest, spectral clustering relies on the following four steps:
+
+ 1. Computing a similarity (or _affinity_) matrix $\mathbf{A}$ from the data. This involves determining a pairwise distance function $f$ that takes a pair of data points and returns a scalar.
+
+ 2. Computing a graph Laplacian $\mathbf{L}$ from the affinity matrix. There are several types of graph Laplacians; which one is used often depends on the situation.
+
+ 3. Computing the eigenvectors and eigenvalues of $\mathbf{L}$. The degree of this decomposition is often modulated by $k$, or the number of clusters. Put another way, $k$ eigenvectors and eigenvalues are computed.
+
+ 4. The $k$ eigenvectors are used as "proxy" data for the original dataset, and fed into k-means clustering. The resulting cluster assignments are transparently passed back to the original data.
+
+For more theoretical background on spectral clustering, such as how affinity matrices are computed, the different types of graph Laplacians, and whether the top or bottom eigenvectors and eigenvalues are computed, please read [Ulrike von Luxburg's article in _Statistics and Computing_ from December 2007](http://link.springer.com/article/10.1007/s11222-007-9033-z). It provides an excellent description of the linear algebra operations behind spectral clustering, and conveys a thorough understanding of the types of situations in which it can be used.
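
The four steps above can be sketched end-to-end with NumPy. This is a hypothetical, non-Mahout illustration on toy data, assuming an RBF affinity function, the normalized Laplacian form $\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$, and a minimal k-means loop standing in for the full k-means step:

```python
import numpy as np

# Toy data: two well-separated 2-D blobs (hypothetical example data).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 2)),
])
k = 2  # number of clusters

# Step 1: affinity matrix via the RBF kernel, exp(-||x - y||^2 / (2 * sigma^2)).
sigma = 1.0  # assumed scaling parameter
sq_dists = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
A = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(A, 0.0)  # no self-affinity

# Step 2: normalized graph Laplacian form L = D^{-1/2} A D^{-1/2}.
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Step 3: top-k eigenvectors of L (eigh returns eigenvalues in ascending order).
_, vecs = np.linalg.eigh(L)
U = vecs[:, -k:]
U = U / np.linalg.norm(U, axis=1, keepdims=True)  # unit-length rows

# Step 4: k-means on the rows of U as proxy data; initialize the two
# centers at the most distant pair of embedded rows, then run Lloyd's loop.
dists = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
i, j = np.unravel_index(np.argmax(dists), dists.shape)
centers = U[[i, j]]
for _ in range(20):
    labels = np.argmin(((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    centers = np.array([U[labels == c].mean(axis=0) for c in range(k)])

print(labels)
```

The cluster labels computed on the rows of `U` are assigned back to the original points, so the two blobs end up in separate clusters.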
+
+# Mahout Spectral Clustering
+
+As of Mahout 0.3, spectral clustering has been implemented to take advantage of the MapReduce framework. It uses [SSVD](http://mahout.apache.org/users/dim-reduction/ssvd.html) for dimensionality reduction of the input data set, and [k-means](http://mahout.apache.org/users/clustering/k-means-clustering.html) to perform the final clustering.
+
+**([MAHOUT-1538](https://issues.apache.org/jira/browse/MAHOUT-1538) will port the existing Hadoop MapReduce implementation to Mahout DSL, allowing for one of several distinct distributed back-ends to conduct the computation)**
+
+## Input
+
+The input format for the algorithm currently takes the form of a Hadoop-backed affinity matrix, in text form. Each line of the text file specifies a single element of the affinity matrix: the row index $i$, the column index $j$, and the value:
+
+`i, j, value`
+
+The affinity matrix is symmetric, and any unspecified $i, j$ pairs are assumed to be 0 for sparsity. The row and column indices are 0-indexed. Thus, only the non-zero entries of either the upper or lower triangle need be specified.
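
A minimal sketch of reading this format into a dense symmetric matrix (a hypothetical helper for illustration only; Mahout itself reads these files into a DistributedRowMatrix):

```python
import numpy as np

def read_affinity(lines, n):
    """Parse 'i, j, value' lines into a dense symmetric n-by-n affinity matrix.

    Unspecified entries default to 0; each entry is mirrored so only one
    triangle of the matrix needs to appear in the input.
    """
    A = np.zeros((n, n))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        i, j, value = line.split(",")
        i, j, value = int(i), int(j), float(value)
        A[i, j] = value
        A[j, i] = value  # enforce symmetry
    return A

# Example: only the upper triangle of a 3-by-3 affinity matrix is listed.
text = ["0, 1, 0.8", "0, 2, 0.5", "1, 2, 0.9"]
A = read_affinity(text, 3)
print(A)
```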
+
+**([MAHOUT-1539](https://issues.apache.org/jira/browse/MAHOUT-1539) will allow for the creation of the affinity matrix to occur as part of the core spectral clustering algorithm, as opposed to the current requirement that the user create this matrix themselves and provide it, rather than the original data, to the algorithm)**
+
+## Running spectral clustering
+
+**([MAHOUT-1540](https://issues.apache.org/jira/browse/MAHOUT-1540) will provide a running example of this algorithm and this section will be updated to show how to run the example and what the expected output should be; until then, this section provides a how-to for simply running the algorithm on arbitrary input)**
+
+Spectral clustering can be invoked with the following arguments.
+
+    bin/mahout spectralkmeans \
+        -i <affinity matrix directory> \
+        -o <output working directory> \
+        -d <number of data points> \
+        -k <number of clusters AND number of top eigenvectors to use> \
+        -x <maximum number of k-means iterations>
+
+The affinity matrix can be contained in a single text file (using the aforementioned one-line-per-entry format) or span many text files (per [MAHOUT-978](https://issues.apache.org/jira/browse/MAHOUT-978), do not prefix text files with a leading underscore '_' or period '.'). The `-d` flag is required for the algorithm to know the dimensions of the affinity matrix. `-k` is the number of top eigenvectors from the normalized graph Laplacian in the SSVD step, and also the number of clusters given to k-means after the SSVD step.
+
+## Example
+
+To provide a simple example, take the following affinity matrix, contained in a text file called `affinity.txt`:
+
+    0, 0, 0
+    0, 1, 0.8
+    0, 2, 0.5
+    1, 0, 0.8
+    1, 1, 0
+    1, 2, 0.9
+    2, 0, 0.5
+    2, 1, 0.9
+    2, 2, 0
+
+With this 3-by-3 matrix, `-d` would be `3`. Furthermore, since all affinity matrices are assumed to be symmetric, the entries `1, 2, 0.9` and `2, 1, 0.9` are redundant; only one of them is needed. Additionally, entries that are 0, such as those along the diagonal, need not be specified at all; they are included here only for completeness.
+
+In general, larger values indicate a stronger "connectedness", whereas smaller values indicate a weaker connectedness. This will vary somewhat depending on the distance function used, though a common one is the [RBF kernel](http://en.wikipedia.org/wiki/RBF_kernel) (used in the above example) which returns values in the range [0, 1], where 0 indicates completely disconnected (or completely dissimilar) and 1 is fully connected (or identical).
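
As a concrete instance, a sketch of the RBF kernel as a pairwise affinity function (`sigma` is a hypothetical scaling parameter, not a Mahout option):

```python
import math

def rbf_affinity(x, y, sigma=1.0):
    """RBF kernel: exp(-||x - y||^2 / (2 * sigma^2)), with values in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_affinity((0, 0), (0, 0)))  # identical points -> 1.0
print(rbf_affinity((0, 0), (3, 4)))  # distant points -> near 0
```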
+
+The call signature with this matrix could be as follows:
+
+    bin/mahout spectralkmeans \
+        -i s3://mahout-example/input/ \
+        -o s3://mahout-example/output/ \
+        -d 3 \
+        -k 2 \
+        -x 10
+
+There are many other optional arguments, in particular for tweaking the SSVD process (block size, number of power iterations, etc.) and the k-means clustering step (distance measure, convergence delta, etc.).
\ No newline at end of file