Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/03 22:36:27 UTC

svn commit: r1538467 [18/20] - in /mahout/site/mahout_cms: ./ cgi-bin/ content/ content/css/ content/developers/ content/general/ content/images/ content/js/ content/users/ content/users/basics/ content/users/classification/ content/users/clustering/ c...

Added: mahout/site/mahout_cms/content/users/clustering/dirichlet-process-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/dirichlet-process-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/dirichlet-process-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/dirichlet-process-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,266 @@
+Title: Dirichlet Process Clustering
+<a name="DirichletProcessClustering-Overview"></a>
+# Overview
+
+The Dirichlet Process Clustering algorithm performs Bayesian mixture
+modeling.
+
+The idea is that we use a probabilistic mixture of a number of models to
+explain some observed data. Each observed data point is assumed
+to have come from one of the models in the mixture, but we don't know
+which.	The way we deal with that is to use a so-called latent parameter
+which specifies which model each data point came from.
+
+In addition, since this is a Bayesian clustering algorithm, we don't want
+to actually commit to any single explanation, but rather to sample from the
+distribution of models and latent assignments of data points to models
+given the observed data and the prior distributions of model parameters.
+This sampling process is initialized by taking models at random from the
+prior distribution for models.
+
+Then, we iteratively assign points to the different models using the
+mixture probabilities and the degree of fit between the point and each
+model expressed as a probability that the point was generated by that
+model. After points are assigned, new parameters for each model are sampled
+from the posterior distribution for the model parameters considering all of
+the observed data points that were assigned to the model.  Models without
+any data points are also sampled, but since they have no points assigned,
+the new samples are effectively taken from the prior distribution for model
+parameters.
+
+The result is a number of samples that represent mixing probabilities,
+models and assignment of points to models. If the total number of possible
+models is substantially larger than the number that ever have points
+assigned to them, then this algorithm provides a (nearly) non-parametric
+clustering algorithm. These samples can give us interesting information
+that is lacking from a normal clustering that consists of a single
+assignment of points to clusters.  Firstly, by examining the number of
+models in each sample that actually has any points assigned to it, we can
+get information about how many models (clusters) the data support.
+Moreover, by examining how often two points are assigned to the same model,
+we can get an approximate measure of how likely these points are to be
+explained by the same model.  Such soft membership information is difficult
+to come by with conventional clustering methods.
+
+Finally, we can get an idea of how stable the description of the data is.
+Typically, regions with lots of data wind up with stable descriptions,
+while at the edges there are phenomena we cannot really commit to a solid
+description, even though it is clear that the well-supported explanations
+are insufficient to explain these additional aspects. One difficulty with
+these samples is that we cannot always establish a correspondence between
+the models in the different samples.  Probably the best way to do this is
+to look for overlap in the assignments of data observations to the
+different models.
+
+<a name="DirichletProcessClustering-DesignofImplementation"></a>
+## Design of Implementation
+
+The implementation accepts one input directory containing the data points
+to be clustered. The data directory contains multiple input files of
+SequenceFile(key, VectorWritable). The input data points are not modified
+by the implementation, allowing experimentation with initial clustering and
+convergence values.
+
+The program iterates over the input points, outputting a new directory
+"clusters-N" containing SequenceFile(Text, DirichletCluster) files for each
+iteration N. This process uses a mapper/reducer/driver as follows:
+
+* DirichletMapper - reads the input clusters during its configure() method,
+then assigns and outputs each input point to a probable cluster as defined
+by the model's pdf() function. Output _key_ is: clusterId. Output _value_
+is: input point.
+* DirichletReducer - reads the input clusters during its configure() method;
+each reducer then receives clusterId:VectorWritable pairs from all mappers
+and accumulates them to produce a new posterior model for each cluster,
+which is output. Output _key_ is: clusterId. Output _value_ is:
+DirichletCluster. Reducer outputs are used as the input clusters for the
+next iteration.
+* DirichletDriver - iterates over the points and clusters until the given
+number of iterations has been reached. During iterations, a new clusters
+directory "clusters-N" is produced, with the output clusters from the
+previous iteration used as input to the next. A final optional pass over
+the data using the DirichletClusterMapper clusters all points to an output
+directory "clusteredPoints" and has no combiner or reducer steps.
+
+<a name="DirichletProcessClustering-RunningDirichletProcessClustering"></a>
+## Running Dirichlet Process Clustering
+
+The Dirichlet clustering algorithm may be run using a command-line
+invocation on DirichletDriver.main or by making a Java call to
+DirichletDriver.runJob(). 
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout dirichlet \
+        -i <input vectors directory> \
+        -o <output working directory> \
+        -a0 <the alpha_0 parameter to the Dirichlet Distribution> \
+        -x <maximum number of iterations> \
+        -k <number of models to create from prior> \
+        -md <the ModelDistribution class name. Default NormalModelDistribution> \
+        -mp <the ModelPrototype class name. Default SequentialAccessSparseVector> \
+        -dm <optional DistanceMeasure class name for some ModelDistribution>
+        -ow <overwrite output directory if present>
+        -cl <run input vector clustering after computing Clusters>
+        -e <emit vectors to most likely cluster during clustering>
+        -t <threshold to use for clustering if -e is false>
+        -xm <execution method: sequential or mapreduce>
+
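+For example, an invocation might look like the following (the paths and
+parameter values here are illustrative assumptions, not recommendations):
+
+    bin/mahout dirichlet \
+        -i testdata \
+        -o output \
+        -a0 1.0 \
+        -x 10 \
+        -k 20 \
+        -ow \
+        -cl \
+        -xm mapreduce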
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. modelFactory: an instance of ModelDistribution which will be used for the
+clustering.
+1. numClusters: the number of models to be used for the clustering. This
+should be larger than the number of clusters which is expected in the data
+set.
+1. maxIterations: the number of iterations to run for the clustering.
+1. alpha_0: a double value (default is 1.0) used for creating the
+DirichletDistribution. Influences the likelihood that new, empty clusters
+will be selected for assignment in the first iteration.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. emitMostLikely: a boolean indicating, if true, that the clustering step
+should only emit the most likely cluster for each clustered point.
+1. threshold: a double indicating, if emitMostLikely is false, the cluster
+probability threshold used for emitting multiple clusters for each point. A
+value of 0 will emit all clusters with their associated probabilities for
+each vector.
+1. runSequential: a boolean indicating, if true, that the clustering is to
+be run using the sequential reference implementation in memory.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, DirichletCluster)
+produced by the algorithm for each iteration. The Text _key_ is a cluster
+identifier string.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster (the Probability
+Density Function (pdf) of the cluster model evaluated at the vector). 
+
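+Once a run completes, a quick way to see what was produced (a minimal
+sketch; directory names follow the conventions above, but the exact
+part-file layout depends on your Hadoop setup) is:
+
+    # list the per-iteration cluster directories and the clustered points
+    hadoop fs -ls output
+    hadoop fs -ls output/clusteredPoints
+
+Mahout also ships a clusterdump utility that can render the cluster
+SequenceFiles in readable form; see its --help output for the options.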
+
+<a name="DirichletProcessClustering-Examples"></a>
+# Examples
+
+The following images illustrate three different prior models applied to a
+set of randomly-generated 2-d data points. The points are generated using a
+normal distribution centered at a mean location and with a constant
+standard deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0] sd=3.0
+* 300 samples m=[1.0, 0.0] sd=0.5
+* 300 samples m=[0.0, 2.0] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. It is, of course, impossible to tell
+which model actually generated each point as there is some probability -
+perhaps small - that any of the models could have generated every point.
+
+!SampleData.png!
+
+In the next image, the Dirichlet Process Clusterer is run against the sample points using a NormalModelDistribution with m=[0.0, 0.0] sd=1.0.
+This distribution represents the least amount of prior
+information, as its sampled models all have constant parameters. The
+resulting significant models (representing > 5% of the population) are
+superimposed upon the sample data. Since all prior models are identical and
+their pdfs are the same, the first iteration's assignment of points to
+models is completely governed by the initial mixture values. Since these
+are also identical, it means the first iteration assigns points to models
+at random. During subsequent iterations, the models diverge from the origin
+but there is some over-fitting in the result.
+
+As Dirichlet clustering is an iterative process, the following illustrations include the cluster information from all iterations. The final cluster values are in bold red and earlier iterations are shown in orange, yellow, green, blue and violet; the rest are all gray.
+These illustrate the cluster convergence process over the last several
+iterations and can be helpful in tuning the algorithm.
+
+!DirichletN.png!
+
+The next image improves upon this situation by using a
+SampledNormalDistribution. In this distribution, the prior models have
+means that are sampled from a normal distribution and all have a constant
+sd=1. This distribution creates initial models that are centered at
+different coordinates. During the first iteration, each model thus has a
+different pdf for each point and the iteration assigns points to the
+more-likely models given this value. The result is a decent capture of the
+sample data parameters but there is still some over-fitting.
+
+!DirichletSN.png!
+
+The above image was run through 20 iterations, and the cluster assignments
+are clearly still moving, indicating that the clustering has not yet converged. The next
+image runs the same model for 40 iterations, producing an accurate model of
+the input data.
+
+!DirichletSN40.png!
+
+The next image uses an AsymmetricSampledNormalDistribution in which the
+model's standard deviation is also represented as a 2-d vector. This causes
+the clusters to assume elliptical shapes in the resulting clustering. This
+represents an incorrect prior assumption but it is interesting that it fits
+the actual sample data quite well. Had we suspected the sample points were
+generated in a similar manner then this distribution would have been the
+most logical model.
+
+!DirichletASN.png!
+
+In order to explore an asymmetrical sample data distribution, the following
+image shows a number of points generated according to the following
+parameters. Again, the generator's 3-sigma ellipses are superimposed:
+
+* 500 samples m=[1.0, 1.0] sd=[3.0, 1.0]
+* 300 samples m=[1.0, 0.0] sd=[0.5, 1.0]
+* 300 samples m=[0.0, 2.0] sd=[0.1, 0.5]
+
+!AsymmetricSampleData.png!
+
+The following image shows the results of applying the symmetrical
+SampledNormalDistribution to the asymmetrically-generated sample data. It
+makes a valiant effort but does not capture a very good set of models
+because the circular model assumption does not fit the data.
+
+!2dDirichletSN.png!
+
+Finally, the AsymmetricSampledNormalDistribution is run against the
+asymmetrical sample data. Though there is some over-fitting, it does a
+credible job of capturing the underlying models. Different arguments
+(numClusters, alpha0, numIterations) and display thresholds will yield
+slightly different results. Compare the first run of numClusters=20 models
+for 20 iterations with another run of numClusters=40 models for 40
+iterations.
+
+!2dDirichletASN.png!
+!2dDirichletASN4040.png!
+
+<a name="DirichletProcessClustering-References"></a>
+# References
+
+McCullagh and Yang:
+http://ba.stat.cmu.edu/journal/2008/vol03/issue01/yang.pdf
+
+There is also a more approachable example in [Chris Bishop's book on Machine Learning](http://research.microsoft.com/en-us/um/people/cmbishop/prml/index.htm).
+I think that chapter 9 is where the example of clustering using a mixture
+model is found.
+
+The Neal and Blei references from the McCullagh and Yang paper are also
+good. Zoubin Ghahramani has some very [nice tutorials which describe why non-parametric Bayesian approaches to problems are very cool](http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf),
+and there are video versions available as well.

Added: mahout/site/mahout_cms/content/users/clustering/expectation-maximization.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/expectation-maximization.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/expectation-maximization.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/expectation-maximization.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,57 @@
+Title: Expectation Maximization
+<a name="ExpectationMaximization-ExpectationMaximization"></a>
+# Expectation Maximization
+
+The principle of EM can be applied to several learning settings, but is
+most commonly associated with clustering. The main principle of the
+algorithm is comparable to k-Means. Yet in contrast to hard cluster
+assignments, each object is given some probability to belong to a cluster.
+Accordingly cluster centers are recomputed based on the average of all
+objects weighted by their probability of belonging to the cluster at hand.
+
+<a name="ExpectationMaximization-Canopy-modifiedEM"></a>
+## Canopy-modified EM
+
+One can also use the canopies idea to speed up prototype-based clustering
+methods like K-means and Expectation-Maximization (EM). In general, neither
+K-means nor EM specifies how many clusters to use. The canopies technique
+does not help with this choice.
+
+Prototypes (our estimates of the cluster centroids) are associated with the
+canopies that contain them, and the prototypes are only influenced by data
+that are inside their associated canopies. After creating the canopies, we
+decide how many prototypes will be created for each canopy. This could be
+done, for example, using the number of data points in a canopy and AIC or
+BIC where points that occur in more than one canopy are counted
+fractionally. Then we place prototypes into each canopy. This initial
+placement can be random, as long as it is within the canopy in question, as
+determined by the inexpensive distance metric.
+
+Then, instead of calculating the distance from each prototype to every
+point (as is traditional, an O(nk) operation), the E-step instead calculates
+the distance from each prototype to a much smaller number of points. For
+each prototype, we find the canopies that contain it (using the cheap
+distance metric), and only calculate distances (using the expensive
+distance metric) from that prototype to points within those canopies.
+
+Note that by this procedure prototypes may move across canopy boundaries
+when canopies overlap. Prototypes may move to cover the data in the
+overlapping region, and then move entirely into another canopy in order to
+cover data there.
+
+The canopy-modified EM algorithm behaves very similarly to traditional EM,
+with the slight difference that points outside the canopy have no influence
+on points in the canopy, rather than a minute influence. If the canopy
+property holds, and points in the same cluster fall in the same canopy,
+then the canopy-modified EM will almost always converge to the same maximum
+in likelihood as the traditional EM. In fact, the difference in each
+iterative step (apart from the enormous computational savings of computing
+fewer terms) will be negligible since points outside the canopy will have
+exponentially small influence.
+
+<a name="ExpectationMaximization-StrategyforParallelization"></a>
+## Strategy for Parallelization
+
+<a name="ExpectationMaximization-Map/ReduceImplementation"></a>
+## Map/Reduce Implementation
+

Added: mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,104 @@
+Title: fuzzy-k-means-commandline
+<a name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a>
+# Running Fuzzy k-Means Clustering from the Command Line
+Mahout's Fuzzy k-Means clustering can be launched from the same command
+line invocation whether you are running on a single machine in stand-alone
+mode or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run FuzzyK on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout fkmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+
+<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="fuzzy-k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                         Path to job input directory.
+                                                 Must be a SequenceFile of
+                                                 VectorWritable
+      --clusters (-c) clusters                   The input centroids, as Vectors.
+                                                 Must be a SequenceFile of
+                                                 Writable, Cluster/Canopy. If k
+                                                 is also specified, then a random
+                                                 set of vectors will be selected
+                                                 and written out to this path
+                                                 first
+      --output (-o) output                       The directory pathname for
+                                                 output.
+      --distanceMeasure (-dm) distanceMeasure    The classname of the
+                                                 DistanceMeasure. Default is
+                                                 SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta  The convergence delta value.
+                                                 Default is 0.5
+      --maxIter (-x) maxIter                     The maximum number of
+                                                 iterations.
+      --k (-k) k                                 The k in k-Means. If specified,
+                                                 then a random selection of k
+                                                 Vectors will be chosen as the
+                                                 Centroid and written to the
+                                                 clusters input path.
+      --m (-m) m                                 coefficient normalization
+                                                 factor, must be greater than 1
+      --overwrite (-ow)                          If present, overwrite the output
+                                                 directory before running job
+      --help (-h)                                Print out help
+      --numMap (-u) numMap                       The number of map tasks.
+                                                 Defaults to 10
+      --maxRed (-r) maxRed                       The number of reduce tasks.
+                                                 Defaults to 2
+      --emitMostLikely (-e) emitMostLikely       True if clustering should emit
+                                                 the most likely point only,
+                                                 false for threshold clustering.
+                                                 Default is true
+      --threshold (-t) threshold                 The pdf threshold used for
+                                                 cluster determination.
+                                                 Default is 0
+      --clustering (-cl)                         If present, run clustering after
+                                                 the iterations have taken place
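+
+For example, a complete invocation might look like this (the paths,
+distance measure and parameter values are illustrative assumptions only):
+
+    ./bin/mahout fkmeans \
+        -i testdata \
+        -c initial-clusters \
+        -o output \
+        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
+        -m 2.0 -k 10 -x 10 -cd 0.5 \
+        -ow -cl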
+

Added: mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/fuzzy-k-means.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,180 @@
+Title: Fuzzy K-Means
+Fuzzy K-Means (also called Fuzzy C-Means) is an extension of [K-Means](http://cwiki.apache.org/MAHOUT/k-means.html)
+, the popular simple clustering technique. While K-Means discovers hard
+clusters (a point belongs to only one cluster), Fuzzy K-Means is a more
+statistically formalized method and discovers soft clusters where a
+particular point can belong to more than one cluster with a certain
+probability.
+
+<a name="FuzzyK-Means-Algorithm"></a>
+#### Algorithm
+
+Like K-Means, Fuzzy K-Means works on objects that can be represented in an
+n-dimensional vector space and for which a distance measure is defined.
+The algorithm is similar to k-means:
+* Initialize k clusters
+* Until converged:
+    * Compute the probability of a point belonging to a cluster for every
+<point,cluster> pair
+    * Recompute the cluster centers using the above probability membership
+values of points to clusters (the standard update formulas are sketched
+below)
+
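+For orientation, the standard (textbook) Fuzzy K-Means updates, which are
+not quoted from the Mahout source, are as follows. With fuzziness m > 1,
+the membership of point x_i in cluster j with center c_j is
+
+    u_{ij} = \frac{1}{\sum_{k=1}^{K}\left(\frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert}\right)^{2/(m-1)}}
+
+and the cluster centers are recomputed as the membership-weighted means
+
+    c_j = \frac{\sum_{i} u_{ij}^{\,m}\, x_i}{\sum_{i} u_{ij}^{\,m}}
+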
+<a name="FuzzyK-Means-DesignImplementation"></a>
+#### Design Implementation
+
+The design is similar to the K-Means implementation in Mahout. It accepts an
+input file containing vector points. The user can either provide the cluster
+centers as input or allow the canopy algorithm to run and create the initial
+clusters.
+
+Similar to K-Means, the program doesn't modify the input directories. For
+every iteration, the cluster output is stored in a directory clusters-N.
+The code sets the number of reduce tasks equal to the number of map tasks,
+so that many part-0* files are created in the clusters-N directory. The code
+uses a driver/mapper/combiner/reducer as follows:
+
+FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over input
+points and cluster points for the specified number of iterations or until
+convergence. During every iteration i, a new clusters-i directory is created
+which contains the modified cluster centers obtained during that FuzzyKMeans
+iteration. These are fed in as the input clusters for the next iteration.
+Once Fuzzy KMeans has run for the specified number of iterations or has
+converged, a map task is run to output "the point and the cluster membership
+to each cluster" pairs as final output to a directory named "points".
+
+FuzzyKMeansMapper - reads the input clusters during its configure() method,
+then computes the cluster membership probability of a point to each
+cluster. Cluster membership is inversely proportional to the distance.
+Distance is computed using the user-supplied distance measure. Output key
+is the encoded clusterId. Output values are ClusterObservations containing
+observation statistics.
+
+FuzzyKMeansCombiner - receives all key:value pairs from the mapper and
+produces partial sums of the cluster membership probability times input
+vectors for each cluster. Output key is: encoded cluster identifier. Output
+values are ClusterObservations containing observation statistics.
+
+FuzzyKMeansReducer - Multiple reducers receive certain keys and all values
+associated with those keys. The reducer sums the values to produce a new
+centroid for the cluster, which is output. Output key is: encoded cluster
+identifier (e.g. "C14"). Output value is: formatted cluster identifier (e.g.
+"C14"). The reducer encodes unconverged clusters with a 'Cn' cluster Id and
+converged clusters with a 'Vn' clusterId.
+
+<a name="FuzzyK-Means-RunningFuzzyk-MeansClustering"></a>
+## Running Fuzzy k-Means Clustering
+
+The Fuzzy k-Means clustering algorithm may be run using a command-line
+invocation on FuzzyKMeansDriver.main or by making a Java call to
+FuzzyKMeansDriver.run(). 
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout fkmeans \
+        -i <input vectors directory> \
+        -c <input clusters directory> \
+        -o <output working directory> \
+        -dm <DistanceMeasure> \
+        -m <fuzziness argument >1> \
+        -x <maximum number of iterations> \
+        -k <optional number of initial clusters to sample from input vectors> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present>
+        -cl <run input vector clustering after computing Clusters>
+        -e <emit vectors to most likely cluster during clustering>
+        -t <threshold to use for clustering if -e is false>
+        -xm <execution method: sequential or mapreduce>
+
+
+*Note:* if the -k argument is supplied, any clusters in the -c directory
+will be overwritten and -k random points will be sampled from the input
+vectors to become the initial cluster centers.
+
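+For instance, to emit every cluster whose membership probability exceeds a
+threshold rather than only the most likely one, an invocation could look
+like this (paths and values are illustrative assumptions):
+
+    bin/mahout fkmeans \
+        -i testdata \
+        -c initial-clusters \
+        -o output \
+        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
+        -m 2.0 -k 10 -x 10 -ow -cl \
+        -e false -t 0.1
+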
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. clustersIn: a file path string to a directory containing the initial
+clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy
+k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for
+the initial clusters.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.
+1. convergence: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. max-iterations: the maximum number of iterations to run, independent of
+the convergence specified
+1. m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is
+equivalent to normalising the coefficient linearly to make their sum 1.
+When m is close to 1, then the cluster center closest to the point is given
+much more weight than the others, and the algorithm is similar to k-means.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. emitMostLikely: a boolean indicating, if true, that the clustering step
+should only emit the most likely cluster for each clustered point.
+1. threshold: a double indicating, if emitMostLikely is false, the cluster
+probability threshold used for emitting multiple clusters for each point. A
+value of 0 will emit all clusters with their associated probabilities for
+each vector.
+1. runSequential: a boolean indicating, if true, that the algorithm is to
+use the sequential reference implementation running in memory.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, SoftCluster)
+produced by the algorithm for each iteration. The Text _key_ is a cluster
+identifier string.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weights are
+computed as 1/(1+distance) where the distance is between the cluster center
+and the vector using the chosen DistanceMeasure. 
+
+<a name="FuzzyK-Means-Examples"></a>
+# Examples
+
+The following images illustrate Fuzzy k-Means clustering applied to a set
+of randomly-generated 2-d data points. The points are generated using a
+normal distribution centered at a mean location and with a constant
+standard deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0] sd=3.0
+* 300 samples m=[1.0, 0.0] sd=0.5
+* 300 samples m=[0.0, 2.0] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. 
+
+!SampleData.png!
+
+In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray.
+Although it misses a lot of the points and cannot capture the original,
+superimposed cluster centers, it does a decent job of clustering this data.
+
+!FuzzyKMeans.png!
+
+The third image shows the results of running Fuzzy k-Means on a different
+data set (see [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ for details) which is generated using asymmetrical standard deviations.
+Fuzzy k-Means does a fair job handling this data set as well.
+
+!2dFuzzyKMeans.png!
+
+<a name="FuzzyK-Means-References&nbsp;"></a>
+#### References&nbsp;
+
+* [http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering](http://en.wikipedia.org/wiki/Data_clustering#Fuzzy_c-means_clustering)

Added: mahout/site/mahout_cms/content/users/clustering/hierarchical-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/hierarchical-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/hierarchical-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/hierarchical-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,10 @@
+Title: Hierarchical Clustering
+Hierarchical clustering is the process of finding bigger clusters, and also
+the smaller clusters inside the bigger clusters.
+
+In Apache Mahout, separate algorithms can be used for finding clusters at
+different levels. 
+
+See [Top Down Clustering](https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering)
+.
+

Added: mahout/site/mahout_cms/content/users/clustering/k-means-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/k-means-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/k-means-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/k-means-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,217 @@
+Title: K-Means Clustering
+k-Means is a rather simple but well known algorithm for grouping objects
+(clustering). Again, all objects need to be represented as a set of
+numerical features. In addition, the user has to specify the number of
+groups (referred to as _k_) he wishes to identify.
+
+Each object can be thought of as being represented by some feature vector
+in an _n_ dimensional space, _n_ being the number of all features used to
+describe the objects to cluster. The algorithm then randomly chooses _k_
+points in that vector space; these points serve as the initial centers of
+the clusters. Afterwards, all objects are each assigned to the center they
+are closest to. Usually the distance measure is chosen by the user and
+determined by the learning task.
+
+After that, for each cluster a new center is computed by averaging the
+feature vectors of all objects assigned to it. The process of assigning
+objects and recomputing centers is repeated until the process converges.
+The algorithm can be proven to converge after a finite number of
+iterations.
+
+Several tweaks concerning distance measure, initial center choice and
+computation of new average centers have been explored, as well as the
+estimation of the number of clusters _k_. Yet the main principle always
+remains the same.
+
+
+
+<a name="K-MeansClustering-Quickstart"></a>
+## Quickstart
+
+[Here](k-means-clustering^quickstart-kmeans.sh.html)
+ is a short shell script outline that will get you started quickly with
+k-Means. This does the following:
+
+* Get the Reuters dataset
+* Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
+reuters-out from reuters-sgm (the downloaded archive)
+* Run seqdirectory to convert reuters-out to SequenceFile format
+* Run seq2sparse to convert SequenceFiles to sparse vector format
+* Finally, run kMeans with 20 clusters.
+
+After following through the output that scrolls past, reading the code will
+offer you a better understanding.
+
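+A rough sketch of those steps as shell commands follows (the dataset URL,
+paths and jar locations are assumptions for illustration; the referenced
+quickstart script is the authoritative version):
+
+    # fetch and unpack the Reuters-21578 corpus (URL assumed)
+    wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
+    mkdir -p reuters-sgm && tar -xzf reuters21578.tar.gz -C reuters-sgm
+
+    # extract plain-text documents from the SGML files
+    java -cp <lucene-benchmark jar> \
+        org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters-out
+
+    # convert the text files to SequenceFiles, then to sparse vectors
+    bin/mahout seqdirectory -i reuters-out -o reuters-seqdir
+    bin/mahout seq2sparse -i reuters-seqdir -o reuters-vectors
+
+    # finally, run k-Means with 20 clusters
+    bin/mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-initial-clusters \
+        -o reuters-kmeans -k 20 -x 10 -ow -cl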
+
+<a name="K-MeansClustering-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+Some ideas can be found in the [Cluster computing and MapReduce](http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html)
+ lecture video series (by Google); k-Means clustering is discussed in
+[lecture #4](http://www.youtube.com/watch?v=1ZDybXl212Q).
+Slides can be found [here](http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt).
+
+Interestingly, a Hadoop-based implementation using [Canopy clustering](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm)
+ seems to be available at [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/)
+ (GPL 3 license).
+
+Here's another useful paper [http://www2.chass.ncsu.edu/garson/PA765/cluster.htm](http://www2.chass.ncsu.edu/garson/PA765/cluster.htm)
+.
+
+<a name="K-MeansClustering-Designofimplementation"></a>
+## Design of implementation
+
+The implementation accepts two input directories: one for the data points
+and one for the initial clusters. The data directory contains multiple
+input files of SequenceFile(key, VectorWritable), while the clusters
+directory contains one or more SequenceFiles(Text, Cluster \| Canopy)
+containing _k_ initial clusters or canopies. None of the input directories
+are modified by the implementation, allowing experimentation with initial
+clustering and convergence values.
+
+The program iterates over the input points and clusters, outputting a new
+directory "clusters-N" containing SequenceFile(Text, Cluster) files for
+each iteration N. This process uses a mapper/combiner/reducer/driver as
+follows:
+* KMeansMapper - reads the input clusters during its setup() method, then
+assigns and outputs each input point to its nearest cluster as defined by
+the user-supplied distance measure. Output key is: cluster identifier.
+Output value is: ClusterObservation.
+* KMeansCombiner - receives all key:value pairs from the mapper and
+produces partial sums of the input vectors for each cluster. Output key is:
+cluster identifier. Output value is ClusterObservation.
+* KMeansReducer - a single reducer receives all key:value pairs from all
+combiners and sums them to produce a new centroid for the cluster which is
+output. Output key is: encoded cluster identifier. Output value is:
+Cluster. The reducer encodes unconverged clusters with a 'Cn' cluster Id
+and converged clusters with 'Vn' clusterId.
+* KMeansDriver - iterates over the points and clusters until all output
+clusters have converged (Vn clusterIds) or until a maximum number of
+iterations has been reached. During iterations, a new clusters directory
+"clusters-N" is produced with the output clusters from the previous
+iteration used for input to the next. A final optional pass over the data
+using the KMeansClusterMapper clusters all points to an output directory
+"clusteredPoints" and has no combiner or reducer steps.
+
+Canopy clustering can be used to compute the initial clusters for k-Means:
+
+    // run the CanopyDriver job
+    CanopyDriver.runJob("testdata", "output",
+        ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);
+
+    // now run the KMeansDriver job
+    KMeansDriver.runJob("testdata", "output/clusters-0", "output",
+        EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
+
+
+In the above example, the input data points are stored in 'testdata' and
+the CanopyDriver is configured to output to the 'output/clusters-0'
+directory. Once the driver executes, that directory will contain the canopy
+definition files. Upon running the KMeansDriver, the output directory will
+have two or more new directories: 'clusters-N' containing the clusters for
+each iteration, and 'clusteredPoints' containing the clustered data points.
+
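+The same pipeline can be run from the command line; a rough equivalent (the
+canopy thresholds, distance measures and paths below mirror the Java
+snippet but are illustrative) would be:
+
+    bin/mahout canopy -i testdata -o output \
+        -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure \
+        -t1 3.1 -t2 2.1
+    bin/mahout kmeans -i testdata -c output/clusters-0 -o output \
+        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
+        -cd 0.001 -x 10 -cl
+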
+Two diagrams on the original wiki page illustrated this: "Example
+implementation of k-Means provided with Mahout", showing the exemplary
+dataflow of the k-Means example implementation, and "k-Means Example",
+which does not consider Canopy clustering.
+
+<a name="K-MeansClustering-Runningk-MeansClustering"></a>
+## Running k-Means Clustering
+
+The k-Means clustering algorithm may be run using a command-line invocation
+on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout kmeans \
+        -i <input vectors directory> \
+        -c <input clusters directory> \
+        -o <output working directory> \
+        -k <optional number of initial clusters to sample from input vectors> \
+        -dm <DistanceMeasure> \
+        -x <maximum number of iterations> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present>
+        -cl <run input vector clustering after computing Canopies>
+        -xm <execution method: sequential or mapreduce>
+
+
+Note: if the -k argument is supplied, any clusters in the -c directory
+will be overwritten and -k random points will be sampled from the input
+vectors to become the initial cluster centers.
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. clusters: a file path string to a directory containing the initial
+clusters, a SequenceFile(key, Cluster \| Canopy). Both KMeans clusters and
+Canopy canopies may be used for the initial clusters.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. distanceMeasure: the fully-qualified class name of an instance of
+DistanceMeasure which will be used for the clustering.
+1. convergenceDelta: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. maxIter: the maximum number of iterations to run, independent of the
+convergence specified
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the k-means sequential
+implementation is to be used to process the input data.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, Cluster) produced
+by the algorithm for each iteration. The Text _key_ is a cluster identifier
+string.
+1. clusteredPoints: (if \--clustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster. For k-Means
+clustering, the weights are computed as 1/(1+distance) where the distance
+is between the cluster center and the vector using the chosen
+DistanceMeasure.
+
+<a name="K-MeansClustering-Examples"></a>
+# Examples
+
+The following images illustrate k-Means clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0] sd=3.0
+* 300 samples m=[1.0, 0.0] sd=0.5
+* 300 samples m=[0.0, 2.0] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed.
+
+!SampleData.png!
+
+In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray.
+Although it misses a lot of the points and cannot capture the original,
+superimposed cluster centers, it does a decent job of clustering this data.
+
+!KMeans.png!
+
+The third image shows the results of running k-Means on a different data
+set (see [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ for details) which is generated using asymmetrical standard deviations.
+K-Means does a fair job handling this data set as well.
+
+!2dKMeans.png!

Added: mahout/site/mahout_cms/content/users/clustering/k-means-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/k-means-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/k-means-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/k-means-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,99 @@
+Title: k-means-commandline
+<a name="k-means-commandline-Introduction"></a>
+# Introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run k-Means on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                         Path to job input directory.
+                                                 Must be a SequenceFile of
+                                                 VectorWritable
+      --clusters (-c) clusters                   The input centroids, as Vectors.
+                                                 Must be a SequenceFile of
+                                                 Writable, Cluster/Canopy. If k
+                                                 is also specified, then a random
+                                                 set of vectors will be selected
+                                                 and written out to this path
+                                                 first
+      --output (-o) output                       The directory pathname for
+                                                 output.
+      --distanceMeasure (-dm) distanceMeasure    The classname of the
+                                                 DistanceMeasure. Default is
+                                                 SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta  The convergence delta value.
+                                                 Default is 0.5
+      --maxIter (-x) maxIter                     The maximum number of
+                                                 iterations.
+      --maxRed (-r) maxRed                       The number of reduce tasks.
+                                                 Defaults to 2
+      --k (-k) k                                 The k in k-Means. If specified,
+                                                 then a random selection of k
+                                                 Vectors will be chosen as the
+                                                 Centroid and written to the
+                                                 clusters input path.
+      --overwrite (-ow)                          If present, overwrite the output
+                                                 directory before running job
+      --help (-h)                                Print out help
+      --clustering (-cl)                         If present, run clustering after
+                                                 the iterations have taken place
+

Added: mahout/site/mahout_cms/content/users/clustering/latent-dirichlet-allocation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/latent-dirichlet-allocation.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/latent-dirichlet-allocation.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/latent-dirichlet-allocation.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,149 @@
+Title: Latent Dirichlet Allocation
+<a name="LatentDirichletAllocation-Overview"></a>
+# Overview
+
+Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
+algorithm for automatically and jointly clustering words into "topics" and
+documents into mixtures of topics. It has been successfully applied to
+model change in scientific fields over time (Griffiths and Steyvers, 2004;
+Hall, et al. 2008). 
+
+A topic model is, roughly, a hierarchical Bayesian model that associates
+with each document a probability distribution over "topics", which are in
+turn distributions over words. For instance, a topic in a collection of
+newswire might include words about "sports", such as "baseball", "home
+run", "player", and a document about steroid use in baseball might include
+"sports", "drugs", and "politics". Note that the labels "sports", "drugs",
+and "politics", are post-hoc labels assigned by a human, and that the
+algorithm itself only assigns associate words with probabilities. The task
+of parameter estimation in these models is to learn both what the topics
+are, and which documents employ them in what proportions.
+
+Another way to view a topic model is as a generalization of a mixture model
+like [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+. Starting from a normal mixture model, in which we have a single global
+mixture of several distributions, we instead say that _each_ document has
+its own mixture distribution over the globally shared mixture components.
+Operationally in Dirichlet Process Clustering, each document has its own
+latent variable drawn from a global mixture that specifies which model it
+belongs to, while in LDA each word in each document has its own parameter
+drawn from a document-wide mixture.
+
+The idea is that we use a probabilistic mixture of a number of models to
+explain some observed data. Each observed data point is assumed
+to have come from one of the models in the mixture, but we don't know
+which.	The way we deal with that is to use a so-called latent parameter
+which specifies which model each data point came from.
+
+<a name="LatentDirichletAllocation-CollapsedVariationalBayes"></a>
+# Collapsed Variational Bayes
+The CVB algorithm, which is implemented in Mahout for LDA, combines
+advantages of both regular Variational Bayes and Gibbs Sampling.  The
+algorithm relies on modeling the dependence of parameters on latent
+variables, which are in turn mutually independent.  The algorithm uses two
+methodologies: one marginalizes out the parameters when calculating the
+joint distribution, and the other models the posterior of theta and phi
+given the inputs z and x.
+
+A common solution to the CVB algorithm is to compute each expectation term
+by using a simple Gaussian approximation, which is accurate and requires low
+computational overhead.  The specifics behind the approximation involve
+computing the sum of the means and variances of the individual Bernoulli
+variables.
+
+CVB with Gaussian approximation is implemented by tracking the mean and
+variance and subtracting the mean and variance of the corresponding
+Bernoulli variables.  The computational cost for the algorithm scales on
+the order of O(K) with each update to q(z(i,j)).  Also for each
+document/word pair only 1 copy of the variational posterior is required
+over the latent variable.
+
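+For orientation, the zeroth-order (CVB0) form of this update, a standard
+textbook formulation rather than a quotation from the Mahout source, is
+
+    q(z_{ij} = k) \propto
+        \frac{(n_{w_{ij},k}^{\neg ij} + \eta)\,(n_{j,k}^{\neg ij} + \alpha)}
+             {n_{k}^{\neg ij} + V\eta}
+
+where w_{ij} is the word at position i of document j, the n terms count
+current topic assignments (per word/topic, per document/topic and per
+topic) excluding the current token, V is the vocabulary size, and alpha and
+eta are the document/topic and topic/term smoothing parameters.
+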
+<a name="LatentDirichletAllocation-InvocationandUsage"></a>
+# Invocation and Usage
+
+Mahout's implementation of LDA operates on a collection of SparseVectors of
+word counts. These word counts should be non-negative integers, though
+things will probably work fine if you use non-negative reals. (Note
+that the probabilistic model doesn't make sense if you do!) To create these
+vectors, it's recommended that you follow the instructions in [Creating Vectors From Text](creating-vectors-from-text.html)
+, making sure to use TF and not TFIDF as the scorer.
+
+Invocation takes the form:
+
+
+    bin/mahout cvb \
+        -i <input path for document vectors> \
+        -dict <path to term-dictionary file(s) , glob expression supported> \
+        -o <output path for topic-term distributions>
+        -dt <output path for doc-topic distributions> \
+        -k <number of latent topics> \
+        -nt <number of unique features defined by input document vectors> \
+        -mt <path to store model state after each iteration> \
+        -maxIter <max number of iterations> \
+        -mipd <max number of iterations per doc for learning> \
+        -a <smoothing for doc topic distributions> \
+        -e <smoothing for term topic distributions> \
+        -seed <random seed> \
+        -tf <fraction of data to hold for testing> \
+        -block <number of iterations per perplexity check, ignored unless test_set_percentage>0>
+
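+For example, a run over vectors produced with TF weighting might look like
+this (all paths and parameter values are illustrative assumptions):
+
+    bin/mahout cvb \
+        -i reuters-vectors/tf-vectors \
+        -dict reuters-vectors/dictionary.file-* \
+        -o lda-topics \
+        -dt lda-doc-topics \
+        -k 20 \
+        -nt 50000 \
+        -mt lda-model \
+        -maxIter 20 \
+        -a 2.5 -e 0.1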
+
+Topic smoothing should generally be about 50/K, where K is the number of
+topics. The number of words in the vocabulary can be an upper bound, though
+it shouldn't be too high (for memory concerns). 
+
+Choosing the number of topics is more art than science, and it's
+recommended that you try several values.
+
+After running LDA you can obtain an output of the computed topics using the
+LDAPrintTopics utility:
+
+
+    bin/mahout ldatopics \
+        -i <input vectors directory> \
+        -d <input dictionary file> \
+        -w <optional number of words to print> \
+        -o <optional output working directory. Default is to console> \
+        -h <print out help> \
+        -dt <optional dictionary type (text|sequencefile). Default is text>
+
+
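+For instance, reusing the hypothetical paths from the cvb example above
+(the word count and dictionary type are likewise illustrative):
+
+    bin/mahout ldatopics \
+        -i lda-topics \
+        -d reuters-vectors/dictionary.file-0 \
+        -w 10 \
+        -dt sequencefile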
+
+<a name="LatentDirichletAllocation-Example"></a>
+# Example
+
+An example is located in mahout/examples/bin/build-reuters.sh. The script
+automatically downloads the Reuters-21578 corpus, builds a Lucene index and
+converts the Lucene index to vectors. By uncommenting the last two lines
+you can then cause it to run LDA on the vectors and finally print the
+resultant topics to the console. 
+
+To adapt the example yourself, you should note that Lucene has specialized
+support for Reuters, and that building your own index will require some
+adaptation. The rest should hopefully not differ too much.
+
+<a name="LatentDirichletAllocation-ParameterEstimation"></a>
+# Parameter Estimation
+
+We use mean field variational inference to estimate the models. Variational
+inference can be thought of as a generalization of [EM](expectation-maximization.html)
+ for hierarchical Bayesian models. The E-Step takes the form of, for each
+document, inferring the posterior probability of each topic for each word
+in each document. We then take the sufficient statistics and emit them in
+the form of (log) pseudo-counts for each word in each topic. The M-Step is
+simply to sum these together and (log) normalize them so that we have a
+distribution over the entire vocabulary of the corpus for each topic. 
+
+In implementation, the E-Step is implemented in the Map, and the M-Step is
+executed in the reduce step, with the final normalization happening as a
+post-processing step.
+
+<a name="LatentDirichletAllocation-References"></a>
+# References
+
+[David M. Blei, Andrew Y. Ng, Michael I. Jordan, John Lafferty. 2003. Latent Dirichlet Allocation. JMLR.](http://www.cs.princeton.edu/~blei/papers/bleingjordan2003.pdf)
+
+[Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS.  ](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)
+
+[David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models](http://www.aclweb.org/anthology-new/d/d08/d08-1038.pdf)

Added: mahout/site/mahout_cms/content/users/clustering/lda-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/lda-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/lda-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/lda-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,97 @@
+Title: lda-commandline
+<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation (algorithm) from the Command Line
+[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897)
+ LDA has been implemented as Collapsed Variational Bayes (cvb). 
+
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run the LDA
+algorithm on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+
+    ./bin/mahout cvb <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
+# Command line options from Mahout cvb version 0.8
+
+    mahout cvb -h
+      --input (-i) input                                     Path to job input directory.
+      --output (-o) output                                   The directory pathname for output.
+      --maxIter (-x) maxIter                                 The maximum number of iterations.
+      --convergenceDelta (-cd) convergenceDelta              The convergence delta value
+      --overwrite (-ow)                                      If present, overwrite the output directory before running job
+      --num_topics (-k) num_topics                           Number of topics to learn
+      --num_terms (-nt) num_terms                            Vocabulary size
+      --doc_topic_smoothing (-a) doc_topic_smoothing         Smoothing for document/topic distribution
+      --term_topic_smoothing (-e) term_topic_smoothing       Smoothing for topic/term distribution
+      --dictionary (-dict) dictionary                        Path to term-dictionary file(s) (glob expression supported)
+      --doc_topic_output (-dt) doc_topic_output              Output path for the training doc/topic distribution
+      --topic_model_temp_dir (-mt) topic_model_temp_dir      Path to intermediate model path (useful for restarting)
+      --iteration_block_size (-block) iteration_block_size   Number of iterations per perplexity check
+      --random_seed (-seed) random_seed                      Random seed
+      --test_set_fraction (-tf) test_set_fraction            Fraction of data to hold out for testing
+      --num_train_threads (-ntt) num_train_threads           number of threads per mapper to train with
+      --num_update_threads (-nut) num_update_threads         number of threads per mapper to update the model with
+      --max_doc_topic_iters (-mipd) max_doc_topic_iters      max number of iterations per doc for p(topic|doc) learning
+      --num_reduce_tasks num_reduce_tasks                    number of reducers to use during model estimation
+      --backfill_perplexity                                  enable backfilling of missing perplexity values
+      --help (-h)                                            Print out help
+      --tempDir tempDir                                      Intermediate output directory
+      --startPhase startPhase                                First phase to run
+      --endPhase endPhase                                    Last phase to run

Added: mahout/site/mahout_cms/content/users/clustering/llr---log-likelihood-ratio.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/llr---log-likelihood-ratio.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/llr---log-likelihood-ratio.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/llr---log-likelihood-ratio.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,37 @@
+Title: LLR - Log-likelihood Ratio
+The likelihood ratio test is used to compare the fit of two models, one of
+which is nested within the other.
+
+In the context of machine learning and the Mahout project in particular,
+the term LLR is usually meant to refer to a test of significance for two
+binomial distributions, also known as the G squared statistic.	This is a
+special case of the multinomial test and is closely related to mutual
+information.  The value of this statistic is not normally used in this
+context as a true frequentist test of significance since there would be
+obvious and dreadful problems to do with multiple comparisons, but rather
+as a heuristic score to order pairs of items with the most interestingly
+connected items having higher scores.  In this usage, the LLR has proven
+very useful for discriminating pairs of features that have interesting
+degrees of cooccurrence and those that do not with usefully small false
+positive and false negative rates.  The LLR is typically far more suitable
+in the case of small counts than many other measures such as Pearson's
+correlation, Pearson's chi squared statistic or z statistics.  The LLR as
+stated does not, however, make any use of rating data which can limit its
+applicability in problems such as the Netflix competition. 
+
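+As a concrete illustration, the sketch below computes the G squared
+statistic for a 2x2 cooccurrence table (k11 = count of A and B together,
+k12 = A without B, k21 = B without A, k22 = neither) using the standard
+entropy formulation. This is a self-contained sketch, not Mahout's
+implementation; the class name and example counts are purely illustrative.
+
+    public class LlrSketch {
+      // Unnormalized entropy term: x * ln(x), with 0 * ln(0) taken as 0.
+      private static double xLogX(long x) {
+        return x == 0 ? 0.0 : x * Math.log(x);
+      }
+
+      // entropy(a, b, ...) = xLogX(sum of elements) - sum of xLogX(each element)
+      private static double entropy(long... counts) {
+        long sum = 0;
+        double elementTerms = 0.0;
+        for (long count : counts) {
+          sum += count;
+          elementTerms += xLogX(count);
+        }
+        return xLogX(sum) - elementTerms;
+      }
+
+      // G squared = 2 * (rowEntropy + columnEntropy - matrixEntropy)
+      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
+        double rowEntropy = entropy(k11 + k12, k21 + k22);
+        double columnEntropy = entropy(k11 + k21, k12 + k22);
+        double matrixEntropy = entropy(k11, k12, k21, k22);
+        // Guard against tiny negative values caused by floating-point rounding.
+        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
+      }
+
+      public static void main(String[] args) {
+        // Two items seen together 10 times, separately 5 and 20 times,
+        // with 10000 observations containing neither.
+        System.out.println(logLikelihoodRatio(10, 5, 20, 10000));
+      }
+    }
+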
+The actual value of the LLR is not usually very helpful other than as a way
+of ordering pairs of items.  As such, it is often used to determine a
+sparse set of coefficients to be estimated by other means such as TF-IDF. 
+Since the actual estimation of these coefficients can be done in a way that
+is independent of the training data such as by general corpus statistics,
+and since the ordering imposed by the LLR is relatively robust to counting
+fluctuation, this technique can provide very strong results in very sparse
+problems where the potential number of features vastly out-numbers the
+number of training examples and where features are highly interdependent.
+
+ See Also: 
+ * http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
+ * http://en.wikipedia.org/wiki/G-test
+ * http://en.wikipedia.org/wiki/Likelihood-ratio_test
+
+      
\ No newline at end of file

Added: mahout/site/mahout_cms/content/users/clustering/mean-shift-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/mean-shift-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/mean-shift-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/mean-shift-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,160 @@
+Title: Mean Shift Clustering
+"Mean Shift: A Robust Approach to Feature Space Analysis"
+(http://www.caip.rutgers.edu/riul/research/papers/pdf/mnshft.pdf)
+introduces the geneology of the mean shift custering procedure which dates
+back to work in pattern recognition in 1975. The paper contains a detailed
+derivation and several examples of the use of mean shift for image smooting
+and segmentation. "Mean Shift Clustering"
+(http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf)
+presents an overview of the algorithm with a summary of the derivation. An
+attractive feature of mean shift clustering is that it does not require
+a-priori knowledge of the number of clusters (as required in k-means) and
+it will produce arbitrarily-shaped clusters that depend upon the topology
+of the data (unlike canopy).
+
+The algorithm begins with a set of datapoints, and creates a fixed-radius
+window for each. It then iterates over each window, calculating a mean
+shift vector which points in the direction of the maximum increase in the
+local density function. Then each window is migrated to a new position via
+the vector and the iteration resumes. Iterations complete when each window
+has reached a local maximum in the density function and the vector becomes
+negligible.
+
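+Below is a minimal, self-contained sketch of that iteration for
+one-dimensional points using a flat kernel (every point inside the window
+radius contributes equally to the mean). It is illustrative only and is
+much simpler than Mahout's canopy-based implementation described next.
+
+    import java.util.Arrays;
+
+    public class MeanShiftSketch {
+      public static void main(String[] args) {
+        double[] points = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9};  // two obvious clusters
+        double[] windows = points.clone();                 // one window per point
+        double radius = 1.0;        // fixed window radius
+        double delta = 1e-4;        // termination threshold for the shift magnitude
+
+        boolean moved = true;
+        while (moved) {
+          moved = false;
+          for (int i = 0; i < windows.length; i++) {
+            // Mean of all original points inside this window's radius.
+            double sum = 0;
+            int count = 0;
+            for (double p : points) {
+              if (Math.abs(p - windows[i]) <= radius) {
+                sum += p;
+                count++;
+              }
+            }
+            double shift = sum / count - windows[i];   // the mean shift vector
+            windows[i] += shift;                       // migrate the window
+            if (Math.abs(shift) > delta) {
+              moved = true;
+            }
+          }
+        }
+        // Windows that converged to the same position form a cluster.
+        System.out.println(Arrays.toString(windows));
+      }
+    }
+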
+<a name="MeanShiftClustering-ReferenceImplementation"></a>
+## Reference Implementation
+
+The implementation introduced by MAHOUT-15 uses modified Canopy Clustering
+canopies to represent the mean shift windows. 
+* It uses the canopy's T1 distance threshold as the radius of the window,
+and the canopy's T2 threshold to decide when two canopies have converged
+and will thereby follow the same path. 
+* Each canopy contains one or more bound points which are represented using
+the (internally specified) integer ids of the bound points. 
+* The algorithm is initialized with a canopy containing each input point. 
+* During each iteration, every canopy calculates its mean shift vector by
+summing the canopy centers within its T1 threshold. This value is
+normalized and the resulting centroid becomes the new canopy center.
+    * The centers are weighted in proportion to their numbers of bound
+      points (weighted pair-group centroid)
+* If any canopies are within their T2 thresholds they are merged and their
+respective bound points are accumulated. 
+* The iterations complete when each canopy's mean shift vector has a
+magnitude less than a given termination delta. 
+* Upon termination, the remaining canopies contain sets of points which are
+the members of their cluster.
+
+<a name="MeanShiftClustering-Map/ReduceImplementation"></a>
+## Map/Reduce Implementation
+
+* Each mapper receives a subset of the canopies for each iteration. It
+compares each canopy with each one it has already seen and performs the T1
+and T2 distance tests using an arbitrary user-supplied DistanceMeasure. The
+mapper merges canopies within T2 distance, moves each canopy to its new
+centroid position and outputs the canopy to the reducer with a constant key
+ * A single reducer coalesces all the canopies from the combiners by
+performing another clustering iteration on them.
+ * A driver class manages the iteration and determines when either a
+maximum number of iterations occur or the termination criteria is reached. 
+
+
+<a name="MeanShiftClustering-RunningMeanShiftClustering"></a>
+## Running Mean Shift Clustering
+
+The Mean Shift clustering algorithm may be run using a command-line
+invocation on MeanShiftCanopyDriver.main or by making a Java call to
+MeanShiftCanopyDriver.run(). 
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout meanshift \
+        -i <input vectors directory> \
+        -o <output working directory> \
+        -inputIsCanopies <input directory contains mean shift canopies not
+vectors> \
+        -dm <DistanceMeasure> \
+        -t1 <the T1 threshold> \
+        -t2 <the T2 threshold> \
+        -x <maximum number of iterations> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present>
+        -cl <run input vector clustering after computing Clusters>
+        -xm <execution method: sequential or mapreduce>
+
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as
+a SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.
+1. t1: the T1 threshold is used to determine if clusters are close enough to
+influence each other's next mean calculation.
+1. t2: the T2 threshold is used to determine when two clusters are close
+enough to merge.
+1. convergence: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. max-iterations: the maximum number of iterations to run, independent of
+the convergence specified
+1. inputIsClusters: a boolean indicating, if true, that the input directory
+already contains MeanShiftCanopies and no further initialization is needed.
+If false (the default) input VectorWritables are used to form the initial
+canopies and these will be written to the clusters-0 directory.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the clustering is to
+be done using the sequential reference implementation in memory.
+
+After running the algorithm, the output directory will contain:
+
+1. clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy)
+produced by the algorithm for each iteration. The Text _key_ is a cluster
+identifier string.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster. As Mean Shift
+only produces a single clustering for each point, the weights are all == 1.
+
+<a name="MeanShiftClustering-Examples"></a>
+# Examples
+
+The following images illustrate Mean Shift clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0] sd=3.0
+* 300 samples m=[1.0, 0.0] sd=0.5
+* 300 samples m=[0.0, 2.0] sd=0.1
+
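+A data set of this shape can be generated with nothing more than
+java.util.Random; the snippet below only illustrates the sampling scheme
+and is not the display example's actual generator.
+
+    import java.util.Random;
+
+    public class SampleDataSketch {
+      // Draws 'count' 2-d points from a normal distribution with the given
+      // per-dimension means and a shared standard deviation.
+      static double[][] sample(int count, double meanX, double meanY, double sd, Random rand) {
+        double[][] points = new double[count][2];
+        for (int i = 0; i < count; i++) {
+          points[i][0] = meanX + sd * rand.nextGaussian();
+          points[i][1] = meanY + sd * rand.nextGaussian();
+        }
+        return points;
+      }
+
+      public static void main(String[] args) {
+        Random rand = new Random();
+        double[][] c1 = sample(500, 1.0, 1.0, 3.0, rand);
+        double[][] c2 = sample(300, 1.0, 0.0, 0.5, rand);
+        double[][] c3 = sample(300, 0.0, 2.0, 0.1, rand);
+        System.out.println(c1.length + c2.length + c3.length + " points generated");
+      }
+    }
+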
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. 
+
+!SampleData.png!
+
+In the second image, the resulting clusters (k=3) are shown superimposed
+upon the sample data. In this image, each cluster renders in a different
+color and the T1 and T2 radii are superimposed upon the final cluster
+centers determined by the algorithm. Mean Shift does an excellent job of
+clustering this data, though by its design the cluster membership is unique
+and the clusters do not overlap. 
+
+!MeanShift.png!
+
+The third image shows the results of running Mean Shift on a different data
+set (see [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ for details) which is generated using asymmetrical standard deviations.
+Mean Shift does an excellent job of clustering this data set too.
+
+!2dMeanShift.png!

Added: mahout/site/mahout_cms/content/users/clustering/mean-shift-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/mean-shift-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/mean-shift-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/mean-shift-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,74 @@
+Title: mean-shift-commandline
+<a name="mean-shift-commandline-RunningMeanShiftCanopyClusteringfromtheCommandLine"></a>
+# Running Mean Shift Canopy Clustering from the Command Line
+Mahout's Mean Shift clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Mean Shift on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout meanshift <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="mean-shift-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout meanshift -i testdata <OTHER OPTIONS>
+
+
+<a name="mean-shift-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout meanshift -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="mean-shift-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                          Path to job input directory. Must be a SequenceFile of VectorWritable
+      --output (-o) output                        The directory pathname for output.
+      --overwrite (-ow)                           If present, overwrite the output directory before running job
+      --distanceMeasure (-dm) distanceMeasure     The classname of the DistanceMeasure. Default is SquaredEuclidean
+      --help (-h)                                 Print out help
+      --convergenceDelta (-cd) convergenceDelta   The convergence delta value. Default is 0.5
+      --t1 (-t1) t1                               T1 threshold value
+      --t2 (-t2) t2                               T2 threshold value
+      --clustering (-cl)                          If present, run clustering after the iterations have taken place
+      --maxIter (-x) maxIter                      The maximum number of iterations.
+      --inputIsCanopies (-ic) inputIsCanopies     If present, the input directory already contains MeanShiftCanopies

Added: mahout/site/mahout_cms/content/users/clustering/spectral-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/spectral-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/spectral-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/spectral-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,262 @@
+Title: Spectral Clustering
+Spectral clustering, a more powerful and specialized algorithm (compared to
+K-means), derives its name from spectral analysis of a graph, which is how
+the data are represented. Each object to be clustered can initially be
+represented as an _n_-dimensional numeric vector, but the difference with
+this algorithm is that there must also be some method for performing a
+comparison between each object and expressing this comparison as a scalar.
+
+This _n_ by _n_ comparison of all objects with all others forms the
+_affinity_ matrix, which can be intuitively thought of as a rough
+representation of an underlying undirected, weighted, and fully-connected
+graph whose edges express the relative relationships, or affinities,
+between each pair of objects in the original data. This affinity matrix
+forms the basis from which the two spectral clustering algorithms operate.
+
+The equation by which the affinities are calculated can vary depending on
+the user's circumstances; typically, the equation takes the form of:
+
+exp( _-d^2_ / _c_ )
+
+where _d_ is the Euclidean distance between a pair of points, and _c_ is a
+scaling factor. _c_ is often calculated relative to a _k_-neighborhood of
+closest points to the current point; all other affinities are set to 0
+outside of the neighborhood. Again, this formula can vary depending on the
+situation (e.g. a fully-connected graph would ignore the _k_-neighborhood
+and calculate affinities for all pairs of points).
+
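+As a small illustration, the sketch below fills a fully-connected affinity
+matrix from 2-d points using exp(-d^2 / c) with a fixed scaling factor.
+This is only a toy example of how such a matrix might be produced; Mahout
+expects the affinities to be precomputed and supplied as input (see the
+Quickstart section below), and the self-affinities are set to 0 here to
+match the Quickstart example.
+
+    public class AffinitySketch {
+      public static void main(String[] args) {
+        double[][] points = {{0.0, 0.0}, {1.0, 0.0}, {5.0, 5.0}};
+        double c = 2.0;  // scaling factor; in practice often tuned per neighborhood
+        int n = points.length;
+        double[][] affinity = new double[n][n];
+        for (int i = 0; i < n; i++) {
+          for (int j = 0; j < n; j++) {
+            double dx = points[i][0] - points[j][0];
+            double dy = points[i][1] - points[j][1];
+            double d2 = dx * dx + dy * dy;            // squared Euclidean distance
+            affinity[i][j] = (i == j) ? 0.0 : Math.exp(-d2 / c);   // exp(-d^2 / c)
+          }
+        }
+        // Print in the "i, j, value" text format described in the Quickstart.
+        for (int i = 0; i < n; i++) {
+          for (int j = 0; j < n; j++) {
+            System.out.println(i + ", " + j + ", " + affinity[i][j]);
+          }
+        }
+      }
+    }
+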
+[Full overview on spectral clustering](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
+
+<a name="SpectralClustering-K-MeansSpectralClustering"></a>
+## K-Means Spectral Clustering
+
+<a name="SpectralClustering-Overview"></a>
+### Overview
+
+This consists of a few basic steps of generalized spectral clustering,
+followed by standard k-means clustering over the intermediate results.
+Again, this process begins with an affinity matrix *A* - whether or not it
+is fully-connected depends on the user's need.
+
+*A* is then transformed into a pseudo-Laplacian matrix via a multiplication
+with a diagonal matrix whose entries consist of the sums of the rows of
+*A*. The sums are modified to be the inverse square root of their original
+values. The final operation looks something like:
+
+L = D^{-1/2} A D^{-1/2}
+
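+A minimal sketch of that transformation on a plain in-memory matrix (rather
+than Mahout's DistributedRowMatrix) might look like the following, reusing
+the small 3x3 affinity matrix from the Quickstart section below:
+
+    public class LaplacianSketch {
+      public static void main(String[] args) {
+        // A small symmetric affinity matrix A.
+        double[][] a = {
+          {0.0, 0.8, 0.5},
+          {0.8, 0.0, 0.9},
+          {0.5, 0.9, 0.0}
+        };
+        int n = a.length;
+        // d[i] holds D_ii^{-1/2}, the inverse square root of row i's sum.
+        double[] d = new double[n];
+        for (int i = 0; i < n; i++) {
+          double rowSum = 0;
+          for (int j = 0; j < n; j++) {
+            rowSum += a[i][j];
+          }
+          d[i] = 1.0 / Math.sqrt(rowSum);
+        }
+        // L = D^{-1/2} A D^{-1/2}, i.e. L_ij = d[i] * A_ij * d[j].
+        double[][] l = new double[n][n];
+        for (int i = 0; i < n; i++) {
+          for (int j = 0; j < n; j++) {
+            l[i][j] = d[i] * a[i][j] * d[j];
+          }
+        }
+        System.out.println(java.util.Arrays.deepToString(l));
+      }
+    }
+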
+*L* has some properties that are of interest to us; most importantly, while
+it is symmetric like *A*, it has a more stable eigen-decomposition. *L* is
+decomposed into its constituent eigenvectors and corresponding eigenvalues
+(though the latter will not be needed for future calculations); the matrix
+of eigenvectors, *U*, is what we are now interested in.
+
+Assuming *U* is a column matrix (the eigenvectors comprise the columns),
+then we will now use the _rows_ of *U* as proxy data for the original data
+points. We will run each row through standard K-means clustering, and the
+label that each proxy point receives will be transparently assigned to the
+corresponding original data point, resulting in the final clustering
+assignments.
+
+[Full overview on k-means spectral clustering](http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/)
+
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+
+The Mahout implementation consists of a single driver -
+SpectralKMeansDriver - calling upon several common utilities. The driver
+performs the operations in sequential fashion: reading in and constructing
+the affinity matrix, building the diagonal matrix, building the
+pseudo-Laplacian and decomposing it, and clustering the components.
+
+The affinity matrix input is the most important part of this algorithm. It
+consists of text files which follow a specific format: that of a weighted,
+undirected graph. In order to represent a graph in text files, each line of
+a text file represents a single directional edge between two nodes. There
+are three comma-separated values on the line. The first number indicates
+the source node, the second is the destination node, and the third is the
+weight. For example:
+
+0, 1, 2.5
+
+would indicate the directional edge from node 0 to node 1 has a weight of
+2.5. *Please note: as of 8/16/2010, Eigencuts assumes the affinity matrix
+is symmetric, hence there should be a corresponding line in the text file
+of: 1, 0, 2.5.* Also, each node should be an integer value.
+
+*note*: per MAHOUT-978 don't name your input file with leading '_' (or
+'.').
+
+M/R jobs written for SpectralKMeans:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* UnitVectorizerJob (converts the eigenvector matrix *U* to unit rows)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R jobs already in Mahout that were used:
+* DistributedRowMatrix.transpose()
+* DistributedLanczosSolver
+* EigenVerifierJob
+* KMeansDriver
+
+<a name="SpectralClustering-EigencutsSpectralClustering"></a>
+## Eigencuts Spectral Clustering
+
+<a name="SpectralClustering-Overview"></a>
+### Overview
+
+Intuitively, Eigencuts can be thought of as part 2 of the K-means
+algorithm, in that it performs the same initial steps up until the k-means
+clustering. The algorithm uses the same affinity matrix *A*, constructs the
+same diagonal matrix *D*, performs the same multiplication to create the
+pseudo-Laplacian *L*, and conducts an eigen-decomposition on *L* to obtain
+the eigenvectors and eigenvalues. But here is where the two algorithms
+differentiate.
+
+For each eigenvector, we wish to determine how stable its flow of
+probability is within the underlying graph of the original data.
+Intuitively, this is intrinsically related to the min-cut, max-flow problem
+of finding bottlenecks: if we perturb the flow rate on a specific edge, and
+overall the flow is stable, then we can conclude that this edge was not a
+bottleneck. If, however, perturbing an edge significantly alters the
+overall flow, we know this edge's eigenflow is very unstable and is a
+bottleneck.
+
+We have an [explicit form](http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png)
+ of this "sensitivity" calculation ([full post here, under "computing
+sensitivities"|http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/]).
+The next step is called "non-maximal suppression", which effectively means
+we will ignore any of the calculated sensitivities for which there exists
+another more negative sensitivity in the same neighborhood as the current
+one, effectively "suppressing" it.
+
+This non-maximal suppression then plays a role in the final
+affinity-cutting step, where we "cut" the affinity (set to 0) between any
+two nodes (effectively destroying the edge between them) for which the
+sensitivity calculated at that edge passes some threshold, _and_ for which
+the sensitivity was _not_ suppressed in the previous step.
+
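+The sketch below is a heavily simplified, in-memory illustration of that
+cutting step; the real EigencutsAffinityCutsJob does this as a map/reduce
+pass over the distributed matrix. The threshold value, the array names, and
+the assumption that "passing the threshold" means being more negative than
+a negative cutoff are all this sketch's reading of the description above.
+
+    public class AffinityCutSketch {
+      /**
+       * Zeroes the affinity between node pairs whose sensitivity is more
+       * negative than the threshold and was not suppressed. Returns true if
+       * any cut was made, so the caller knows whether to keep iterating.
+       */
+      static boolean cutAffinities(double[][] affinity, double[][] sensitivity,
+                                   boolean[][] suppressed, double threshold) {
+        boolean cutsMade = false;
+        int n = affinity.length;
+        for (int i = 0; i < n; i++) {
+          for (int j = i + 1; j < n; j++) {
+            if (!suppressed[i][j] && sensitivity[i][j] < threshold && affinity[i][j] != 0.0) {
+              affinity[i][j] = 0.0;
+              affinity[j][i] = 0.0;   // keep the matrix symmetric
+              cutsMade = true;
+            }
+          }
+        }
+        return cutsMade;
+      }
+
+      public static void main(String[] args) {
+        double[][] affinity = {{0.0, 0.8}, {0.8, 0.0}};
+        double[][] sensitivity = {{0.0, -0.9}, {-0.9, 0.0}};
+        boolean[][] suppressed = {{false, false}, {false, false}};
+        System.out.println(cutAffinities(affinity, sensitivity, suppressed, -0.5));
+      }
+    }
+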
+Once the cutting has been completed, the process loops upon itself
+(starting with the recalculation of *D* using the modified *A*) until no
+new cuts in *A* are made in the final step.
+
+[Full overview on Eigencuts spectral clustering](http://spectrallyclustered.wordpress.com/2010/07/06/sprint-3-introduction-to-eigencuts/)
+
+<a name="SpectralClustering-Implementation"></a>
+### Implementation
+
+Since the first half of Eigencuts uses the same calculations as Spectral
+K-means, it uses the same common M/R tasks, both those specific to spectral
+clustering, as well as those general to Mahout. Unlike SpectralKMeans,
+however, there are no DistributedRowMatrix-specific operations performed,
+and hence there is no need for the data type at all; Mahout Vectors are
+used heavily instead.
+
+Once the initial affinity matrix is constructed, there is a loop within the
+EigencutsDriver over the calculation of *D*, the creation of *L* and its
+eigen-decomposition, the calculation of the sensitivities, and the actual
+affinity cuts, such that the loop terminates only when no cuts are made to
+the affinity matrix *A*. The final matrix *A* will then be representative
+of a graph structure whose data points exist in intra-connected clusters.
+
+M/R tasks specific to Eigencuts:
+* EigencutsSensitivityJob (calculates the perturbation effects on edge
+weights)
+* EigencutsAffinityCutsJob (sets edge weights to 0)
+
+M/R tasks within spectral clustering:
+* AffinityMatrixInputJob (reads the raw input into a DistributedRowMatrix)
+* MatrixDiagonalizeJob (constructs the diagonal matrix)
+* VectorMatrixMultiplicationJob (multiplies *D* with *A*)
+
+M/R tasks general to Mahout:
+* DistributedLanczosSolver
+* EigenVerifierJob
+
+<a name="SpectralClustering-Quickstart"></a>
+## Quickstart
+
+As noted before, the data for both these algorithms - the affinity matrix -
+is required to be symmetric. As of the first release, there is no built-in
+mechanism in Mahout for generating the affinities from raw data, as the
+formula this follows varies depending on the user's need, so it is left as
+an exercise to the user to generate the affinities prior to using these
+algorithms.
+
+The affinity input should follow the standard format for textual
+representation of a graph, namely:
+
+node_i, node_j, value_ij
+
+For example, the following 3x3 affinity matrix:
+
+<table>
+<tr><td> 0.0 </td><td> 0.8 </td><td> 0.5 </td></tr>
+<tr><td> 0.8 </td><td> 0.0 </td><td> 0.9 </td></tr>
+<tr><td> 0.5 </td><td> 0.9 </td><td> 0.0 </td></tr>
+</table>
+
+would have the following textual input format:
+
+0, 0, 0
+0, 1, 0.8
+0, 2, 0.5
+1, 0, 0.8
+1, 1, 0
+1, 2, 0.9
+2, 0, 0.5
+2, 1, 0.9
+2, 2, 0
+
+Then simply invoke Spectral K-means or Eigencuts, using the input directory
+and setting the other parameters as necessary (e.g. "k" for K-means, "beta"
+for Eigencuts, etc).
+
+Format *rules*: node numbering begins at zero and must be continuous; avoid
+whitespace, comments, etc.
+
+<a name="SpectralClustering-Examples"></a>
+## Examples
+
+*NOTE: Am still waiting for Carnegie Mellon/Univ. of Pittsburgh approval
+for official data set*
+
+For these algorithms, it is useful to have a viable example, so I have
+created a small but effective synthetic data set to show how these
+algorithms operate. The raw data was generated synthetically and [can be viewed here](http://dl.dropbox.com/u/1377610/rawdata.csv)
+. It consists of 450 two-dimensional points drawn from 3 separate Gaussian
+distributions with their own means and standard deviations ([here is an image of the plotted points](http://spectrallyclustered.files.wordpress.com/2010/07/clusters.png)).
+This same data in affinity matrix form looks like this: [view the affinity data set](http://dl.dropbox.com/u/1377610/affinity.csv)
+
+In order to run the program, then, for spectral k-means:
+
+    bin/mahout spectralkmeans -i /path/to/directory/with/affinity/matrix \
+        -o /output/path -k 3 -d 450
+
+and for eigencuts:
+
+    bin/mahout eigencuts -i /path/to/directory/with/affinity/matrix \
+        -o /output/path -b 2 -d 450
+
+In both cases, the "-d" flag refers to the dimensionality of the affinity
+matrix; since there are 450 points, this value will be 450. In spectral
+k-means, the "-k" is the number of clusters to estimate. In Eigencuts, the
+"-b" refers to an initial estimate on the half-life of the flow of
+probability in the graph; higher estimates correspond to fewer clusters.
+
+Spectral k-means will eventually yield the clustering results, and there
+should only be a single mistake: point 54. This is because that particular
+point is in between the two left-hand clusters in the image, and so has
+high affinities with both clusters.
+
+[Full overview of example here](http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/)
+
+<a name="SpectralClustering-Resources"></a>
+### Resources
+
+* [http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/](http://spectrallyclustered.wordpress.com/2010/05/27/intro-and-spectral-clustering-101/)
+* [http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf](http://www.stanford.edu/class/ee378B/papers/luxburg-spectral.pdf)
+* [http://en.wikipedia.org/wiki/Laplacian_matrix](http://en.wikipedia.org/wiki/Laplacian_matrix)
+* [http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering](http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering)