
svn commit: r887480 - in /websites/staging/mahout/trunk/content: ./ users/clustering/k-means-clustering.html

Author: buildbot
Date: Thu Nov 21 10:43:57 2013
New Revision: 887480

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 21 10:43:57 2013
@@ -1 +1 @@
-1544096
+1544098

Modified: websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html (original)
+++ websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html Thu Nov 21 10:43:57 2013
@@ -381,7 +381,8 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p>k-Means is a rather {color:#ff0000}simple{color} but well known algorithm
+    <h1 id="k-means-clustering-basics">k-Means clustering - basics</h1>
+<p><a href="http://en.wikipedia.org/wiki/Kmeans">k-Means</a> is a rather simple but well known algorithm
 for grouping objects, clustering. Again all objects need to be represented
 as a set of numerical features. In addition the user has to specify the
 number of groups (referred to as <em>k</em>) he wishes to identify.</p>
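 <p>To make the procedure concrete, here is a minimal, self-contained sketch of a
 single k-Means iteration in plain Java. This is an illustration only, not the
 Mahout implementation; the array-based points and centroids are assumptions made
 for brevity:</p>
 <div class="codehilite"><pre>// One k-Means iteration: assign each point to its nearest centroid by
 // squared Euclidean distance, then recompute each centroid as the mean
 // of the points assigned to it.
 static double[][] iterate(double[][] points, double[][] centroids) {
   int k = centroids.length, dim = centroids[0].length;
   double[][] sums = new double[k][dim];
   int[] counts = new int[k];
   for (double[] p : points) {
     int best = 0;
     double bestDist = Double.MAX_VALUE;
     for (int c = 0; c &lt; k; c++) {
       double dist = 0.0;
       for (int d = 0; d &lt; dim; d++) {
         double diff = p[d] - centroids[c][d];
         dist += diff * diff;
       }
       if (dist &lt; bestDist) { bestDist = dist; best = c; }
     }
     counts[best]++;
     for (int d = 0; d &lt; dim; d++) sums[best][d] += p[d];
   }
   // an empty cluster keeps its previous centroid
   double[][] next = new double[k][dim];
   for (int c = 0; c &lt; k; c++)
     for (int d = 0; d &lt; dim; d++)
       next[c][d] = (counts[c] == 0) ? centroids[c][d] : sums[c][d] / counts[c];
   return next;
 }
 </pre></div>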
@@ -419,11 +420,11 @@ offer you a better understanding.</p>
 <p><a name="K-MeansClustering-Strategyforparallelization"></a></p>
 <h2 id="strategy-for-parallelization">Strategy for parallelization</h2>
 <p>Some ideas can be found in <a href="http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html">Cluster computing and MapReduce</a>
- lecture video series [by Google(r)]; k-Means clustering is discussed in [lecture #4|http://www.youtube.com/watch?v=1ZDybXl212Q]
-. Slides can be found [here|http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt]
+ lecture video series by Google; k-Means clustering is discussed in <a href="http://www.youtube.com/watch?v=1ZDybXl212Q">lecture 4</a>.
+Slides can be found <a href="http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt">here</a>.</p>
 <p>Interestingly, a Hadoop-based implementation using <a href="http://en.wikipedia.org/wiki/Canopy_clustering_algorithm">Canopy clustering</a>
- seems to be here: [http://code.google.com/p/canopy-clustering/]
+ seems to be here: <a href="http://code.google.com/p/canopy-clustering/">canopy</a>
  (GPL 3 licence)</p>
 <p>Here's another useful reference: <a href="http://www2.chass.ncsu.edu/garson/PA765/cluster.htm">http://www2.chass.ncsu.edu/garson/PA765/cluster.htm</a>.</p>
@@ -439,35 +440,39 @@ clustering and convergence values.</p>
 <p>The program iterates over the input points and clusters, outputting a new
 directory "clusters-N" containing SequenceFile(Text, Cluster) files for
 each iteration N. This process uses a mapper/combiner/reducer/driver as
-follows:
-<em> KMeansMapper - reads the input clusters during its setup() method, then
+follows:</p>
+<ul>
+<li>KMeansMapper - reads the input clusters during its setup() method, then
 assigns and outputs each input point to its nearest cluster as defined by
 the user-supplied distance measure. Output key is: cluster identifier.
-Output value is: ClusterObservation.
-</em> KMeansCombiner - receives all key:value pairs from the mapper and
+Output value is: ClusterObservation (a schematic sketch of this step follows the list).</li>
+<li>KMeansCombiner - receives all key:value pairs from the mapper and
 produces partial sums of the input vectors for each cluster. Output key is:
-cluster identifier. Output value is ClusterObservation.
-<em> KMeansReducer - a single reducer receives all key:value pairs from all
+cluster identifier. Output value is ClusterObservation.</li>
+<li>KMeansReducer - a single reducer receives all key:value pairs from all
 combiners and sums them to produce a new centroid for the cluster which is
 output. Output key is: encoded cluster identifier. Output value is:
 Cluster. The reducer encodes unconverged clusters with a 'Cn' cluster Id
-and converged clusters with 'Vn' clusterId.
-</em> KMeansDriver - iterates over the points and clusters until all output
+and converged clusters with 'Vn' clusterId.</li>
+<li>KMeansDriver - iterates over the points and clusters until all output
 clusters have converged (Vn clusterIds) or until a maximum number of
 iterations has been reached. During iterations, a new clusters directory
 "clusters-N" is produced with the output clusters from the previous
 iteration used for input to the next. A final optional pass over the data
 using the KMeansClusterMapper clusters all points to an output directory
-"clusteredPoints" and has no combiner or reducer steps.</p>
-<p>Canopy clustering can be used to compute the initial clusters for k-KMeans:
-{quote}
-// run the CanopyDriver job
-CanopyDriver.runJob("testdata", "output"
-ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);</p>
-<p>// now run the KMeansDriver job
-KMeansDriver.runJob("testdata", "output/clusters-0", "output",
-EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
-{quote}</p>
+"clusteredPoints" and has no combiner or reducer steps.</li>
+</ul>
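+<p>As a rough illustration of the mapper step above, the following schematic
+Java shows the assign-and-emit logic; <em>clusters</em>, <em>measure</em> and the
+value type are loose stand-ins and do not exactly match the Mahout class
+signatures:</p>
+<div class="codehilite"><pre>// Schematic KMeansMapper.map(): find the nearest cluster for the input
+// point and emit (cluster identifier, one observation of the point).
+protected void map(WritableComparable&lt;?&gt; key, VectorWritable point, Context context)
+    throws IOException, InterruptedException {
+  Cluster nearest = null;
+  double nearestDistance = Double.MAX_VALUE;
+  for (Cluster cluster : clusters) {        // clusters were read in setup()
+    double d = measure.distance(cluster.getCenter(), point.get());
+    if (d &lt; nearestDistance) {
+      nearest = cluster;
+      nearestDistance = d;
+    }
+  }
+  context.write(new Text(String.valueOf(nearest.getId())),
+                new ClusterObservations(1, point.get()));
+}
+</pre></div>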
+<p>Canopy clustering can be used to compute the initial clusters for k-Means:</p>
+<div class="codehilite"><pre><span class="c1">// run the CanopyDriver job</span>
+<span class="n">CanopyDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">&quot;testdata&quot;</span><span class="p">,</span> <span class="s">&quot;output&quot;</span>
+<span class="n">ManhattanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">3.1</span><span class="p">,</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">2.1</span><span class="p">,</span> <span class="n">false</span><span class="p">);</span>
+
+<span class="c1">// now run the KMeansDriver job</span>
+<span class="n">KMeansDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">&quot;testdata&quot;</span><span class="p">,</span> <span class="s">&quot;output/clusters-0&quot;</span><span class="p">,</span> <span class="s">&quot;output&quot;</span><span class="p">,</span>
+<span class="n">EuclideanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="s">&quot;0.001&quot;</span><span class="p">,</span> <span class="s">&quot;10&quot;</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
+</pre></div>
+
+
 <p>In the above example, the input data points are stored in 'testdata' and
 the CanopyDriver is configured to output to the 'output/clusters-0'
 directory. Once the driver executes, it will contain the canopy definition
@@ -476,10 +481,7 @@ more new directories: 'clusters-N'' cont
 iteration and 'clusteredPoints' will contain the clustered data points.</p>
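 <p>With the settings above, the output directory therefore ends up with roughly
 the following layout (the final iteration number N depends on when the clusters
 converge):</p>
 <div class="codehilite"><pre>output/
   clusters-0/        # initial clusters (here: the canopies from CanopyDriver)
   clusters-1/        # clusters after the first k-Means iteration
   ...
   clusters-N/        # converged (or max-iteration) clusters
   clusteredPoints/   # optional final assignment of the points to clusters
 </pre></div>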
 <p>This diagram shows the example dataflow of the k-Means
 implementation provided by Mahout:
-{gliffy:name=Example implementation of k-Means provided with
-Mahout|space=MAHOUT|page=k-Means|pageid=75159|align=left|size=L|version=7}</p>
-<p>This diagram doesn't consider CanopyClustering:
-{gliffy:name=k-Means Example|space=MAHOUT|page=k-Means|align=left|size=L}</p>
+<img alt="dataflow" src="../../images/Example" title="implementation of k-Means provided with Mahout.png" /></p>
 <p><a name="K-MeansClustering-Runningk-MeansClustering"></a></p>
 <h2 id="running-k-means-clustering">Running k-Means Clustering</h2>
 <p>The k-Means clustering algorithm may be run using a command-line invocation
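 <p>As a rough sketch, such an invocation could look like the following; the paths
 are placeholders, and the flags are those of the Mahout 0.x <em>kmeans</em> job
 (see <em>bin/mahout kmeans --help</em> for the authoritative list). Here -k samples
 k random points as initial centers into the -c directory, -x caps the iterations,
 -cd sets the convergence delta, -ow overwrites prior output and -cl runs the final
 pass that clusters the points:</p>
 <div class="codehilite"><pre>$ bin/mahout kmeans \
     -i testdata/points \
     -c testdata/clusters \
     -o output \
     -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
     -k 3 -x 10 -cd 0.001 \
     -ow -cl
 </pre></div>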
@@ -554,16 +556,16 @@ deviation. See the README file in the <a
 </ul>
 <p>In the first image, the points are plotted and the 3-sigma boundaries of
 their generator are superimposed.</p>
-<p>!SampleData.png!</p>
+<p><img alt="Sample data graph" src="../../images/SampleData.png" /></p>
 <p>In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As k-Means is an iterative algorithm, the cluster centers from each iteration are shown in a different color. Bold red is the final clustering; previous iterations are shown in orange, yellow, green, blue, violet and gray.
 Although it misses a lot of the points and cannot capture the original,
 superimposed cluster centers, it does a decent job of clustering this data.</p>
-<p>!KMeans.png!</p>
+<p><img alt="kmeans" src="../../images/KMeans.png" /></p>
 <p>The third image shows the results of running k-Means on a different data
 set (see <a href="dirichlet-process-clustering.html">Dirichlet Process Clustering</a>
  for details) which is generated using asymmetrical standard deviations.
 k-Means does a fair job handling this data set as well.</p>
-<p>!2dKMeans.png!</p>
+<p><img alt="2d kmeans" src="../../images/2dKMeans.png" /></p>
    </div>
   </div>     
 </div>