Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/21 11:43:57 UTC
svn commit: r887480 - in /websites/staging/mahout/trunk/content: ./
users/clustering/k-means-clustering.html
Author: buildbot
Date: Thu Nov 21 10:43:57 2013
New Revision: 887480
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 21 10:43:57 2013
@@ -1 +1 @@
-1544096
+1544098
Modified: websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html (original)
+++ websites/staging/mahout/trunk/content/users/clustering/k-means-clustering.html Thu Nov 21 10:43:57 2013
@@ -381,7 +381,8 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <p>k-Means is a rather {color:#ff0000}simple{color} but well known algorithm
+ <h1 id="k-means-clustering-basics">k-Means clustering - basics</h1>
+<p><a href="http://en.wikipedia.org/wiki/Kmeans">k-Means</a> is a rather simple but well-known algorithm
for grouping objects (clustering). Again, all objects need to be represented
as a set of numerical features. In addition, the user has to specify the
number of groups (referred to as <em>k</em>) to identify.</p>
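The idea above can be sketched in a few lines of plain Java. This is a minimal, illustrative k-Means (not Mahout's distributed implementation): each iteration assigns every point to its nearest centroid, then moves each centroid to the mean of its assigned points. The class and method names are invented for this sketch.

```java
import java.util.Arrays;

// Minimal in-memory k-Means sketch: assign each point to its nearest
// centroid, then recompute each centroid as the mean of its points.
public class KMeansSketch {

    // Squared Euclidean distance between two feature vectors.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (dist2(point, centroids[c]) < dist2(point, centroids[best])) best = c;
        }
        return best;
    }

    // Run maxIter assignment/update rounds; returns the final centroids.
    static double[][] kmeans(double[][] points, double[][] centroids, int maxIter) {
        int k = centroids.length, dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) sums[c][i] += p[i];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster in place
                for (int i = 0; i < dim; i++) centroids[c][i] = sums[c][i] / counts[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
        double[][] centroids = {{0, 0}, {10, 10}}; // k = 2 starting guesses
        System.out.println(Arrays.deepToString(kmeans(points, centroids, 10)));
        // → [[1.25, 1.5], [8.5, 8.5]]
    }
}
```

Note that, as in the real algorithm, the result depends on the initial centroid choice; the Canopy-seeding trick described later in this page addresses exactly that.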
@@ -419,11 +420,11 @@ offer you a better understanding.</p>
<p><a name="K-MeansClustering-Strategyforparallelization"></a></p>
<h2 id="strategy-for-parallelization">Strategy for parallelization</h2>
<p>Some ideas can be found in <a href="http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html">Cluster computing and MapReduce</a>
- lecture video series [by Google(r)]; k-Means clustering is discussed in [lecture #4|http://www.youtube.com/watch?v=1ZDybXl212Q]
-. Slides can be found [here|http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt]
+ lecture video series by Google(r); k-Means clustering is discussed in <a href="http://www.youtube.com/watch?v=1ZDybXl212Q">lecture 4</a>
+. Slides can be found <a href="http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt">here</a>
.</p>
<p>Interestingly, a Hadoop-based implementation using <a href="http://en.wikipedia.org/wiki/Canopy_clustering_algorithm">Canopy-clustering</a>
- seems to be here: [http://code.google.com/p/canopy-clustering/]
+ seems to be here: <a href="http://code.google.com/p/canopy-clustering/">canopy</a>
(GPL 3 licence)</p>
<p>Here's another useful reference: <a href="http://www2.chass.ncsu.edu/garson/PA765/cluster.htm">http://www2.chass.ncsu.edu/garson/PA765/cluster.htm</a>
.</p>
@@ -439,35 +440,39 @@ clustering and convergence values.</p>
<p>The program iterates over the input points and clusters, outputting a new
directory "clusters-N" containing SequenceFile(Text, Cluster) files for
each iteration N. This process uses a mapper/combiner/reducer/driver as
-follows:
-<em> KMeansMapper - reads the input clusters during its setup() method, then
+follows:</p>
+<ul>
+<li>KMeansMapper - reads the input clusters during its setup() method, then
assigns and outputs each input point to its nearest cluster as defined by
the user-supplied distance measure. Output key is: cluster identifier.
-Output value is: ClusterObservation.
-</em> KMeansCombiner - receives all key:value pairs from the mapper and
+Output value is: ClusterObservation.</li>
+<li>KMeansCombiner - receives all key:value pairs from the mapper and
produces partial sums of the input vectors for each cluster. Output key is:
-cluster identifier. Output value is ClusterObservation.
-<em> KMeansReducer - a single reducer receives all key:value pairs from all
+cluster identifier. Output value is ClusterObservation.</li>
+<li>KMeansReducer - a single reducer receives all key:value pairs from all
combiners and sums them to produce a new centroid for the cluster which is
output. Output key is: encoded cluster identifier. Output value is:
Cluster. The reducer encodes unconverged clusters with a 'Cn' cluster Id
-and converged clusters with 'Vn' clusterId.
-</em> KMeansDriver - iterates over the points and clusters until all output
+and converged clusters with 'Vn' clusterId.</li>
+<li>KMeansDriver - iterates over the points and clusters until all output
clusters have converged (Vn clusterIds) or until a maximum number of
iterations has been reached. During iterations, a new clusters directory
"clusters-N" is produced with the output clusters from the previous
iteration used for input to the next. A final optional pass over the data
using the KMeansClusterMapper clusters all points to an output directory
-"clusteredPoints" and has no combiner or reducer steps.</p>
-<p>Canopy clustering can be used to compute the initial clusters for k-KMeans:
-{quote}
-// run the CanopyDriver job
-CanopyDriver.runJob("testdata", "output"
-ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);</p>
-<p>// now run the KMeansDriver job
-KMeansDriver.runJob("testdata", "output/clusters-0", "output",
-EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
-{quote}</p>
+"clusteredPoints" and has no combiner or reducer steps.</li>
+</ul>
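The combiner/reducer arithmetic in the list above can be sketched in plain Java. The names below (`Observation`, `reduce`) are hypothetical stand-ins, not Mahout's actual `ClusterObservation`/`KMeansReducer` classes; the sketch only illustrates the partial-sum merge and the 'Cn'/'Vn' convergence encoding, assuming convergence is judged by how far the centroid moved.

```java
// Sketch of the partial-sum aggregation described above: mappers observe
// points, combiners merge partial sums per cluster, and the reducer derives
// the new centroid and tags the cluster id 'V' (converged) or 'C' (not).
public class ReducerSketch {

    // Partial aggregate for one cluster: vector sum plus point count.
    static class Observation {
        final double[] sum;
        long count;
        Observation(int dim) { sum = new double[dim]; }
        void observe(double[] point) {            // mapper side: add one point
            for (int i = 0; i < sum.length; i++) sum[i] += point[i];
            count++;
        }
        void merge(Observation other) {           // combiner/reducer side
            for (int i = 0; i < sum.length; i++) sum[i] += other.sum[i];
            count += other.count;
        }
        double[] centroid() {                     // mean of all observed points
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) c[i] = sum[i] / count;
            return c;
        }
    }

    // Reducer step: merge all partials for one cluster, compute the new
    // centroid, and encode convergence in the emitted cluster id.
    static String reduce(int clusterId, double[] oldCentroid,
                         Observation[] partials, double convergenceDelta) {
        Observation total = new Observation(oldCentroid.length);
        for (Observation p : partials) total.merge(p);
        double[] c = total.centroid();
        double d2 = 0;
        for (int i = 0; i < c.length; i++) {
            double d = c[i] - oldCentroid[i];
            d2 += d * d;
        }
        boolean converged = Math.sqrt(d2) <= convergenceDelta;
        return (converged ? "V" : "C") + clusterId;
    }
}
```

Because `merge` is associative, partial sums can be combined in any order on any node, which is what makes the combiner step safe and effective at reducing shuffle traffic.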
+<p>Canopy clustering can be used to compute the initial clusters for k-Means:</p>
+<div class="codehilite"><pre><span class="c1">// run the CanopyDriver job</span>
+<span class="n">CanopyDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">"testdata"</span><span class="p">,</span> <span class="s">"output"</span><span class="p">,</span>
+<span class="n">ManhattanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">3.1</span><span class="p">,</span> <span class="p">(</span><span class="n">float</span><span class="p">)</span> <span class="mf">2.1</span><span class="p">,</span> <span class="n">false</span><span class="p">);</span>
+
+<span class="c1">// now run the KMeansDriver job</span>
+<span class="n">KMeansDriver</span><span class="p">.</span><span class="n">runJob</span><span class="p">(</span><span class="s">"testdata"</span><span class="p">,</span> <span class="s">"output/clusters-0"</span><span class="p">,</span> <span class="s">"output"</span><span class="p">,</span>
+<span class="n">EuclideanDistanceMeasure</span><span class="p">.</span><span class="k">class</span><span class="p">.</span><span class="n">getName</span><span class="p">(),</span> <span class="s">"0.001"</span><span class="p">,</span> <span class="s">"10"</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
+</pre></div>
+
+
<p>In the above example, the input data points are stored in 'testdata' and
the CanopyDriver is configured to output to the 'output/clusters-0'
directory. Once the driver executes it will contain the canopy definition
@@ -476,10 +481,7 @@ more new directories: 'clusters-N'' cont
iteration and 'clusteredPoints' will contain the clustered data points.</p>
<p>This diagram shows the exemplary dataflow of the k-Means example
implementation provided by Mahout:
-{gliffy:name=Example implementation of k-Means provided with
-Mahout|space=MAHOUT|page=k-Means|pageid=75159|align=left|size=L|version=7}</p>
-<p>This diagram doesn't consider CanopyClustering:
-{gliffy:name=k-Means Example|space=MAHOUT|page=k-Means|align=left|size=L}</p>
+<img alt="dataflow" src="../../images/Example%20implementation%20of%20k-Means%20provided%20with%20Mahout.png" title="Example implementation of k-Means provided with Mahout" /></p>
<p><a name="K-MeansClustering-Runningk-MeansClustering"></a></p>
<h2 id="running-k-means-clustering">Running k-Means Clustering</h2>
<p>The k-Means clustering algorithm may be run using a command-line invocation
@@ -554,16 +556,16 @@ deviation. See the README file in the <a
</ul>
<p>In the first image, the points are plotted and the 3-sigma boundaries of
their generator are superimposed.</p>
-<p>!SampleData.png!</p>
+<p><img alt="Sample data graph" src="../../images/SampleData.png" /></p>
<p>In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As k-Means is an iterative algorithm, the centers of the clusters in each iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray
. Although it misses a lot of the points and cannot capture the original,
superimposed cluster centers, it does a decent job of clustering this data.</p>
-<p>!KMeans.png!</p>
+<p><img alt="kmeans" src="../../images/KMeans.png" /></p>
<p>The third image shows the results of running k-Means on a different data
set (see <a href="dirichlet-process-clustering.html">Dirichlet Process Clustering</a>
for details) which is generated using asymmetrical standard deviations.
K-Means does a fair job handling this data set as well.</p>
-<p>!2dKMeans.png!</p>
+<p><img alt="2d kmeans" src="../../images/2dKMeans.png" /></p>
</div>
</div>
</div>