Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/21 11:50:16 UTC
svn commit: r887483 - in /websites/staging/mahout/trunk/content: ./
users/clustering/k-means-commandline.html
Author: buildbot
Date: Thu Nov 21 10:50:15 2013
New Revision: 887483
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 21 10:50:15 2013
@@ -1 +1 @@
-1544100
+1544101
Modified: websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html (original)
+++ websites/staging/mahout/trunk/content/users/clustering/k-means-commandline.html Thu Nov 21 10:50:15 2013
@@ -382,7 +382,7 @@
<div id="content-wrap" class="clearfix">
<div id="main">
<p><a name="k-means-commandline-Introduction"></a></p>
-<h1 id="introduction">Introduction</h1>
+<h1 id="kmeans-commandline-introduction">kMeans commandline introduction</h1>
<p>This quick start page describes how to run the kMeans clustering algorithm
on a Hadoop cluster. </p>
<p><a name="k-means-commandline-Steps"></a></p>
@@ -398,12 +398,10 @@ missing then the stand-alone Hadoop conf
</pre></div>
-<ul>
-<li>In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
+<p>In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
the Mahout version number. For example, when using the Mahout 0.3 release, the
-job will be mahout-core-0.3.job</li>
-</ul>
+job will be mahout-core-0.3.job</p>
<p><a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a></p>
<h2 id="testing-it-on-one-single-machine-wo-cluster">Testing it on one single machine w/o cluster</h2>
<ul>
@@ -424,9 +422,7 @@ org.apache.mahout.common.distance.Cosine
<p>Run the Job: </p>
<p>export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
-./bin/mahout kmeans -i testdata -o output -c clusters -dm
-org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k
-25</p>
+./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25</p>
</li>
<li>
<p>Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
@@ -438,46 +434,35 @@ to view all outputs.</p>
<div class="codehilite"><pre> <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span>
<span class="n">Must</span> <span class="n">be</span> <span class="n">a</span> <span class="n">SequenceFile</span> <span class="n">of</span>
<span class="n">VectorWritable</span>
- <span class="o">--</span><span class="n">clusters</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">clusters</span> <span class="n">The</span> <span class="n">input</span> <span class="n">centroids</span><span class="p">,</span> <span class="n">as</span>
+ <span class="o">--</span><span class="n">clusters</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">clusters</span> <span class="n">The</span> <span class="n">input</span> <span class="n">centroids</span><span class="p">,</span> <span class="n">as</span> <span class="n">Vectors</span><span class="p">.</span>
+ <span class="n">Must</span> <span class="n">be</span> <span class="n">a</span> <span class="n">SequenceFile</span> <span class="n">of</span>
+ <span class="n">Writable</span><span class="p">,</span> <span class="n">Cluster</span><span class="o">/</span><span class="n">Canopy</span><span class="p">.</span> <span class="n">If</span> <span class="n">k</span>
+ <span class="n">is</span> <span class="n">also</span> <span class="n">specified</span><span class="p">,</span> <span class="n">then</span> <span class="n">a</span> <span class="n">random</span>
+ <span class="n">set</span> <span class="n">of</span> <span class="n">vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">selected</span>
+ <span class="n">and</span> <span class="n">written</span> <span class="n">out</span> <span class="n">to</span> <span class="n">this</span> <span class="n">path</span>
+ <span class="n">first</span>
+ <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span>
+ <span class="n">output</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">distanceMeasure</span> <span class="p">(</span><span class="o">-</span><span class="n">dm</span><span class="p">)</span> <span class="n">distanceMeasure</span> <span class="n">The</span> <span class="n">classname</span> <span class="n">of</span> <span class="n">the</span>
+ <span class="n">DistanceMeasure</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span>
+ <span class="n">SquaredEuclidean</span>
+ <span class="o">--</span><span class="n">convergenceDelta</span> <span class="p">(</span><span class="o">-</span><span class="n">cd</span><span class="p">)</span> <span class="n">convergenceDelta</span> <span class="n">The</span> <span class="n">convergence</span> <span class="n">delta</span> <span class="n">value</span><span class="p">.</span>
+ <span class="n">Default</span> <span class="n">is</span> 0<span class="p">.</span>5
+ <span class="o">--</span><span class="n">maxIter</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxIter</span> <span class="n">The</span> <span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span>
+ <span class="n">iterations</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">maxRed</span> <span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="p">)</span> <span class="n">maxRed</span> <span class="n">The</span> <span class="n">number</span> <span class="n">of</span> <span class="n">reduce</span> <span class="n">tasks</span><span class="p">.</span>
+ <span class="n">Defaults</span> <span class="n">to</span> 2
+ <span class="o">--</span><span class="n">k</span> <span class="p">(</span><span class="o">-</span><span class="n">k</span><span class="p">)</span> <span class="n">k</span> <span class="n">The</span> <span class="n">k</span> <span class="n">in</span> <span class="n">k</span><span class="o">-</span><span class="n">Means</span><span class="p">.</span> <span class="n">If</span> <span class="n">specified</span><span class="p">,</span>
+ <span class="n">then</span> <span class="n">a</span> <span class="n">random</span> <span class="n">selection</span> <span class="n">of</span> <span class="n">k</span>
+ <span class="n">Vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">chosen</span> <span class="n">as</span> <span class="n">the</span>
+ <span class="n">Centroid</span> <span class="n">and</span> <span class="n">written</span> <span class="n">to</span> <span class="n">the</span>
+ <span class="n">clusters</span> <span class="n">input</span> <span class="n">path</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span>
+ <span class="n">directory</span> <span class="n">before</span> <span class="n">running</span> <span class="n">job</span>
+ <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
+ <span class="o">--</span><span class="n">clustering</span> <span class="p">(</span><span class="o">-</span><span class="n">cl</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">run</span> <span class="n">clustering</span> <span class="n">after</span>
+ <span class="n">the</span> <span class="n">iterations</span> <span class="n">have</span> <span class="n">taken</span> <span class="n">place</span>
</pre></div>
-
-
-<p>Vectors.
- Must be a SequenceFile of <br />
- Writable, Cluster/Canopy.
-If k<br />
- is also specified, then a
-random
- set of vectors will be
-selected<br />
- and written out to this path
- first <br />
- --output (-o) output The directory pathname for <br />
- output. <br />
- --distanceMeasure (-dm) distanceMeasure The classname of the <br />
- DistanceMeasure. Default is<br />
- SquaredEuclidean <br />
- --convergenceDelta (-cd) convergenceDelta The convergence delta value.
- Default is 0.5 <br />
- --maxIter (-x) maxIter The maximum number of <br />
- iterations. <br />
- --maxRed (-r) maxRed The number of reduce tasks.<br />
- Defaults to 2 <br />
- --k (-k) k The k in k-Means. If
-specified,
- then a random selection of k
- Vectors will be chosen as
-the <br />
- Centroid and written to the<br />
- clusters input path. <br />
- --overwrite (-ow) If present, overwrite the
-output
- directory before running job
- --help (-h) Print out help <br />
- --clustering (-cl) If present, run clustering
-after
- the iterations have taken
-place </p>
</div>
</div>
</div>
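
For reference, the kmeans invocation in the page above can be sketched as a small dry-run script. This is only an illustration: the variable names and the build_kmeans_cmd helper are hypothetical, while the flags and values are the ones shown in the page's example command. The script prints the command instead of running it, so it works without a Mahout or Hadoop installation.

```shell
#!/bin/sh
# Dry-run sketch: assemble the kmeans command line from the options
# documented in the page. Variable names are illustrative assumptions.
INPUT=testdata            # -i  : SequenceFile of VectorWritable
OUTPUT=output             # -o  : output directory
CLUSTERS=clusters         # -c  : initial centroids path (written first when -k is given)
DISTANCE=org.apache.mahout.common.distance.CosineDistanceMeasure  # -dm
MAX_ITER=5                # -x  : maximum number of iterations
CONVERGENCE_DELTA=1       # -cd : convergence delta
K=25                      # -k  : number of randomly chosen initial centroids

# Hypothetical helper: prints the command rather than executing it.
build_kmeans_cmd() {
  echo "./bin/mahout kmeans -i $INPUT -o $OUTPUT -c $CLUSTERS -dm $DISTANCE -x $MAX_ITER -cd $CONVERGENCE_DELTA -k $K -ow"
}

CMD=$(build_kmeans_cmd)
echo "$CMD"
```

To actually run the job, the printed command would be executed from $MAHOUT_HOME with HADOOP_HOME and HADOOP_CONF_DIR exported as described in the page.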