Posted to commits@mahout.apache.org by bu...@apache.org on 2015/04/03 03:36:05 UTC

svn commit: r946146 - in /websites/staging/mahout/trunk/content: ./ users/clustering/cluster-dumper.html

Author: buildbot
Date: Fri Apr  3 01:36:05 2015
New Revision: 946146

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri Apr  3 01:36:05 2015
@@ -1 +1 @@
-1670767
+1670995

Modified: websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html (original)
+++ websites/staging/mahout/trunk/content/users/clustering/cluster-dumper.html Fri Apr  3 01:36:05 2015
@@ -252,37 +252,60 @@
   <div id="content-wrap" class="clearfix">
    <div id="main">
     <p><a name="ClusterDumper-Introduction"></a></p>
-<h1 id="cluster-dumper-introduction">Cluster Dumper - Introduction</h1>
+<h2 id="cluster-dumper-introduction">Cluster Dumper - Introduction</h2>
 <p>Clustering tasks in Mahout will output data in the format of a SequenceFile
 (Text, Cluster) and the Text is a cluster identifier string. To analyze
 this output we need to convert the sequence files to a human readable
 format and this is achieved using the clusterdump utility.</p>
 <p><a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a></p>
-<h1 id="steps-for-analyzing-cluster-output-using-clusterdump-utility">Steps for analyzing cluster output using clusterdump utility</h1>
+<h2 id="steps-for-analyzing-cluster-output-using-clusterdump-utility">Steps for analyzing cluster output using clusterdump utility</h2>
 <p>After you've executed a clustering tasks (either examples or real-world),
-you can run clusterdumper in 2 modes.</p>
+you can run clusterdumper in 2 modes:</p>
 <ol>
-<li><a href="#hadoopenvironment.html">Hadoop Environment</a></li>
-<li><a href="#standalonejavaprogram.html">Standalone Java Program </a></li>
+<li>Hadoop Environment</li>
+<li>Standalone Java Program </li>
 </ol>
 <p><a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a></p>
 <h3 id="hadoop-environment">Hadoop Environment</h3>
 <p>If you have setup your HADOOP_HOME environment variable, you can use the
-command line utility "mahout" to execute the ClusterDumper on Hadoop. In
+command line utility <code>mahout</code> to execute the ClusterDumper on Hadoop. In
 this case we wont need to get the output clusters to our local machines.
 The utility will read the output clusters present in HDFS and output the
 human-readable cluster values into our local file system. Say you've just
 executed the <a href="clustering-of-synthetic-control-data.html">synthetic control example </a>
- and want to analyze the output, you can execute</p>
-<h3 id="standalone-java-program-anchorstandalonejavaprogram">Standalone Java Program {anchor:StandaloneJavaProgram}</h3>
-<p>ClusterDumper can be run using CLI. If your HADOOP_HOME environment
-variable is not set, you can execute ClusterDumper using "mahout" command
-line utility.</p>
-<p>Get the output data from hadoop into your local machine. For example, in
-the case where you've executed a clustering example use</p>
-<p>This will create a folder called output inside your $MAHOUT_HOME/examples
-and will have sub-folders for each cluster outputs and ClusteredPoints</p>
-<p>Run the clusterdump utility as follows as a standalone Java Program through Eclipse - if you are using eclipse, setup mahout-utils as a project as specified in <a href="../../developers/buildingmahout.html">Working with Maven in Eclipse</a>.
+ and want to analyze the output, you can execute the <code>mahout clusterdumper</code> utility from the command line.</p>
+<h4 id="cli-options">CLI options:</h4>
+<div class="codehilite"><pre><span class="o">--</span><span class="n">help</span>                               <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span> 
+<span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span>                   <span class="n">The</span> <span class="n">directory</span> <span class="n">containing</span> <span class="n">Sequence</span>
+                                       <span class="n">Files</span> <span class="k">for</span> <span class="n">the</span> <span class="n">Clusters</span>       
+<span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span>                 <span class="n">The</span> <span class="n">output</span> <span class="n">file</span><span class="p">.</span>  <span class="n">If</span> <span class="n">not</span> <span class="n">specified</span><span class="p">,</span>
+                                       <span class="n">dumps</span> <span class="n">to</span> <span class="n">the</span> <span class="n">console</span><span class="p">.</span>
+<span class="o">--</span><span class="n">outputFormat</span> <span class="p">(</span><span class="o">-</span><span class="n">of</span><span class="p">)</span> <span class="n">outputFormat</span>    <span class="n">The</span> <span class="n">optional</span> <span class="n">output</span> <span class="n">format</span> <span class="n">to</span> <span class="n">write</span>
+                                       <span class="n">the</span> <span class="n">results</span> <span class="n">as</span><span class="p">.</span> <span class="n">Options</span><span class="p">:</span> <span class="n">TEXT</span><span class="p">,</span> <span class="n">CSV</span><span class="p">,</span> <span class="n">or</span> <span class="n">GRAPH_ML</span>       
+<span class="o">--</span><span class="n">substring</span> <span class="p">(</span><span class="o">-</span><span class="n">b</span><span class="p">)</span> <span class="n">substring</span>           <span class="n">The</span> <span class="n">number</span> <span class="n">of</span> <span class="n">chars</span> <span class="n">of</span> <span class="n">the</span>     
+                       <span class="n">asFormatString</span><span class="p">()</span> <span class="n">to</span> <span class="n">print</span>    
+<span class="o">--</span><span class="n">pointsDir</span> <span class="p">(</span><span class="o">-</span><span class="n">p</span><span class="p">)</span> <span class="n">pointsDir</span>           <span class="n">The</span> <span class="n">directory</span> <span class="n">containing</span> <span class="n">points</span>  
+                                       <span class="n">sequence</span> <span class="n">files</span> <span class="n">mapping</span> <span class="n">input</span> <span class="n">vectors</span>
+                                       <span class="n">to</span> <span class="n">their</span> <span class="n">cluster</span><span class="p">.</span>  <span class="n">If</span> <span class="n">specified</span><span class="p">,</span> 
+                                       <span class="n">then</span> <span class="n">the</span> <span class="n">program</span> <span class="n">will</span> <span class="n">output</span> <span class="n">the</span> 
+                                       <span class="n">points</span> <span class="n">associated</span> <span class="n">with</span> <span class="n">a</span> <span class="n">cluster</span> 
+<span class="o">--</span><span class="n">dictionary</span> <span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">)</span> <span class="n">dictionary</span>         <span class="n">The</span> <span class="n">dictionary</span> <span class="n">file</span><span class="p">.</span>
+<span class="o">--</span><span class="n">dictionaryType</span> <span class="p">(</span><span class="o">-</span><span class="n">dt</span><span class="p">)</span> <span class="n">dictionaryType</span>    <span class="n">The</span> <span class="n">dictionary</span> <span class="n">file</span> <span class="n">type</span>       
+                                     <span class="p">(</span><span class="n">text</span><span class="o">|</span><span class="n">sequencefile</span><span class="p">)</span>
+<span class="o">--</span><span class="n">distanceMeasure</span> <span class="p">(</span><span class="o">-</span><span class="n">dm</span><span class="p">)</span> <span class="n">distanceMeasure</span>  <span class="n">The</span> <span class="n">classname</span> <span class="n">of</span> <span class="n">the</span> <span class="n">DistanceMeasure</span><span class="p">.</span>
+                                           <span class="n">Default</span> <span class="n">is</span> <span class="n">SquaredEuclidean</span><span class="p">.</span>
+<span class="o">--</span><span class="n">numWords</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">numWords</span>             <span class="n">The</span> <span class="n">number</span> <span class="n">of</span> <span class="n">top</span> <span class="n">terms</span> <span class="n">to</span> <span class="n">print</span> 
+<span class="o">--</span><span class="n">tempDir</span> <span class="n">tempDir</span>                    <span class="n">Intermediate</span> <span class="n">output</span> <span class="n">directory</span>
+<span class="o">--</span><span class="n">startPhase</span> <span class="n">startPhase</span>              <span class="n">First</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span>
+<span class="o">--</span><span class="n">endPhase</span> <span class="n">endPhase</span>                  <span class="n">Last</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span>
+<span class="o">--</span><span class="n">evaluate</span> <span class="p">(</span><span class="o">-</span><span class="n">e</span><span class="p">)</span>                      <span class="n">Run</span> <span class="n">ClusterEvaluator</span> <span class="n">and</span> <span class="n">CDbwEvaluator</span> <span class="n">over</span> <span class="n">the</span>
+                                      <span class="n">input</span><span class="p">.</span> <span class="n">The</span> <span class="n">output</span> <span class="n">will</span> <span class="n">be</span> <span class="n">appended</span> <span class="n">to</span> <span class="n">the</span> <span class="n">rest</span> <span class="n">of</span>
+                                      <span class="n">the</span> <span class="n">output</span> <span class="n">at</span> <span class="n">the</span> <span class="k">end</span><span class="p">.</span>
+</pre></div>
+
+
+<h3 id="standalone-java-program">Standalone Java Program</h3>
+<p>Run the clusterdump utility as follows as a standalone Java Program through Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as specified in <a href="../../developers/buildingmahout.html">Working with Maven in Eclipse</a>. -->
     To execute ClusterDumper.java,</p>
 <ul>
 <li>Under mahout-utils, Right-Click on ClusterDumper.java</li>