Posted to commits@mahout.apache.org by bu...@apache.org on 2014/05/02 20:00:38 UTC

svn commit: r907792 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html

Author: buildbot
Date: Fri May  2 18:00:37 2014
New Revision: 907792

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri May  2 18:00:37 2014
@@ -1 +1 @@
-1591731
+1591989

Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Fri May  2 18:00:37 2014
@@ -236,7 +236,6 @@
   <div id="content-wrap" class="clearfix">
    <div id="main">
     <h1 id="creating-vectors-from-text">Creating vectors from text</h1>
-<p>available starting <em>Mahout_0.2</em></p>
 <p><a name="CreatingVectorsfromText-Introduction"></a></p>
 <h1 id="introduction">Introduction</h1>
 <p>For clustering and classifying documents it is usually necessary to convert the raw text
@@ -254,10 +253,10 @@ representations from a Lucene (and Solr,
 <p>For this, we assume you know how to build a Lucene/Solr index.  For those
 who don't, it is probably easiest to get up and running using <a href="http://lucene.apache.org/solr">Solr</a>
  as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index.  For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/java">Lucene website</a>
+index.  For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/core">Lucene website</a>
  or check out <em>Lucene In Action</em> by Erik Hatcher, Otis Gospodnetic and Mike
 McCandless.</p>
-<p>To get started, make sure you get a fresh copy of Mahout from <a href="../developers/buildingmahout.html">SVN</a>
+<p>To get started, make sure you get a fresh copy of Mahout from <a href="http://mahout.apache.org/developers/buildingmahout.html">SVN</a>
  and are comfortable building it. It defines interfaces and implementations
 for efficiently iterating over a Data Source (it only supports Lucene
 currently, but should be extensible to databases, Solr, etc.) and produces
@@ -267,28 +266,69 @@ in the org.apache.mahout.utils.vectors p
 several input options, which can be displayed by specifying the --help
 option.  Examples of running the Driver are included below:</p>
 <p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
-<h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout lucene.vector <span class="o">&lt;</span>PATH TO DIRECTORY CONTAINING LUCENE INDEX<span class="o">&gt;</span>
-
-    <span class="o">--</span>output <span class="o">&lt;</span>PATH TO OUTPUT LOCATION<span class="o">&gt;</span>
-
-   <span class="o">--</span>field <span class="o">&lt;</span>NAME OF FIELD IN INDEX<span class="o">&gt;</span>
-
-   <span class="o">--</span>dictOut <span class="o">&lt;</span>PATH TO FILE TO OUTPUT THE DICTIONARY TO<span class="o">&gt;</span>
-
-   <span class="o">&lt;--</span>max <span class="o">&lt;</span>Number of vectors to output<span class="o">&gt;&gt;</span> <span class="o">&lt;--</span>norm <span class="p">{</span>INF<span class="o">|</span>integer <span class="o">&gt;=</span> <span class="m">0</span><span class="p">}</span><span class="o">&gt;</span>
-
-   <span class="o">&lt;--</span>idField <span class="o">&lt;</span>Name of the idField in the Lucene index<span class="o">&gt;&gt;</span>
+<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> 
+    <span class="o">--</span><span class="n">dir</span> <span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">)</span> <span class="n">dir</span>                     <span class="n">The</span> <span class="n">Lucene</span> <span class="n">directory</span>      
+    <span class="o">--</span><span class="n">idField</span> <span class="n">idField</span>                  <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span>    
+                                           <span class="n">containing</span> <span class="n">the</span> <span class="n">index</span><span class="p">.</span>  <span class="n">If</span> 
+                                           <span class="n">null</span><span class="p">,</span> <span class="n">then</span> <span class="n">the</span> <span class="n">Lucene</span>     
+                                           <span class="n">internal</span> <span class="n">doc</span> <span class="n">id</span> <span class="n">is</span> <span class="n">used</span>   
+                                           <span class="n">which</span> <span class="n">is</span> <span class="n">prone</span> <span class="n">to</span> <span class="n">error</span>   
+                                           <span class="k">if</span> <span class="n">the</span> <span class="n">underlying</span> <span class="n">index</span>   
+                                           <span class="n">changes</span>                   
+    <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span>               <span class="n">The</span> <span class="n">output</span> <span class="n">file</span>           
+    <span class="o">--</span><span class="n">delimiter</span> <span class="p">(</span><span class="o">-</span><span class="n">l</span><span class="p">)</span> <span class="n">delimiter</span>         <span class="n">The</span> <span class="n">delimiter</span> <span class="k">for</span>         
+                                           <span class="n">outputting</span> <span class="n">the</span> <span class="n">dictionary</span> 
+    <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span>                        <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>            
+    <span class="o">--</span><span class="n">field</span> <span class="p">(</span><span class="o">-</span><span class="n">f</span><span class="p">)</span> <span class="n">field</span>                 <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span>    
+    <span class="o">--</span><span class="n">max</span> <span class="p">(</span><span class="o">-</span><span class="n">m</span><span class="p">)</span> <span class="n">max</span>                         <span class="n">The</span> <span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span>     
+                                           <span class="n">vectors</span> <span class="n">to</span> <span class="n">output</span><span class="p">.</span>  <span class="n">If</span>    
+                                           <span class="n">not</span> <span class="n">specified</span><span class="p">,</span> <span class="n">then</span> <span class="n">it</span>    
+                                           <span class="n">will</span> <span class="n">loop</span> <span class="n">over</span> <span class="n">all</span> <span class="n">docs</span>   
+    <span class="o">--</span><span class="n">dictOut</span> <span class="p">(</span><span class="o">-</span><span class="n">t</span><span class="p">)</span> <span class="n">dictOut</span>             <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span>         
+                                           <span class="n">dictionary</span>                
+    <span class="o">--</span><span class="n">seqDictOut</span> <span class="p">(</span><span class="o">-</span><span class="n">st</span><span class="p">)</span> <span class="n">seqDictOut</span>      <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span>         
+                                           <span class="n">dictionary</span> <span class="n">as</span> <span class="n">sequence</span>    
+                                           <span class="n">file</span>                      
+    <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span>                   <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span>          
+                                           <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span>     
+                                           <span class="n">double</span> <span class="n">or</span> &quot;<span class="n">INF</span>&quot; <span class="k">if</span> <span class="n">you</span>    
+                                           <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span> <span class="n">Infinite</span>  
+                                           <span class="n">norm</span><span class="p">.</span>  <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span> 
+                                           <span class="n">equal</span> <span class="n">to</span> 0<span class="p">.</span>  <span class="n">The</span> <span class="n">default</span>  
+                                           <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span>       
+    <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span>   <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span>     
+                                           <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span class="p">.</span>  <span class="n">Can</span> <span class="n">be</span>  
+                                           <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span>     
+                                           <span class="n">high</span> <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span>     
+                                           <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span>   
+                                           <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span>        
+                                           <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span>            
+    <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">w</span><span class="p">)</span> <span class="n">weight</span>               <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span>     
+                                           <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span> <span class="n">or</span>      
+                                           <span class="n">TFIDF</span>                     
+    <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span>                <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span>      
+                                           <span class="n">frequency</span><span class="p">.</span>  <span class="n">Default</span> <span class="n">is</span> 1  
+    <span class="o">--</span><span class="n">maxPercentErrorDocs</span> <span class="p">(</span><span class="o">-</span><span class="n">err</span><span class="p">)</span> <span class="n">mErr</span>  <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span>     
+                                           <span class="n">docs</span> <span class="n">that</span> <span class="n">can</span> <span class="n">have</span> <span class="n">a</span> <span class="n">null</span> 
+                                           <span class="n">term</span> <span class="n">vector</span><span class="p">.</span> <span class="n">These</span> <span class="n">are</span>    
+                                           <span class="n">noise</span> <span class="n">document</span> <span class="n">and</span> <span class="n">can</span>    
+                                           <span class="n">occur</span> <span class="k">if</span> <span class="n">the</span> <span class="n">analyzer</span>     
+                                           <span class="n">used</span> <span class="n">strips</span> <span class="n">out</span> <span class="n">all</span> <span class="n">terms</span> 
+                                           <span class="n">in</span> <span class="n">the</span> <span class="n">target</span> <span class="n">field</span><span class="p">.</span> <span class="n">This</span> 
+                                           <span class="n">percentage</span> <span class="n">is</span> <span class="n">expressed</span>   
+                                           <span class="n">as</span> <span class="n">a</span> <span class="n">value</span> <span class="n">between</span> 0 <span class="n">and</span>  
+                                           1<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> 0<span class="p">.</span>
 </pre></div>
 
 
-<p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p>
-<h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
-
+<h4 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span>
+    <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> 
+    <span class="o">--</span><span class="n">field</span> <span class="n">body</span> 
     <span class="o">--</span><span class="n">dictOut</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
-
-    <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50
+    <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> 
+    <span class="o">--</span><span class="n">max</span> 50
 </pre></div>
 
 
@@ -296,83 +336,126 @@ option.  Examples of running the Driver 
 out the info to the output dir and the dictionary to dict.txt.  It only
 outputs 50 vectors.  If you don't specify --max, then all the documents in
 the index are output.</p>
-<p><a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe<a href="http://en.wikipedia.org/wiki/Lp_space">L_2Norm</a>"></a></p>
-<h3 id="normalize-50-vectors-from-a-lucene-index-using-the-l_2-normhttpenwikipediaorgwikilp_space">Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
-
-      <span class="o">--</span><span class="n">dictOut</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
-
-      <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50 <span class="o">--</span><span class="n">norm</span> 2
+<p><a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a></p>
+<h4 id="creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Creating 50 Normalized Vectors from a Lucene Index using the <a href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> 
+    <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> 
+    <span class="o">--</span><span class="n">field</span> <span class="n">body</span> 
+    <span class="o">--</span><span class="n">dictOut</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
+    <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> 
+    <span class="o">--</span><span class="n">max</span> 50 
+    <span class="o">--</span><span class="n">norm</span> 2
 </pre></div>
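For readers unsure what `--norm 2` does to each output vector, here is a minimal Python sketch of L_p normalization. It illustrates the math only and is not Mahout's implementation. The function name `lp_normalize` is hypothetical, and the sketch covers only p > 0 and the infinite norm.

```python
import math


def lp_normalize(vec, p):
    """Return vec divided by its L_p norm (p > 0, or math.inf).

    Hypothetical helper for illustration only; not part of Mahout.
    """
    if not vec:
        return []
    if p == math.inf:
        # Infinite norm: the largest absolute component.
        norm = max(abs(x) for x in vec)
    elif p > 0:
        # General L_p norm: (sum |x_i|^p)^(1/p).
        norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    else:
        raise ValueError("this sketch only handles p > 0 or math.inf")
    if norm == 0:
        # An all-zero vector cannot be scaled; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]


if __name__ == "__main__":
    print(lp_normalize([3.0, 4.0], 2))
    print(lp_normalize([2.0, -4.0], math.inf))
```

With p = 2 every non-zero vector ends up with unit Euclidean length, which is the usual reason for normalizing before distance-based clustering.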
 
 
 <p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
-<h1 id="from-directory-of-text-documents">From Directory of Text documents</h1>
+<h2 id="from-a-directory-of-text-documents">From A Directory of Text documents</h2>
 <p>Mahout has utilities to generate Vectors from a directory of text
 documents. Before creating the vectors, you need to convert the documents
 to SequenceFile format. SequenceFile is a Hadoop class which allows us to
 write arbitrary key/value pairs into it. The DocumentVectorizer requires the
 key to be a Text with a unique document id, and value to be the Text
 content in UTF-8 format.</p>
-<p>You may find Tika (http://lucene.apache.org/tika) helpful in converting
+<p>You may find <a href="http://tika.apache.org/">Tika</a> helpful in converting
 binary documents to text.</p>
 <p><a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a></p>
-<h2 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h2>
+<h4 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h4>
 <p>Mahout has a utility that reads a directory path, including its
 sub-directories, and creates the SequenceFile in chunks.
 The document id generated is &lt;PREFIX&gt;&lt;RELATIVE PATH FROM
 PARENT&gt;/document.txt</p>
-<p>From the examples directory run</p>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seqdirectory
-
-<span class="o">--</span>input <span class="o">&lt;</span>PARENT DIR WHERE DOCS ARE LOCATED<span class="o">&gt;</span> <span class="o">--</span>output <span class="o">&lt;</span>OUTPUT DIRECTORY<span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>c <span class="o">&lt;</span>CHARSET NAME OF THE INPUT DOCUMENTS<span class="o">&gt;</span> <span class="p">{</span>UTF<span class="o">-</span><span class="m">8</span><span class="o">|</span>cp1252<span class="o">|</span>ascii...<span class="p">}</span><span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>chunk <span class="o">&lt;</span>MAX SIZE OF EACH CHUNK in Megabytes<span class="o">&gt;</span> <span class="m">64</span><span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>prefix <span class="o">&lt;</span>PREFIX TO ADD TO THE DOCUMENT ID<span class="o">&gt;&gt;</span>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seqdirectory</span> 
+    <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span>                       <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span>   
+    <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span>                     <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span>     
+                                                 <span class="n">output</span><span class="p">.</span>                        
+    <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span>                        <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span>      
+                                                 <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span>        
+                                                 <span class="n">running</span> <span class="n">job</span>                    
+    <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span>                    <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span>   
+                                                 <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span>       
+                                                 <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span>           
+    <span class="o">--</span><span class="n">chunkSize</span> <span class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span>           <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span>    
+                                                 <span class="n">Defaults</span> <span class="n">to</span> 64                 
+    <span class="o">--</span><span class="n">fileFilterClass</span> <span class="p">(</span><span class="o">-</span><span class="n">filter</span><span class="p">)</span> <span class="n">fFilterClass</span> <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">class</span> <span class="n">to</span> <span class="n">use</span>   
+                                                 <span class="k">for</span> <span class="n">file</span> <span class="n">parsing</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span>     
+                                                 <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">PrefixAdditionFilter</span>                   
+    <span class="o">--</span><span class="n">keyPrefix</span> <span class="p">(</span><span class="o">-</span><span class="n">prefix</span><span class="p">)</span> <span class="n">keyPrefix</span>          <span class="n">The</span> <span class="n">prefix</span> <span class="n">to</span> <span class="n">be</span> <span class="n">prepended</span> <span class="n">to</span>  
+                                                 <span class="n">the</span> <span class="n">key</span>                        
+    <span class="o">--</span><span class="n">charset</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">charset</span>                   <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">character</span>      
+                                                 <span class="n">encoding</span> <span class="n">of</span> <span class="n">the</span> <span class="n">input</span> <span class="n">files</span><span class="p">.</span>   
+                                                 <span class="n">Default</span> <span class="n">to</span> <span class="n">UTF</span><span class="o">-</span>8 <span class="p">{</span><span class="n">accepts</span><span class="p">:</span> <span class="n">cp1252</span><span class="o">|</span><span class="n">ascii</span><span class="p">...}</span>             
+    <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span>                    <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span>   
+                                                 <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span>       
+                                             <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span>           
+    <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span>                        <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span>      
+                                                 <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span>        
+                                                 <span class="n">running</span> <span class="n">job</span>                    
+    <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span>                              <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>                 
+    <span class="o">--</span><span class="n">tempDir</span> <span class="n">tempDir</span>                        <span class="n">Intermediate</span> <span class="n">output</span> <span class="n">directory</span>  
+    <span class="o">--</span><span class="n">startPhase</span> <span class="n">startPhase</span>                  <span class="n">First</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span>             
+    <span class="o">--</span><span class="n">endPhase</span> <span class="n">endPhase</span>                      <span class="n">Last</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span>  <span class="o">&gt;</span>
 </pre></div>
 
 
 <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
-<h2 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h2>
-<p>+<em>Mahout_0.3</em>+</p>
+<h4 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h4>
 <p>From the sequence file generated from the above step run the following to
 generate vectors. </p>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seq2sparse
-
-<span class="o">-</span>i <span class="o">&lt;</span>PATH TO THE SEQUENCEFILES<span class="o">&gt;</span>
-
-<span class="o">-</span>o <span class="o">&lt;</span>OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED<span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>wt <span class="o">&lt;</span>WEIGHTING METHOD USED<span class="o">&gt;</span> <span class="p">{</span>tf<span class="o">|</span>tfidf<span class="p">}</span><span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>chunk <span class="o">&lt;</span>MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY<span class="o">&gt;</span> <span class="m">100</span><span class="o">&gt;</span>
-
-<span class="o">&lt;-</span>a <span class="o">&lt;</span>NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT<span class="o">&gt;</span>
-
-org.apache.lucene.analysis.standard.StandardAnalyzer<span class="o">&gt;</span>
-
-<span class="o">&lt;--</span>minSupport <span class="o">&lt;</span>MINIMUM SUPPORT<span class="o">&gt;</span> <span class="m">2</span><span class="o">&gt;</span>
-
-<span class="o">&lt;--</span>minDF <span class="o">&lt;</span>MINIMUM DOCUMENT FREQUENCY<span class="o">&gt;</span> <span class="m">1</span><span class="o">&gt;</span>
-
-<span class="o">&lt;--</span>maxDFPercent <span class="o">&lt;</span>MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN <span class="m">0</span><span class="o">-</span><span class="m">100</span><span class="o">&gt;</span> <span class="m">99</span><span class="o">&gt;</span>
-
-<span class="o">&lt;--</span>norm <span class="o">&lt;</span>REFER TO L_2 NORM ABOVE<span class="o">&gt;</span><span class="p">{</span>INF<span class="o">|</span>integer <span class="o">&gt;=</span> <span class="m">0</span><span class="p">}</span><span class="o">&gt;</span><span class="s">&quot;</span>
-
-<span class="s">&lt;-seq &lt;Create SequentialAccessVectors&gt;{false|true required for running some algorithms(LDA,Lanczos)}&gt;&quot;</span>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seq2sparse</span>
+    <span class="o">--</span><span class="n">minSupport</span> <span class="p">(</span><span class="o">-</span><span class="n">s</span><span class="p">)</span> <span class="n">minSupport</span>      <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Minimum</span> <span class="n">Support</span><span class="p">.</span> <span class="n">Default</span>       
+                                          <span class="n">Value</span><span class="p">:</span> 2                                  
+    <span class="o">--</span><span class="n">analyzerName</span> <span class="p">(</span><span class="o">-</span><span class="n">a</span><span class="p">)</span> <span class="n">analyzerName</span>  <span class="n">The</span> <span class="n">class</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">analyzer</span>            
+    <span class="o">--</span><span class="n">chunkSize</span> <span class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span>    <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span> <span class="n">Default</span>       
+                                          <span class="n">Value</span><span class="p">:</span> 100<span class="n">MB</span>                              
+    <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span>              <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> <span class="n">output</span><span class="p">.</span>        
+    <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span>                <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span>              
+    <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span>               <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span> <span class="n">frequency</span><span class="p">.</span>  <span class="n">Default</span>  
+                                          <span class="n">is</span> 1                                      
+    <span class="o">--</span><span class="n">maxDFSigma</span> <span class="p">(</span><span class="o">-</span><span class="n">xs</span><span class="p">)</span> <span class="n">maxDFSigma</span>     <span class="n">What</span> <span class="n">portion</span> <span class="n">of</span> <span class="n">the</span> <span class="n">tf</span> <span class="p">(</span><span class="n">tf</span><span class="o">-</span><span class="n">idf</span><span class="p">)</span> <span class="n">vectors</span>   
+                                          <span class="n">to</span> <span class="n">be</span> <span class="n">used</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">in</span> <span class="n">times</span> <span class="n">the</span>        
+                                          <span class="n">standard</span> <span class="n">deviation</span> <span class="p">(</span><span class="n">sigma</span><span class="p">)</span> <span class="n">of</span> <span class="n">the</span>         
+                                          <span class="n">document</span> <span class="n">frequencies</span> <span class="n">of</span> <span class="n">these</span> <span class="n">vectors</span><span class="p">.</span>    
+                                          <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span>         
+                                          <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">a</span> <span class="n">double</span>    
+                                          <span class="n">value</span><span class="p">.</span> <span class="n">Good</span> <span class="n">value</span> <span class="n">to</span> <span class="n">be</span> <span class="n">specified</span> <span class="n">is</span> 3<span class="p">.</span>0<span class="p">.</span> 
+                                          <span class="n">In</span> <span class="k">case</span> <span class="n">the</span> <span class="n">value</span> <span class="n">is</span> <span class="n">less</span> <span class="n">than</span> 0 <span class="n">no</span>       
+                                          <span class="n">vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">filtered</span> <span class="n">out</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span>  
+                                          <span class="o">-</span>1<span class="p">.</span>0<span class="p">.</span>  <span class="n">Overrides</span> <span class="n">maxDFPercent</span>             
+    <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span>  <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span> <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span class="p">.</span>    
+                                          <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span>         
+                                          <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span>  
+                                          <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span>  <span class="n">If</span>     
+                                          <span class="n">maxDFSigma</span> <span class="n">is</span> <span class="n">also</span> <span class="n">set</span><span class="p">,</span> <span class="n">it</span> <span class="n">will</span> <span class="n">override</span>  
+                                          <span class="n">this</span> <span class="n">value</span><span class="p">.</span>                               
+    <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">wt</span><span class="p">)</span> <span class="n">weight</span>             <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span> <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span>   
+                                          <span class="n">or</span> <span class="n">TFIDF</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span> <span class="n">TFIDF</span>                  
+    <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span>                  <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span>    
+                                          <span class="n">float</span> <span class="n">or</span> &quot;<span class="n">INF</span>&quot; <span class="k">if</span> <span class="n">you</span> <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span>     
+                                          <span class="n">Infinite</span> <span class="n">norm</span><span class="p">.</span>  <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span> <span class="n">equal</span>  
+                                          <span class="n">to</span> 0<span class="p">.</span>  <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span>    
+    <span class="o">--</span><span class="n">minLLR</span> <span class="p">(</span><span class="o">-</span><span class="n">ml</span><span class="p">)</span> <span class="n">minLLR</span>             <span class="p">(</span><span class="n">Optional</span><span class="p">)</span><span class="n">The</span> <span class="n">minimum</span> <span class="n">Log</span> <span class="n">Likelihood</span>      
+                                          <span class="n">Ratio</span><span class="p">(</span><span class="n">Float</span><span class="p">)</span>  <span class="n">Default</span> <span class="n">is</span> 1<span class="p">.</span>0              
+    <span class="o">--</span><span class="n">numReducers</span> <span class="p">(</span><span class="o">-</span><span class="n">nr</span><span class="p">)</span> <span class="n">numReducers</span>   <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Number</span> <span class="n">of</span> <span class="n">reduce</span> <span class="n">tasks</span><span class="p">.</span>        
+                                          <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span> 1                          
+    <span class="o">--</span><span class="n">maxNGramSize</span> <span class="p">(</span><span class="o">-</span><span class="n">ng</span><span class="p">)</span> <span class="n">ngramSize</span>    <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">The</span> <span class="n">maximum</span> <span class="nb">size</span> <span class="n">of</span> <span class="n">ngrams</span> <span class="n">to</span>  
+                                          <span class="n">create</span> <span class="p">(</span>2 <span class="p">=</span> <span class="n">bigrams</span><span class="p">,</span> 3 <span class="p">=</span> <span class="n">trigrams</span><span class="p">,</span> <span class="n">etc</span><span class="p">)</span>   
+                                          <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span>1                           
+    <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span>                 <span class="n">If</span> <span class="n">set</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span> <span class="n">directory</span>    
+    <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span>                           <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>                            
+    <span class="o">--</span><span class="n">sequentialAccessVector</span> <span class="p">(</span><span class="o">-</span><span class="n">seq</span><span class="p">)</span>   <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>  
+                                          <span class="n">be</span> <span class="n">SequentialAccessVectors</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> <span class="n">false</span><span class="p">;</span>
+                                          <span class="n">true</span> <span class="n">required</span> <span class="k">for</span> <span class="n">running</span> <span class="n">some</span> <span class="n">algorithms</span>
+                                          <span class="p">(</span><span class="n">LDA</span><span class="p">,</span><span class="n">Lanczos</span><span class="p">)</span>                                
+    <span class="o">--</span><span class="n">namedVector</span> <span class="p">(</span><span class="o">-</span><span class="n">nv</span><span class="p">)</span>               <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>  
+                                          <span class="n">be</span> <span class="n">NamedVectors</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span>   
+    <span class="o">--</span><span class="n">logNormalize</span> <span class="p">(</span><span class="o">-</span><span class="n">lnorm</span><span class="p">)</span>           <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>  
+                                          <span class="n">be</span> <span class="n">logNormalize</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span>
 </pre></div>
 
 
-<p>--minSupport is the min frequency for the word to  be considered as a
-feature. --minDF is the min number of documents the word needs to be in
---maxDFPercent is the max value of the expression (document frequency of a
-word/total number of document) to be considered as good feature to be in
-the document. This helps remove high frequency features like stop words</p>
-<p><a name="CreatingVectorsfromText-Background"></a></p>
-<h1 id="background">Background</h1>
+<p>--minSupport is the minimum frequency a word must have to be considered as a feature. --minDF is the minimum number of documents the word needs to appear in. --maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents), expressed as a percentage between 0 and 100, for the word to be considered a good feature. These options filter the vocabulary; --maxDFPercent in particular helps remove high-frequency features such as stop words.
+<a name="CreatingVectorsfromText-Background"></a></p>
+<h2 id="background">Background</h2>
 <ul>
 <li><a href="http://markmail.org/thread/l5zi3yk446goll3o">Discussion on centroid calculations with sparse vectors</a></li>
 </ul>