Posted to commits@mahout.apache.org by bu...@apache.org on 2014/05/02 20:00:38 UTC
svn commit: r907792 - in /websites/staging/mahout/trunk/content: ./
users/basics/creating-vectors-from-text.html
Author: buildbot
Date: Fri May 2 18:00:37 2014
New Revision: 907792
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Fri May 2 18:00:37 2014
@@ -1 +1 @@
-1591731
+1591989
Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Fri May 2 18:00:37 2014
@@ -236,7 +236,6 @@
<div id="content-wrap" class="clearfix">
<div id="main">
<h1 id="creating-vectors-from-text">Creating vectors from text</h1>
-<p>available starting <em>Mahout_0.2</em></p>
<p><a name="CreatingVectorsfromText-Introduction"></a></p>
<h1 id="introduction">Introduction</h1>
<p>For clustering and classifying documents it is usually necessary to convert the raw text
@@ -254,10 +253,10 @@ representations from a Lucene (and Solr,
<p>For this, we assume you know how to build a Lucene/Solr index. For those
who don't, it is probably easiest to get up and running using <a href="http://lucene.apache.org/solr">Solr</a>
as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index. For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/java">Lucene website</a>
+index. For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/core">Lucene website</a>
or check out <em>Lucene In Action</em> by Erik Hatcher, Otis Gospodnetic and Mike
McCandless.</p>
-<p>To get started, make sure you get a fresh copy of Mahout from <a href="../developers/buildingmahout.html">SVN</a>
+<p>To get started, make sure you get a fresh copy of Mahout from <a href="http://mahout.apache.org/developers/buildingmahout.html">SVN</a>
and are comfortable building it. It defines interfaces and implementations
for efficiently iterating over a Data Source (it only supports Lucene
currently, but should be extensible to databases, Solr, etc.) and produces
@@ -267,28 +266,69 @@ in the org.apache.mahout.utils.vectors p
several input options, which can be displayed by specifying the --help
option. Examples of running the Driver are included below:</p>
<p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
-<h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout lucene.vector <span class="o"><</span>PATH TO DIRECTORY CONTAINING LUCENE INDEX<span class="o">></span>
-
- <span class="o">--</span>output <span class="o"><</span>PATH TO OUTPUT LOCATION<span class="o">></span>
-
- <span class="o">--</span>field <span class="o"><</span>NAME OF FIELD IN INDEX<span class="o">></span>
-
- <span class="o">--</span>dictOut <span class="o"><</span>PATH TO FILE TO OUTPUT THE DICTIONARY TO<span class="o">></span>
-
- <span class="o"><--</span>max <span class="o"><</span>Number of vectors to output<span class="o">>></span> <span class="o"><--</span>norm <span class="p">{</span>INF<span class="o">|</span>integer <span class="o">>=</span> <span class="m">0</span><span class="p">}</span><span class="o">></span>
-
- <span class="o"><--</span>idField <span class="o"><</span>Name of the idField in the Lucene index<span class="o">>></span>
+<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span>
+ <span class="o">--</span><span class="n">dir</span> <span class="p">(</span><span class="o">-</span><span class="n">d</span><span class="p">)</span> <span class="n">dir</span> <span class="n">The</span> <span class="n">Lucene</span> <span class="n">directory</span>
+ <span class="o">--</span><span class="n">idField</span> <span class="n">idField</span> <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span>
+ <span class="n">containing</span> <span class="n">the</span> <span class="n">index</span><span class="p">.</span> <span class="n">If</span>
+ <span class="n">null</span><span class="p">,</span> <span class="n">then</span> <span class="n">the</span> <span class="n">Lucene</span>
+ <span class="n">internal</span> <span class="n">doc</span> <span class="n">id</span> <span class="n">is</span> <span class="n">used</span>
+ <span class="n">which</span> <span class="n">is</span> <span class="n">prone</span> <span class="n">to</span> <span class="n">error</span>
+ <span class="k">if</span> <span class="n">the</span> <span class="n">underlying</span> <span class="n">index</span>
+ <span class="n">changes</span>
+ <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">output</span> <span class="n">file</span>
+ <span class="o">--</span><span class="n">delimiter</span> <span class="p">(</span><span class="o">-</span><span class="n">l</span><span class="p">)</span> <span class="n">delimiter</span> <span class="n">The</span> <span class="n">delimiter</span> <span class="k">for</span>
+ <span class="n">outputting</span> <span class="n">the</span> <span class="n">dictionary</span>
+ <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
+ <span class="o">--</span><span class="n">field</span> <span class="p">(</span><span class="o">-</span><span class="n">f</span><span class="p">)</span> <span class="n">field</span> <span class="n">The</span> <span class="n">field</span> <span class="n">in</span> <span class="n">the</span> <span class="n">index</span>
+ <span class="o">--</span><span class="n">max</span> <span class="p">(</span><span class="o">-</span><span class="n">m</span><span class="p">)</span> <span class="n">max</span> <span class="n">The</span> <span class="n">maximum</span> <span class="n">number</span> <span class="n">of</span>
+ <span class="n">vectors</span> <span class="n">to</span> <span class="n">output</span><span class="p">.</span> <span class="n">If</span>
+ <span class="n">not</span> <span class="n">specified</span><span class="p">,</span> <span class="n">then</span> <span class="n">it</span>
+ <span class="n">will</span> <span class="n">loop</span> <span class="n">over</span> <span class="n">all</span> <span class="n">docs</span>
+ <span class="o">--</span><span class="n">dictOut</span> <span class="p">(</span><span class="o">-</span><span class="n">t</span><span class="p">)</span> <span class="n">dictOut</span> <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span>
+ <span class="n">dictionary</span>
+ <span class="o">--</span><span class="n">seqDictOut</span> <span class="p">(</span><span class="o">-</span><span class="n">st</span><span class="p">)</span> <span class="n">seqDictOut</span> <span class="n">The</span> <span class="n">output</span> <span class="n">of</span> <span class="n">the</span>
+ <span class="n">dictionary</span> <span class="n">as</span> <span class="n">sequence</span>
+ <span class="n">file</span>
+ <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span> <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span>
+ <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span>
+ <span class="n">double</span> <span class="n">or</span> "<span class="n">INF</span>" <span class="k">if</span> <span class="n">you</span>
+ <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span> <span class="n">Infinite</span>
+ <span class="n">norm</span><span class="p">.</span> <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span>
+ <span class="n">equal</span> <span class="n">to</span> 0<span class="p">.</span> <span class="n">The</span> <span class="n">default</span>
+ <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span>
+ <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span>
+ <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span class="p">.</span> <span class="n">Can</span> <span class="n">be</span>
+ <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span>
+ <span class="n">high</span> <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span>
+ <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span>
+ <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span>
+ <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span>
+ <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">w</span><span class="p">)</span> <span class="n">weight</span> <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span>
+ <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span> <span class="n">or</span>
+ <span class="n">TFIDF</span>
+ <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span> <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span>
+ <span class="n">frequency</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 1
+ <span class="o">--</span><span class="n">maxPercentErrorDocs</span> <span class="p">(</span><span class="o">-</span><span class="n">err</span><span class="p">)</span> <span class="n">mErr</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span>
+ <span class="n">docs</span> <span class="n">that</span> <span class="n">can</span> <span class="n">have</span> <span class="n">a</span> <span class="n">null</span>
+ <span class="n">term</span> <span class="n">vector</span><span class="p">.</span> <span class="n">These</span> <span class="n">are</span>
+ <span class="n">noise</span> <span class="n">document</span> <span class="n">and</span> <span class="n">can</span>
+ <span class="n">occur</span> <span class="k">if</span> <span class="n">the</span> <span class="n">analyzer</span>
+ <span class="n">used</span> <span class="n">strips</span> <span class="n">out</span> <span class="n">all</span> <span class="n">terms</span>
+ <span class="n">in</span> <span class="n">the</span> <span class="n">target</span> <span class="n">field</span><span class="p">.</span> <span class="n">This</span>
+ <span class="n">percentage</span> <span class="n">is</span> <span class="n">expressed</span>
+ <span class="n">as</span> <span class="n">a</span> <span class="n">value</span> <span class="n">between</span> 0 <span class="n">and</span>
+ 1<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> 0<span class="p">.</span>
</pre></div>
-<p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p>
-<h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
-
+<h4 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span>
+ <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span>
+ <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
<span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
-
- <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50
+ <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span>
+ <span class="o">--</span><span class="n">max</span> 50
</pre></div>
@@ -296,83 +336,126 @@ option. Examples of running the Driver
out the info to the output dir and the dictionary to dict.txt. It only
outputs 50 vectors. If you don't specify --max, then all the documents in
the index are output.</p>
-<p><a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe<a href="http://en.wikipedia.org/wiki/Lp_space">L_2Norm</a>"></a></p>
-<h3 id="normalize-50-vectors-from-a-lucene-index-using-the-l_2-normhttpenwikipediaorgwikilp_space">Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
-
- <span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
-
- <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50 <span class="o">--</span><span class="n">norm</span> 2
+<p><a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a></p>
+<h4 id="creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Creating 50 Normalized Vectors from a Lucene Index using the <a href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span>
+ <span class="o">--</span><span class="n">dir</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span>
+ <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
+ <span class="o">--</span><span class="n">dictOut</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
+ <span class="o">--</span><span class="n">output</span> <span class="o"><</span><span class="n">PATH</span><span class="o">>/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span>
+ <span class="o">--</span><span class="n">max</span> 50
+ <span class="o">--</span><span class="n">norm</span> 2
</pre></div>
<p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
-<h1 id="from-directory-of-text-documents">From Directory of Text documents</h1>
+<h2 id="from-a-directory-of-text-documents">From A Directory of Text documents</h2>
<p>Mahout has utilities to generate Vectors from a directory of text
documents. Before creating the vectors, you need to convert the documents
to SequenceFile format. SequenceFile is a Hadoop class which allows us to
write arbitrary key/value pairs into it. The DocumentVectorizer requires the
key to be a Text with a unique document id, and value to be the Text
content in UTF-8 format.</p>
-<p>You may find Tika (http://lucene.apache.org/tika) helpful in converting
+<p>You may find <a href="http://tika.apache.org/">Tika</a> helpful in converting
binary documents to text.</p>
<p><a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a></p>
-<h2 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h2>
+<h4 id="converting-directory-of-documents-to-sequencefile-format">Converting directory of documents to SequenceFile format</h4>
<p>Mahout has a nifty utility which reads a directory path including its
sub-directories and creates the SequenceFile in a chunked manner for us.
The document id generated is <PREFIX><RELATIVE PATH FROM
PARENT>/document.txt</p>
-<p>From the examples directory run</p>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seqdirectory
-
-<span class="o">--</span>input <span class="o"><</span>PARENT DIR WHERE DOCS ARE LOCATED<span class="o">></span> <span class="o">--</span>output <span class="o"><</span>OUTPUT DIRECTORY<span class="o">></span>
-
-<span class="o"><-</span>c <span class="o"><</span>CHARSET NAME OF THE INPUT DOCUMENTS<span class="o">></span> <span class="p">{</span>UTF<span class="o">-</span><span class="m">8</span><span class="o">|</span>cp1252<span class="o">|</span>ascii...<span class="p">}</span><span class="o">></span>
-
-<span class="o"><-</span>chunk <span class="o"><</span>MAX SIZE OF EACH CHUNK in Megabytes<span class="o">></span> <span class="m">64</span><span class="o">></span>
-
-<span class="o"><-</span>prefix <span class="o"><</span>PREFIX TO ADD TO THE DOCUMENT ID<span class="o">>></span>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seqdirectory</span>
+ <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span>
+ <span class="n">output</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span>
+ <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span>
+ <span class="n">running</span> <span class="n">job</span>
+ <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span> <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span>
+ <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span>
+ <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span>
+ <span class="o">--</span><span class="n">chunkSize</span> <span class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span> <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span>
+ <span class="n">Defaults</span> <span class="n">to</span> 64
+ <span class="o">--</span><span class="n">fileFilterClass</span> <span class="p">(</span><span class="o">-</span><span class="n">filter</span><span class="p">)</span> <span class="n">fFilterClass</span> <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">class</span> <span class="n">to</span> <span class="n">use</span>
+ <span class="k">for</span> <span class="n">file</span> <span class="n">parsing</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span>
+ <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">PrefixAdditionFilter</span>
+ <span class="o">--</span><span class="n">keyPrefix</span> <span class="p">(</span><span class="o">-</span><span class="n">prefix</span><span class="p">)</span> <span class="n">keyPrefix</span> <span class="n">The</span> <span class="n">prefix</span> <span class="n">to</span> <span class="n">be</span> <span class="n">prepended</span> <span class="n">to</span>
+ <span class="n">the</span> <span class="n">key</span>
+ <span class="o">--</span><span class="n">charset</span> <span class="p">(</span><span class="o">-</span><span class="n">c</span><span class="p">)</span> <span class="n">charset</span> <span class="n">The</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">character</span>
+ <span class="n">encoding</span> <span class="n">of</span> <span class="n">the</span> <span class="n">input</span> <span class="n">files</span><span class="p">.</span>
+ <span class="n">Default</span> <span class="n">to</span> <span class="n">UTF</span><span class="o">-</span>8 <span class="p">{</span><span class="n">accepts</span><span class="p">:</span> <span class="n">cp1252</span><span class="o">|</span><span class="n">ascii</span><span class="p">...}</span>
+ <span class="o">--</span><span class="n">method</span> <span class="p">(</span><span class="o">-</span><span class="n">xm</span><span class="p">)</span> <span class="n">method</span> <span class="n">The</span> <span class="n">execution</span> <span class="n">method</span> <span class="n">to</span> <span class="n">use</span><span class="p">:</span>
+ <span class="n">sequential</span> <span class="n">or</span> <span class="n">mapreduce</span><span class="p">.</span>
+ <span class="n">Default</span> <span class="n">is</span> <span class="n">mapreduce</span>
+ <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">present</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span>
+ <span class="n">output</span> <span class="n">directory</span> <span class="n">before</span>
+ <span class="n">running</span> <span class="n">job</span>
+ <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
+ <span class="o">--</span><span class="n">tempDir</span> <span class="n">tempDir</span> <span class="n">Intermediate</span> <span class="n">output</span> <span class="n">directory</span>
+ <span class="o">--</span><span class="n">startPhase</span> <span class="n">startPhase</span> <span class="n">First</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span>
+ <span class="o">--</span><span class="n">endPhase</span> <span class="n">endPhase</span> <span class="n">Last</span> <span class="n">phase</span> <span class="n">to</span> <span class="n">run</span> <span class="o">></span>
</pre></div>
<p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
-<h2 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h2>
-<p>+<em>Mahout_0.3</em>+</p>
+<h4 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h4>
<p>From the sequence file generated from the above step run the following to
generate vectors. </p>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seq2sparse
-
-<span class="o">-</span>i <span class="o"><</span>PATH TO THE SEQUENCEFILES<span class="o">></span>
-
-<span class="o">-</span>o <span class="o"><</span>OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED<span class="o">></span>
-
-<span class="o"><-</span>wt <span class="o"><</span>WEIGHTING METHOD USED<span class="o">></span> <span class="p">{</span>tf<span class="o">|</span>tfidf<span class="p">}</span><span class="o">></span>
-
-<span class="o"><-</span>chunk <span class="o"><</span>MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY<span class="o">></span> <span class="m">100</span><span class="o">></span>
-
-<span class="o"><-</span>a <span class="o"><</span>NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT<span class="o">></span>
-
-org.apache.lucene.analysis.standard.StandardAnalyzer<span class="o">></span>
-
-<span class="o"><--</span>minSupport <span class="o"><</span>MINIMUM SUPPORT<span class="o">></span> <span class="m">2</span><span class="o">></span>
-
-<span class="o"><--</span>minDF <span class="o"><</span>MINIMUM DOCUMENT FREQUENCY<span class="o">></span> <span class="m">1</span><span class="o">></span>
-
-<span class="o"><--</span>maxDFPercent <span class="o"><</span>MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN <span class="m">0</span><span class="o">-</span><span class="m">100</span><span class="o">></span> <span class="m">99</span><span class="o">></span>
-
-<span class="o"><--</span>norm <span class="o"><</span>REFER TO L_2 NORM ABOVE<span class="o">></span><span class="p">{</span>INF<span class="o">|</span>integer <span class="o">>=</span> <span class="m">0</span><span class="p">}</span><span class="o">></span><span class="s">"</span>
-
-<span class="s"><-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>"</span>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seq2sparse</span>
+ <span class="o">--</span><span class="n">minSupport</span> <span class="p">(</span><span class="o">-</span><span class="n">s</span><span class="p">)</span> <span class="n">minSupport</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Minimum</span> <span class="n">Support</span><span class="p">.</span> <span class="n">Default</span>
+ <span class="n">Value</span><span class="p">:</span> 2
+ <span class="o">--</span><span class="n">analyzerName</span> <span class="p">(</span><span class="o">-</span><span class="n">a</span><span class="p">)</span> <span class="n">analyzerName</span> <span class="n">The</span> <span class="n">class</span> <span class="n">name</span> <span class="n">of</span> <span class="n">the</span> <span class="n">analyzer</span>
+ <span class="o">--</span><span class="n">chunkSize</span> <span class="p">(</span><span class="o">-</span><span class="n">chunk</span><span class="p">)</span> <span class="n">chunkSize</span> <span class="n">The</span> <span class="n">chunkSize</span> <span class="n">in</span> <span class="n">MegaBytes</span><span class="p">.</span> <span class="n">Default</span>
+ <span class="n">Value</span><span class="p">:</span> 100<span class="n">MB</span>
+ <span class="o">--</span><span class="n">output</span> <span class="p">(</span><span class="o">-</span><span class="n">o</span><span class="p">)</span> <span class="n">output</span> <span class="n">The</span> <span class="n">directory</span> <span class="n">pathname</span> <span class="k">for</span> <span class="n">output</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">input</span> <span class="p">(</span><span class="o">-</span><span class="nb">i</span><span class="p">)</span> <span class="n">input</span> <span class="n">Path</span> <span class="n">to</span> <span class="n">job</span> <span class="n">input</span> <span class="n">directory</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">minDF</span> <span class="p">(</span><span class="o">-</span><span class="n">md</span><span class="p">)</span> <span class="n">minDF</span> <span class="n">The</span> <span class="n">minimum</span> <span class="n">document</span> <span class="n">frequency</span><span class="p">.</span> <span class="n">Default</span>
+ <span class="n">is</span> 1
+ <span class="o">--</span><span class="n">maxDFSigma</span> <span class="p">(</span><span class="o">-</span><span class="n">xs</span><span class="p">)</span> <span class="n">maxDFSigma</span> <span class="n">What</span> <span class="n">portion</span> <span class="n">of</span> <span class="n">the</span> <span class="n">tf</span> <span class="p">(</span><span class="n">tf</span><span class="o">-</span><span class="n">idf</span><span class="p">)</span> <span class="n">vectors</span>
+ <span class="n">to</span> <span class="n">be</span> <span class="n">used</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">in</span> <span class="n">times</span> <span class="n">the</span>
+ <span class="n">standard</span> <span class="n">deviation</span> <span class="p">(</span><span class="n">sigma</span><span class="p">)</span> <span class="n">of</span> <span class="n">the</span>
+ <span class="n">document</span> <span class="n">frequencies</span> <span class="n">of</span> <span class="n">these</span> <span class="n">vectors</span><span class="p">.</span>
+ <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span>
+ <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">a</span> <span class="n">double</span>
+ <span class="n">value</span><span class="p">.</span> <span class="n">Good</span> <span class="n">value</span> <span class="n">to</span> <span class="n">be</span> <span class="n">specified</span> <span class="n">is</span> 3<span class="p">.</span>0<span class="p">.</span>
+ <span class="n">In</span> <span class="k">case</span> <span class="n">the</span> <span class="n">value</span> <span class="n">is</span> <span class="n">less</span> <span class="n">than</span> 0 <span class="n">no</span>
+ <span class="n">vectors</span> <span class="n">will</span> <span class="n">be</span> <span class="n">filtered</span> <span class="n">out</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span>
+ <span class="o">-</span>1<span class="p">.</span>0<span class="p">.</span> <span class="n">Overrides</span> <span class="n">maxDFPercent</span>
+ <span class="o">--</span><span class="n">maxDFPercent</span> <span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="n">maxDFPercent</span> <span class="n">The</span> <span class="n">max</span> <span class="n">percentage</span> <span class="n">of</span> <span class="n">docs</span> <span class="k">for</span> <span class="n">the</span> <span class="n">DF</span><span class="p">.</span>
+ <span class="n">Can</span> <span class="n">be</span> <span class="n">used</span> <span class="n">to</span> <span class="n">remove</span> <span class="n">really</span> <span class="n">high</span>
+ <span class="n">frequency</span> <span class="n">terms</span><span class="p">.</span> <span class="n">Expressed</span> <span class="n">as</span> <span class="n">an</span> <span class="n">integer</span>
+ <span class="n">between</span> 0 <span class="n">and</span> 100<span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> 99<span class="p">.</span> <span class="n">If</span>
+ <span class="n">maxDFSigma</span> <span class="n">is</span> <span class="n">also</span> <span class="n">set</span><span class="p">,</span> <span class="n">it</span> <span class="n">will</span> <span class="n">override</span>
+ <span class="n">this</span> <span class="n">value</span><span class="p">.</span>
+ <span class="o">--</span><span class="n">weight</span> <span class="p">(</span><span class="o">-</span><span class="n">wt</span><span class="p">)</span> <span class="n">weight</span> <span class="n">The</span> <span class="n">kind</span> <span class="n">of</span> <span class="n">weight</span> <span class="n">to</span> <span class="n">use</span><span class="p">.</span> <span class="n">Currently</span> <span class="n">TF</span>
+ <span class="n">or</span> <span class="n">TFIDF</span><span class="p">.</span> <span class="n">Default</span><span class="p">:</span> <span class="n">TFIDF</span>
+ <span class="o">--</span><span class="n">norm</span> <span class="p">(</span><span class="o">-</span><span class="n">n</span><span class="p">)</span> <span class="n">norm</span> <span class="n">The</span> <span class="n">norm</span> <span class="n">to</span> <span class="n">use</span><span class="p">,</span> <span class="n">expressed</span> <span class="n">as</span> <span class="n">either</span> <span class="n">a</span>
+ <span class="n">float</span> <span class="n">or</span> "<span class="n">INF</span>" <span class="k">if</span> <span class="n">you</span> <span class="n">want</span> <span class="n">to</span> <span class="n">use</span> <span class="n">the</span>
+ <span class="n">Infinite</span> <span class="n">norm</span><span class="p">.</span> <span class="n">Must</span> <span class="n">be</span> <span class="n">greater</span> <span class="n">or</span> <span class="n">equal</span>
+ <span class="n">to</span> 0<span class="p">.</span> <span class="n">The</span> <span class="n">default</span> <span class="n">is</span> <span class="n">not</span> <span class="n">to</span> <span class="n">normalize</span>
+ <span class="o">--</span><span class="n">minLLR</span> <span class="p">(</span><span class="o">-</span><span class="n">ml</span><span class="p">)</span> <span class="n">minLLR</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span><span class="n">The</span> <span class="n">minimum</span> <span class="n">Log</span> <span class="n">Likelihood</span>
+ <span class="n">Ratio</span><span class="p">(</span><span class="n">Float</span><span class="p">)</span> <span class="n">Default</span> <span class="n">is</span> 1<span class="p">.</span>0
+ <span class="o">--</span><span class="n">numReducers</span> <span class="p">(</span><span class="o">-</span><span class="n">nr</span><span class="p">)</span> <span class="n">numReducers</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Number</span> <span class="n">of</span> <span class="n">reduce</span> <span class="n">tasks</span><span class="p">.</span>
+ <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span> 1
+ <span class="o">--</span><span class="n">maxNGramSize</span> <span class="p">(</span><span class="o">-</span><span class="n">ng</span><span class="p">)</span> <span class="n">ngramSize</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">The</span> <span class="n">maximum</span> <span class="nb">size</span> <span class="n">of</span> <span class="n">ngrams</span> <span class="n">to</span>
+ <span class="n">create</span> <span class="p">(</span>2 <span class="p">=</span> <span class="n">bigrams</span><span class="p">,</span> 3 <span class="p">=</span> <span class="n">trigrams</span><span class="p">,</span> <span class="n">etc</span><span class="p">)</span>
+ <span class="n">Default</span> <span class="n">Value</span><span class="p">:</span>1
+ <span class="o">--</span><span class="n">overwrite</span> <span class="p">(</span><span class="o">-</span><span class="n">ow</span><span class="p">)</span> <span class="n">If</span> <span class="n">set</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">the</span> <span class="n">output</span> <span class="n">directory</span>
+ <span class="o">--</span><span class="n">help</span> <span class="p">(</span><span class="o">-</span><span class="n">h</span><span class="p">)</span> <span class="n">Print</span> <span class="n">out</span> <span class="n">help</span>
+ <span class="o">--</span><span class="n">sequentialAccessVector</span> <span class="p">(</span><span class="o">-</span><span class="n">seq</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>
+ <span class="n">be</span> <span class="n">SequentialAccessVectors</span><span class="p">.</span> <span class="n">Default</span> <span class="n">is</span> <span class="n">false</span><span class="p">;</span>
+ <span class="n">true</span> <span class="n">required</span> <span class="k">for</span> <span class="n">running</span> <span class="n">some</span> <span class="n">algorithms</span>
+ <span class="p">(</span><span class="n">LDA</span><span class="p">,</span><span class="n">Lanczos</span><span class="p">)</span>
+ <span class="o">--</span><span class="n">namedVector</span> <span class="p">(</span><span class="o">-</span><span class="n">nv</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>
+ <span class="n">be</span> <span class="n">NamedVectors</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span>
+ <span class="o">--</span><span class="n">logNormalize</span> <span class="p">(</span><span class="o">-</span><span class="n">lnorm</span><span class="p">)</span> <span class="p">(</span><span class="n">Optional</span><span class="p">)</span> <span class="n">Whether</span> <span class="n">output</span> <span class="n">vectors</span> <span class="n">should</span>
+ <span class="n">be</span> <span class="n">logNormalize</span><span class="p">.</span> <span class="n">If</span> <span class="n">set</span> <span class="n">true</span> <span class="k">else</span> <span class="n">false</span>
</pre></div>
-<p>--minSupport is the min frequency for the word to be considered as a
-feature. --minDF is the min number of documents the word needs to be in
---maxDFPercent is the max value of the expression (document frequency of a
-word/total number of document) to be considered as good feature to be in
-the document. This helps remove high frequency features like stop words</p>
-<p><a name="CreatingVectorsfromText-Background"></a></p>
-<h1 id="background">Background</h1>
+<p>--minSupport is the minimum frequency a word must have to be considered as a feature. --minDF is the minimum number of documents the word needs to appear in. --maxDFPercent is the maximum value of the expression (document frequency of a word / total number of documents) for the word to be considered a good feature. Together these options prune uninformative features; --maxDFPercent in particular helps remove high-frequency features like stop words.
+<a name="CreatingVectorsfromText-Background"></a></p>
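+<p>To make the three thresholds concrete, here is a minimal Python sketch of the filtering they describe. It is an illustration only, not Mahout's implementation: the function name <code>select_features</code>, the toy corpus, and the choice of inclusive comparisons are assumptions, and Mahout may apply these thresholds at different stages of the job.</p>

```python
from collections import Counter


def select_features(docs, min_support=2, min_df=1, max_df_percent=99):
    """Toy illustration of the three thresholds; not Mahout's implementation.

    docs is a list of tokenized documents (lists of words). The inclusive
    comparisons below are an assumption made for this sketch.
    """
    term_freq = Counter()  # total occurrences of each word across the corpus
    doc_freq = Counter()   # number of documents that contain each word
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    kept = []
    for word in sorted(term_freq):
        # Document frequency of the word / total number of documents, as a percentage.
        df_percent = 100.0 * doc_freq[word] / n_docs
        if (term_freq[word] >= min_support
                and doc_freq[word] >= min_df
                and df_percent <= max_df_percent):
            kept.append(word)
    return kept


# Hypothetical four-document corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a dog barked at the bird".split(),
]

features = select_features(docs, min_support=2, min_df=2, max_df_percent=75)
print(features)  # ['bird', 'cat', 'dog']
```

+<p>In this toy corpus "the" appears in every document (100% document frequency), so the 75% cap drops it, while words that occur only once fall below the minimum support.</p>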
+<h2 id="background">Background</h2>
<ul>
<li><a href="http://markmail.org/thread/l5zi3yk446goll3o">Discussion on centroid calculations with sparse vectors</a></li>
</ul>