Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/20 16:52:41 UTC

svn commit: r887366 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html

Author: buildbot
Date: Wed Nov 20 15:52:41 2013
New Revision: 887366

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 15:52:41 2013
@@ -1 +1 @@
-1543848
+1543849

Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Wed Nov 20 15:52:41 2013
@@ -414,7 +414,6 @@ several input options, which can be disp
 option.  Examples of running the Driver are included below:</p>
 <p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
 <h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2>
-<p>~~~~.html</p>
 <div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout lucene.vector <span class="o">&lt;</span>PATH TO DIRECTORY CONTAINING LUCENE INDEX<span class="o">&gt;</span>
 
     <span class="o">--</span>output <span class="o">&lt;</span>PATH TO OUTPUT LOCATION<span class="o">&gt;</span>
@@ -429,19 +428,15 @@ option.  Examples of running the Driver 
 </pre></div>
 
 
-<p>~~~~</p>
-<p></code></pre></p>
 <p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p>
 <h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3>
-<blockquote>
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
 
-    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body 
+    <span class="o">--</span><span class="n">dictOut</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
 
-        --dictOut <PATH>/solr/wikipedia/dict.txt
-
-        --output <PATH>/solr/wikipedia/out.txt --max 50
+    <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50
+</pre></div>
 
-</blockquote>
 
 <p>This uses the index specified by --dir and the body field in it and writes
 out the info to the output dir and the dictionary to dict.txt.  It only
@@ -449,15 +444,13 @@ outputs 50 vectors.  If you don't specif
 the index are output.</p>
 <p><a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe<a href="http://en.wikipedia.org/wiki/Lp_space">L_2Norm</a>"></a></p>
 <h3 id="normalize-50-vectors-from-a-lucene-index-using-the-l_2-normhttpenwikipediaorgwikilp_space">Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]</h3>
-<blockquote>
-
-    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body 
+<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">solr</span><span class="o">/</span><span class="n">data</span><span class="o">/</span><span class="n">index</span> <span class="o">--</span><span class="n">field</span> <span class="n">body</span>
 
-          --dictOut <PATH>/solr/wikipedia/dict.txt
+      <span class="o">--</span><span class="n">dictOut</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">dict</span><span class="p">.</span><span class="n">txt</span>
 
-          --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
+      <span class="o">--</span><span class="n">output</span> <span class="o">&lt;</span><span class="n">PATH</span><span class="o">&gt;/</span><span class="n">solr</span><span class="o">/</span><span class="n">wikipedia</span><span class="o">/</span><span class="n">out</span><span class="p">.</span><span class="n">txt</span> <span class="o">--</span><span class="n">max</span> 50 <span class="o">--</span><span class="n">norm</span> 2
+</pre></div>
 
-</blockquote>
 
 <p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
 <h1 id="from-directory-of-text-documents">From Directory of Text documents</h1>
@@ -476,53 +469,48 @@ sub-directories and creates the Sequence
 the document id generated is <PREFIX><RELATIVE PATH FROM
 PARENT>/document.txt</p>
 <p>From the examples directory run</p>
-<blockquote>
+<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seqdirectory
 
-    $MAHOUT_HOME/bin/mahout seqdirectory 
-    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> 
+<span class="o">--</span>input <span class="o">&lt;</span>PARENT DIR WHERE DOCS ARE LOCATED<span class="o">&gt;</span> <span class="o">--</span>output <span class="o">&lt;</span>OUTPUT DIRECTORY<span class="o">&gt;</span>
 
-    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> 
+<span class="o">&lt;-</span>c <span class="o">&lt;</span>CHARSET NAME OF THE INPUT DOCUMENTS<span class="o">&gt;</span> <span class="p">{</span>UTF<span class="o">-</span><span class="m">8</span><span class="o">|</span>cp1252<span class="o">|</span>ascii...<span class="p">}</span><span class="o">&gt;</span>
 
-    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> 
+<span class="o">&lt;-</span>chunk <span class="o">&lt;</span>MAX SIZE OF EACH CHUNK in Megabytes<span class="o">&gt;</span> <span class="m">64</span><span class="o">&gt;</span>
 
-    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
+<span class="o">&lt;-</span>prefix <span class="o">&lt;</span>PREFIX TO ADD TO THE DOCUMENT ID<span class="o">&gt;&gt;</span>
+</pre></div>
 
-</blockquote>
 
 <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
 <h2 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h2>
 <p>+<em>Mahout_0.3</em>+</p>
 <p>From the sequence file generated from the above step run the following to
 generate vectors. </p>
-<blockquote>
-    $MAHOUT_HOME/bin/mahout seq2sparse
+<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seq2sparse
 
-    -i <PATH TO THE SEQUENCEFILES> 
+<span class="o">-</span>i <span class="o">&lt;</span>PATH TO THE SEQUENCEFILES<span class="o">&gt;</span>
 
-    -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> 
+<span class="o">-</span>o <span class="o">&lt;</span>OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED<span class="o">&gt;</span>
 
-    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> 
+<span class="o">&lt;-</span>wt <span class="o">&lt;</span>WEIGHTING METHOD USED<span class="o">&gt;</span> <span class="p">{</span>tf<span class="o">|</span>tfidf<span class="p">}</span><span class="o">&gt;</span>
 
-    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> 
+<span class="o">&lt;-</span>chunk <span class="o">&lt;</span>MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY<span class="o">&gt;</span> <span class="m">100</span><span class="o">&gt;</span>
 
-    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT>
+<span class="o">&lt;-</span>a <span class="o">&lt;</span>NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT<span class="o">&gt;</span>
 
-</blockquote>
+org.apache.lucene.analysis.standard.StandardAnalyzer<span class="o">&gt;</span>
 
-<blockquote>
-org.apache.lucene.analysis.standard.StandardAnalyzer>
+<span class="o">&lt;--</span>minSupport <span class="o">&lt;</span>MINIMUM SUPPORT<span class="o">&gt;</span> <span class="m">2</span><span class="o">&gt;</span>
 
-    <--minSupport <MINIMUM SUPPORT> 2> 
+<span class="o">&lt;--</span>minDF <span class="o">&lt;</span>MINIMUM DOCUMENT FREQUENCY<span class="o">&gt;</span> <span class="m">1</span><span class="o">&gt;</span>
 
-    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> 
+<span class="o">&lt;--</span>maxDFPercent <span class="o">&lt;</span>MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN <span class="m">0</span><span class="o">-</span><span class="m">100</span><span class="o">&gt;</span> <span class="m">99</span><span class="o">&gt;</span>
 
-    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> 
+<span class="o">&lt;--</span>norm <span class="o">&lt;</span>REFER TO L_2 NORM ABOVE<span class="o">&gt;</span><span class="p">{</span>INF<span class="o">|</span>integer <span class="o">&gt;=</span> <span class="m">0</span><span class="p">}</span><span class="o">&gt;</span><span class="s">&quot;</span>
 
-    <--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}>"
-
-    <-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>"
+<span class="s">&lt;-seq &lt;Create SequentialAccessVectors&gt;{false|true required for running some algorithms(LDA,Lanczos)}&gt;&quot;</span>
+</pre></div>
 
-</blockquote>
 
 <p>--minSupport is the min frequency for the word to  be considered as a
 feature. --minDF is the min number of documents the word needs to be in
@@ -542,12 +530,10 @@ question arises of how to convert the ve
 format. Probably the easiest way to go would be to implement your own
 Iterable<Vector> (called VectorIterable in the example below) and then
 reuse the existing VectorWriter classes:</p>
-<blockquote>
-    VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
+<div class="codehilite"><pre><span class="n">VectorWriter</span> <span class="n">vectorWriter</span> <span class="p">=</span> <span class="n">SequenceFile</span><span class="p">.</span><span class="n">createWriter</span><span class="p">(</span><span class="n">filesystem</span><span class="p">,</span> <span class="n">configuration</span><span class="p">,</span> <span class="n">outfile</span><span class="p">,</span> <span class="n">LongWritable</span><span class="p">.</span><span class="n">class</span><span class="p">,</span> <span class="n">SparseVector</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
 
-    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
-
-</blockquote>
+<span class="n">long</span> <span class="n">numDocs</span> <span class="p">=</span> <span class="n">vectorWriter</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">new</span> <span class="n">VectorIterable</span><span class="p">(),</span> <span class="n">Long</span><span class="p">.</span><span class="n">MAX_VALUE</span><span class="p">);</span>
+</pre></div>
    </div>
   </div>     
 </div>
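
The page changed in this commit describes `--norm 2` as normalizing each output vector by its L_2 norm. As a rough illustration of what that means for one vector of term weights, here is a minimal sketch in plain Java. It deliberately avoids Mahout's own classes: the class name `L2NormSketch` and the method `normalize` are hypothetical and exist only for this example, and it is not a description of how `lucene.vector` or `seq2sparse` implement the option internally.

```java
public class L2NormSketch {

    // Hypothetical helper: returns a new array scaled so that its
    // Euclidean (L_2) length is 1. A zero vector is returned as all
    // zeros to avoid dividing by zero.
    static double[] normalize(double[] weights) {
        double sumOfSquares = 0.0;
        for (double w : weights) {
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);

        double[] result = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            result[i] = (norm == 0.0) ? 0.0 : weights[i] / norm;
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative term weights for a two-term document.
        double[] termWeights = {3.0, 4.0};
        double[] unit = normalize(termWeights);
        // The L_2 norm of (3, 4) is 5, so this prints: 0.6 0.8
        System.out.println(unit[0] + " " + unit[1]);
    }
}
```

After normalization every non-zero vector has length 1, so long and short documents become comparable by direction rather than by raw term counts.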