Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/20 16:39:43 UTC

svn commit: r887363 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html

Author: buildbot
Date: Wed Nov 20 15:39:43 2013
New Revision: 887363

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 15:39:43 2013
@@ -1 +1 @@
-1543844
+1543845

Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Wed Nov 20 15:39:43 2013
@@ -415,22 +415,26 @@ option.  Examples of running the Driver 
 <p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
 <h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2>
 <blockquote>
-    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE
-INDEX> \
+    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX>
+
        --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO]
- \
+
        <--max <Number of vectors to output>> <--norm {INF|integer >= 0}>
-<--idField <Name of the idField in the Lucene index>>
+
+       <--idField <Name of the idField in the Lucene index>>
 
 </blockquote>
 
 <p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p>
 <h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3>
 <blockquote>
-    $MAHOUT_HOME/bin/mahout lucene.vector --dir
-<PATH>/wikipedia/solr/data/index --field body \
-        --dictOut <PATH>/solr/wikipedia/dict.txt --output
-<PATH>/solr/wikipedia/out.txt --max 50
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body 
+
+        --dictOut <PATH>/solr/wikipedia/dict.txt
+
+        --output <PATH>/solr/wikipedia/out.txt --max 50
+
 </blockquote>
 
 <p>This uses the index specified by --dir and the body field in it and writes
@@ -439,13 +443,16 @@ outputs 50 vectors.  If you don't specif
 the index are output.</p>
 <p><a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe<a href="http://en.wikipedia.org/wiki/Lp_space">L_2Norm</a>"></a></p>
 <h3 id="normalize-50-vectors-from-a-lucene-index-using-the-l_2-normhttpenwikipediaorgwikilp_space">Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span>
-</pre></div>
+<blockquote>
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body 
 
+          --dictOut <PATH>/solr/wikipedia/dict.txt
+
+          --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
+
+</blockquote>
 
-<p><PATH>/wikipedia/solr/data/index --field body \
-          --dictOut <PATH>/solr/wikipedia/dict.txt --output
-<PATH>/solr/wikipedia/out.txt --max 50 --norm 2</p>
 <p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
 <h1 id="from-directory-of-text-documents">From Directory of Text documents</h1>
 <p>Mahout has utilities to generate Vectors from a directory of text
@@ -464,11 +471,16 @@ the document id generated is <PREFIX><RE
 PARENT>/document.txt</p>
 <p>From the examples directory run</p>
 <blockquote>
-    $MAHOUT_HOME/bin/mahout seqdirectory \
-    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
-    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
-    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
+
+    $MAHOUT_HOME/bin/mahout seqdirectory 
+    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> 
+
+    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> 
+
+    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> 
+
     <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
+
 </blockquote>
 
 <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
@@ -477,19 +489,33 @@ PARENT>/document.txt</p>
 <p>From the sequence file generated from the above step run the following to
 generate vectors. </p>
 <blockquote>
-    $MAHOUT_HOME/bin/mahout seq2sparse \
-    -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND
-DICTIONARY IS GENERATED> \
-    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
-    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
+    $MAHOUT_HOME/bin/mahout seq2sparse
+
+    -i <PATH TO THE SEQUENCEFILES> 
+
+    -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> 
+
+    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> 
+
+    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> 
+
     <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT>
-org.apache.lucene.analysis.standard.StandardAnalyzer> \
-    <--minSupport <MINIMUM SUPPORT> 2> \
-    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
-    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
+
+</blockquote>
+
+<blockquote>
+org.apache.lucene.analysis.standard.StandardAnalyzer>
+
+    <--minSupport <MINIMUM SUPPORT> 2> 
+
+    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> 
+
+    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> 
+
     <--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}>"
-    <-seq <Create SequentialAccessVectors>{false|true required for running some
-algorithms(LDA,Lanczos)}>"
+
+    <-seq <Create SequentialAccessVectors>{false|true required for running some algorithms(LDA,Lanczos)}>"
+
 </blockquote>
 
 <p>--minSupport is the min frequency for the word to  be considered as a
@@ -511,9 +537,10 @@ format. Probably the easiest way to go w
 Iterable<Vector> (called VectorIterable in the example below) and then
 reuse the existing VectorWriter classes:</p>
 <blockquote>
-    VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
-configuration, outfile, LongWritable.class, SparseVector.class);
+    VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
+
     long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+
 </blockquote>
    </div>
   </div>
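
Note on the lucene.vector examples in the diff above: the added lines drop the trailing backslashes and separate the options with blank lines, so the rendered text may not paste into a shell as a single command. A minimal dry-run sketch of how the "Create 50 Vectors" and "--norm 2" invocations could be assembled on one line follows. It only prints the command and does not require Mahout. The variable PATH_PREFIX, the default values, and the function name lucene_vector_cmd are hypothetical stand-ins for the <PATH> placeholder and are not part of the original page; the flags are the ones shown in the diff.

```shell
#!/bin/sh
# Dry-run sketch: print the lucene.vector command as a single line instead of
# executing it, so this runs without a Mahout installation.

# Hypothetical defaults; the original page only shows $MAHOUT_HOME and <PATH>.
MAHOUT_HOME="${MAHOUT_HOME:-/opt/mahout}"
PATH_PREFIX="${PATH_PREFIX:-/data}"

# Usage: lucene_vector_cmd MAX_VECTORS [NORM]
# MAX_VECTORS maps to --max; NORM, if given, maps to --norm (INF or integer >= 0).
lucene_vector_cmd() {
    max="$1"
    norm="${2:-}"
    cmd="$MAHOUT_HOME/bin/mahout lucene.vector"
    cmd="$cmd --dir $PATH_PREFIX/wikipedia/solr/data/index --field body"
    cmd="$cmd --dictOut $PATH_PREFIX/solr/wikipedia/dict.txt"
    cmd="$cmd --output $PATH_PREFIX/solr/wikipedia/out.txt --max $max"
    # Append --norm only when a norm was requested.
    if [ -n "$norm" ]; then
        cmd="$cmd --norm $norm"
    fi
    printf '%s\n' "$cmd"
}

# First example from the page: 50 vectors, no normalization.
lucene_vector_cmd 50
# Second example from the page: 50 vectors, normalized with the L_2 norm.
lucene_vector_cmd 50 2
```

The printed line can be inspected before anything is run against a real index; the paths are simple concatenations and would need quoting if they contained spaces.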