Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/20 16:36:39 UTC

svn commit: r887362 - in /websites/staging/mahout/trunk/content: ./ users/basics/creating-vectors-from-text.html

Author: buildbot
Date: Wed Nov 20 15:36:39 2013
New Revision: 887362

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 15:36:39 2013
@@ -1 +1 @@
-1543801
+1543844

Modified: websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/creating-vectors-from-text.html Wed Nov 20 15:36:39 2013
@@ -381,11 +381,11 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p>+<em>Mahout_0.2</em>+
-{toc:style=disc|indent=20px}</p>
+    <h1 id="creating-vectors-from-text">Creating vectors from text</h1>
+<p>available starting <em>Mahout_0.2</em></p>
 <p><a name="CreatingVectorsfromText-Introduction"></a></p>
 <h1 id="introduction">Introduction</h1>
-<p>For clustering documents it is usually necessary to convert the raw text
+<p>For clustering and classifying documents it is usually necessary to convert the raw text
 into vectors that can then be consumed by the clustering <a href="algorithms.html">Algorithms</a>.
 These approaches are described below.</p>
 <p><a name="CreatingVectorsfromText-FromLucene"></a></p>
@@ -400,10 +400,10 @@ representations from a Lucene (and Solr,
 <p>For this, we assume you know how to build a Lucene/Solr index.  For those
 who don't, it is probably easiest to get up and running using <a href="http://lucene.apache.org/solr">Solr</a>
  as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index.  For those wanting to use just Lucene, see the Lucene [website|http://lucene.apache.org/java]
+index.  For those wanting to use just Lucene, see the <a href="http://lucene.apache.org/java">Lucene website</a>
  or check out <em>Lucene In Action</em> by Erik Hatcher, Otis Gospodnetic and Mike
 McCandless.</p>
-<p>To get started, make sure you get a fresh copy of Mahout from <a href="http://cwiki.apache.org/MAHOUT/buildingmahout.html">SVN</a>
+<p>To get started, make sure you get a fresh copy of Mahout from <a href="../developers/buildingmahout.html">SVN</a>
 and are comfortable building it. Mahout defines interfaces and implementations
 for efficiently iterating over a Data Source (it only supports Lucene
 currently, but should be extensible to databases, Solr, etc.) and produces
@@ -414,24 +414,25 @@ several input options, which can be disp
 option.  Examples of running the Driver are included below:</p>
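+<p>For example, to list the driver's full option set (a minimal sketch; it
+assumes a built Mahout checkout with MAHOUT_HOME pointing at it):</p>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout lucene.vector --help
+</blockquote>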
 <p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
 <h2 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h2>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">&lt;</span><span class="n">PATH</span> <span class="n">TO</span> <span class="n">DIRECTORY</span> <span class="n">CONTAINING</span> <span class="n">LUCENE</span>
-</pre></div>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
+       --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> \
+       --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
+       <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> \
+       <--idField <Name of the idField in the Lucene index>>
+</blockquote>
 
-<p>INDEX&gt; \
-       --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO]
- \\
-       <--max <Number of vectors to output>&gt; &lt;--norm {INF|integer &gt;= 0}&gt;
-&lt;--idField <Name of the idField in the Lucene index>&gt;</p>
 <p><a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a></p>
 <h3 id="create-50-vectors-from-an-index">Create 50 Vectors from an Index</h3>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">lucene</span><span class="p">.</span><span class="n">vector</span> <span class="o">--</span><span class="n">dir</span>
-</pre></div>
-
-
-<p><PATH>/wikipedia/solr/data/index --field body \
-       --dictOut <PATH>/solr/wikipedia/dict.txt --output
-<PATH>/solr/wikipedia/out.txt --max 50</p>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
+        --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50
+</blockquote>
+
 <p>This uses the index specified by --dir, takes the body field from it, and
 writes the vectors to the output location and the dictionary to dict.txt.  It
 only outputs 50 vectors.  If you don't specify --max, then all the documents in
@@ -462,35 +463,35 @@ sub-directories and creates the Sequence
 the document id generated is <PREFIX><RELATIVE PATH FROM
 PARENT>/document.txt</p>
 <p>From the examples directory run</p>
-<div class="codehilite"><pre><span class="p">$</span>MAHOUT_HOME<span class="o">/</span>bin<span class="o">/</span>mahout seqdirectory \
-<span class="o">--</span>input <span class="o">&lt;</span>PARENT DIR WHERE DOCS ARE LOCATED<span class="o">&gt;</span> <span class="o">--</span>output <span class="o">&lt;</span>OUTPUT DIRECTORY<span class="o">&gt;</span> \
-<span class="o">&lt;-</span>c <span class="o">&lt;</span>CHARSET NAME OF THE INPUT DOCUMENTS<span class="o">&gt;</span> <span class="p">{</span>UTF<span class="o">-</span><span class="m">8</span><span class="o">|</span>cp1252<span class="o">|</span>ascii...<span class="p">}</span><span class="o">&gt;</span> \
-<span class="o">&lt;-</span>chunk <span class="o">&lt;</span>MAX SIZE OF EACH CHUNK in Megabytes<span class="o">&gt;</span> <span class="m">64</span><span class="o">&gt;</span> \
-<span class="o">&lt;-</span>prefix <span class="o">&lt;</span>PREFIX TO ADD TO THE DOCUMENT ID<span class="o">&gt;&gt;</span>
-</pre></div>
-
+<blockquote>
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
+    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
+    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
+    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
+</blockquote>
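+<p>For example, a minimal run over a local folder of plain-text files might look
+like this (the input and output paths here are hypothetical):</p>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+    --input ./my-docs --output ./my-docs-seqfiles -c UTF-8 -chunk 64
+</blockquote>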
 
 <p><a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a></p>
 <h2 id="creating-vectors-from-sequencefile">Creating Vectors from SequenceFile</h2>
-<p>+<em>Mahout_0.3</em>+</p>
+<p>available starting <em>Mahout_0.3</em></p>
 <p>From the sequence file generated from the above step run the following to
 generate vectors. </p>
-<div class="codehilite"><pre>$<span class="n">MAHOUT_HOME</span><span class="o">/</span><span class="n">bin</span><span class="o">/</span><span class="n">mahout</span> <span class="n">seq2sparse</span> <span class="o">\</span>
-<span class="o">-</span><span class="nb">i</span> <span class="o">&lt;</span><span class="n">PATH</span> <span class="n">TO</span> <span class="n">THE</span> <span class="n">SEQUENCEFILES</span><span class="o">&gt;</span> <span class="o">-</span><span class="n">o</span> <span class="o">&lt;</span><span class="n">OUTPUT</span> <span class="n">DIRECTORY</span> <span class="n">WHERE</span> <span class="n">VECTORS</span> <span class="n">AND</span>
-</pre></div>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout seq2sparse \
+    -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY ARE GENERATED> \
+    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
+    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
+    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
+    <--minSupport <MINIMUM SUPPORT> 2> \
+    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
+    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
+    <--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \
+    <-seq <Create SequentialAccessVectors> {false|true, required for running some algorithms (LDA, Lanczos)}>
+</blockquote>
 
-
-<p>DICTIONARY IS GENERATED&gt; \
-    &lt;-wt <WEIGHTING METHOD USED> {tf|tfidf}&gt; \
-    &lt;-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100&gt; \
-    &lt;-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT>
-org.apache.lucene.analysis.standard.StandardAnalyzer&gt; \
-    &lt;--minSupport <MINIMUM SUPPORT> 2&gt; \
-    &lt;--minDF <MINIMUM DOCUMENT FREQUENCY> 1&gt; \
-    &lt;--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99&gt; \
-    &lt;--norm <REFER TO L_2 NORM ABOVE>{INF|integer &gt;= 0}&gt;"
-    &lt;-seq <Create SequentialAccessVectors>{false|true required for running some
-algorithms(LDA,Lanczos)}&gt;"</p>
 <p>--minSupport is the minimum frequency for a word to be considered as a
 feature. --minDF is the minimum number of documents a word needs to appear in.
 --maxDFPercent is the maximum value of the expression (document frequency of a
@@ -498,15 +499,9 @@ word/total number of document) to be con
 the document. This helps remove high-frequency features like stop words.</p>
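+<p>For example, to build tf-idf vectors from the sequence files created above
+(a sketch; the paths and option values shown are only illustrative):</p>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout seq2sparse \
+    -i ./my-docs-seqfiles -o ./my-docs-vectors \
+    -wt tfidf --minSupport 2 --minDF 1 --maxDFPercent 99 --norm 2
+</blockquote>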
 <p><a name="CreatingVectorsfromText-Background"></a></p>
 <h1 id="background">Background</h1>
-<p><em>
-http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
-</em>
-http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering</p>
-<p><a name="CreatingVectorsfromText-FromaDatabase"></a></p>
-<h1 id="from-a-database">From a Database</h1>
-<p>+<em>TODO:</em>+</p>
-<p><a name="CreatingVectorsfromText-Other"></a></p>
-<h1 id="other">Other</h1>
+<ul>
+<li><a href="http://markmail.org/thread/l5zi3yk446goll3o">Discussion on centroid calculations with sparse vectors</a></li>
+</ul>
 <p><a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a></p>
 <h2 id="converting-existing-vectors-to-mahouts-format">Converting existing vectors to Mahout's format</h2>
 <p>If you are in the happy position to already own a document (as in: texts,
@@ -515,12 +510,11 @@ question arises of how to convert the ve
 format. Probably the easiest way is to implement your own
 Iterable<Vector> (called VectorIterable in the example below) and then
 reuse the existing VectorWriter classes:</p>
-<div class="codehilite"><pre><span class="n">VectorWriter</span> <span class="n">vectorWriter</span> <span class="p">=</span> <span class="n">SequenceFile</span><span class="p">.</span><span class="n">createWriter</span><span class="p">(</span><span class="n">filesystem</span><span class="p">,</span>
-</pre></div>
-
-
-<p>configuration, outfile, LongWritable.class, SparseVector.class);
-    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);</p>
+<blockquote>
+    // Wrap the Hadoop SequenceFile.Writer in Mahout's SequenceFileVectorWriter,
+    // which implements VectorWriter (keys: LongWritable, values: VectorWritable)
+    VectorWriter vectorWriter = new SequenceFileVectorWriter(
+        SequenceFile.createWriter(filesystem, configuration, outfile,
+            LongWritable.class, VectorWritable.class));
+    // Write everything the Iterable produces; returns the number of vectors written
+    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+</blockquote>
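+<p>Once written, the output can be sanity-checked with Mahout's seqdumper
+utility (the -i flag for the input path is an assumption here; check
+bin/mahout seqdumper --help for your release):</p>
+<blockquote>
+    $MAHOUT_HOME/bin/mahout seqdumper -i <PATH TO OUTPUT SEQUENCEFILE>
+</blockquote>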
    </div>
   </div>     
 </div>