Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/20 16:36:36 UTC
svn commit: r1543844 -
/mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext
Author: isabel
Date: Wed Nov 20 15:36:35 2013
New Revision: 1543844
URL: http://svn.apache.org/r1543844
Log:
MAHOUT-1245 - Fixed formatting, links etc. on creating vectors from text page
Modified:
mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext?rev=1543844&r1=1543843&r2=1543844&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/basics/creating-vectors-from-text.mdtext Wed Nov 20 15:36:35 2013
@@ -1,11 +1,13 @@
Title: Creating Vectors from Text
-+*Mahout_0.2*+
-{toc:style=disc|indent=20px}
+
+# Creating vectors from text
+
+available starting *Mahout_0.2*
<a name="CreatingVectorsfromText-Introduction"></a>
# Introduction
-For clustering documents it is usually necessary to convert the raw text
+For clustering and classifying documents it is usually necessary to convert the raw text
into vectors that can then be consumed by the clustering [Algorithms](algorithms.html)
. These approaches are described below.
@@ -24,11 +26,11 @@ representations from a Lucene (and Solr,
For this, we assume you know how to build a Lucene/Solr index. For those
who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr)
as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index. For those wanting to use just Lucene, see the Lucene [website|http://lucene.apache.org/java]
+index. For those wanting to use just Lucene, see the [Lucene website](http://lucene.apache.org/java)
or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
McCandless.
-To get started, make sure you get a fresh copy of Mahout from [SVN](http://cwiki.apache.org/MAHOUT/buildingmahout.html)
+To get started, make sure you get a fresh copy of Mahout from [SVN](../developers/buildingmahout.html)
and are comfortable building it. It defines interfaces and implementations
for efficiently iterating over a Data Source (it only supports Lucene
currently, but should be extensible to databases, Solr, etc.) and produces
@@ -41,7 +43,7 @@ option. Examples of running the Driver
<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
## Generating an output file from a Lucene Index
-
+<blockquote>
$MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE
INDEX> \
--output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO>
@@ -49,14 +51,17 @@ INDEX> \
<--max <Number of vectors to output>> <--norm {INF|integer >= 0}>
<--idField <Name of the idField in the Lucene index>>
+</blockquote>
<a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a>
### Create 50 Vectors from an Index
+<blockquote>
$MAHOUT_HOME/bin/mahout lucene.vector --dir
<PATH>/wikipedia/solr/data/index --field body \
--dictOut <PATH>/solr/wikipedia/dict.txt --output
<PATH>/solr/wikipedia/out.txt --max 50
+</blockquote>
This uses the index specified by --dir and the body field in it and writes
out the info to the output dir and the dictionary to dict.txt. It only
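As a concrete sketch of the same invocation (paths and the field name are illustrative, and this assumes Mahout is built and MAHOUT_HOME is set):

```shell
# Pull the "body" field out of an existing Lucene index, writing 50
# vectors to out.txt and the term dictionary to dict.txt.
# All paths below are illustrative; adjust them to your layout.
$MAHOUT_HOME/bin/mahout lucene.vector \
  --dir /data/wikipedia/solr/data/index \
  --field body \
  --dictOut /data/wikipedia/dict.txt \
  --output /data/wikipedia/out.txt \
  --max 50
```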
@@ -93,12 +98,13 @@ PARENT>/document.txt
From the examples directory run
+<blockquote>
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
-
+</blockquote>
<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
## Creating Vectors from SequenceFile
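A concrete sketch of the seqdirectory step (the directory names are illustrative; the flags are the ones listed above):

```shell
# Convert a directory tree of plain-text documents into a SequenceFile
# of <docid, text> pairs, reading the input as UTF-8 and writing chunks
# of at most 64 MB. Paths are illustrative.
$MAHOUT_HOME/bin/mahout seqdirectory \
  --input /tmp/mahout-work/docs \
  --output /tmp/mahout-work/seqfiles \
  -c UTF-8 \
  -chunk 64
```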
@@ -108,6 +114,7 @@ From the examples directory run
From the sequence file generated from the above step run the following to
generate vectors.
+<blockquote>
$MAHOUT_HOME/bin/mahout seq2sparse \
-i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND
DICTIONARY IS GENERATED> \
@@ -121,7 +128,7 @@ org.apache.lucene.analysis.standard.Stan
<--norm <REFER TO L_2 NORM ABOVE>{INF|integer >= 0}>"
<-seq <Create SequentialAccessVectors>{false|true required for running some
algorithms(LDA,Lanczos)}>"
-
+</blockquote>
--minSupport is the min frequency for the word to be considered as a
feature. --minDF is the min number of documents the word needs to be in
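Putting the options above together, a typical seq2sparse run might look like this (paths are illustrative; the long option names are those shown in the usage above):

```shell
# Generate sparse vectors from the SequenceFile produced by seqdirectory.
# Terms must occur at least twice overall (--minSupport), in at least one
# document (--minDF), and in no more than 90% of documents (--maxDFPercent);
# vectors are L_2 normalized and written as SequentialAccessVectors.
$MAHOUT_HOME/bin/mahout seq2sparse \
  -i /tmp/mahout-work/seqfiles \
  -o /tmp/mahout-work/vectors \
  --minSupport 2 --minDF 1 --maxDFPercent 90 \
  --norm 2 \
  -seq
```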
@@ -132,18 +139,7 @@ the document. This helps remove high fre
<a name="CreatingVectorsfromText-Background"></a>
# Background
-*
-http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
-*
-http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
-
-<a name="CreatingVectorsfromText-FromaDatabase"></a>
-# From a Database
-
-+*TODO:*+
-
-<a name="CreatingVectorsfromText-Other"></a>
-# Other
+* [Discussion on centroid calculations with sparse vectors](http://markmail.org/thread/l5zi3yk446goll3o)
<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
## Converting existing vectors to Mahout's format
@@ -156,7 +152,10 @@ Iterable<Vector> (called VectorIterable
reuse the existing VectorWriter classes:
+<blockquote>
VectorWriter vectorWriter = new SequenceFileVectorWriter(SequenceFile.createWriter(filesystem,
configuration, outfile, LongWritable.class, VectorWritable.class));
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
vectorWriter.close();
+</blockquote>
+