Posted to commits@mahout.apache.org by sr...@apache.org on 2012/07/12 11:26:03 UTC

svn commit: r1360593 [11/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...

Added: mahout/site/trunk/content/cluster-dumper.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/cluster-dumper.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/cluster-dumper.mdtext (added)
+++ mahout/site/trunk/content/cluster-dumper.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,79 @@
+Title: Cluster Dumper
+<a name="ClusterDumper-Introduction"></a>
+# Introduction
+
+Clustering tasks in Mahout will output data as a SequenceFile of
+(Text, Cluster) pairs, where the Text is a cluster identifier string. To analyze
+this output we need to convert the sequence files to a human-readable
+format, and this is achieved using the clusterdump utility.
+
+<a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a>
+# Steps for analyzing cluster output using clusterdump utility
+
+After you've executed a clustering task (either an example or a real-world one),
+you can run clusterdumper in 2 modes:
+1. [Hadoop Environment](#ClusterDumper-HadoopEnvironment)
+1. [Standalone Java Program](#ClusterDumper-StandaloneJavaProgram)
+
+<a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a>
+### Hadoop Environment {anchor:HadoopEnvironment}
+
+If you have set up your HADOOP_HOME environment variable, you can use the
+command line utility "mahout" to execute the ClusterDumper on Hadoop. In
+this case we won't need to copy the output clusters to our local machine.
+The utility will read the output clusters present in HDFS and write the
+human-readable cluster values into our local file system. Say you've just
+executed the [synthetic control example ](clustering-of-synthetic-control-data.html)
+ and want to analyze the output; you can execute
+
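+For example, a minimal sketch (the paths are assumptions: the synthetic control
+job is assumed to have written its results to the `output` directory in HDFS,
+and the `clusters-10` directory name depends on how many iterations your run
+produced):
+
+    $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 \
+      --pointsDir output/clusteredPoints --output clusteranalyze.txt
+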
+<a name="ClusterDumper-StandaloneJavaProgram"></a>
+### Standalone Java Program
+
+ClusterDumper can also be run from the CLI. If your HADOOP_HOME environment
+variable is not set, you can still execute ClusterDumper using the "mahout"
+command line utility.
+
+1. Get the output data from Hadoop onto your local machine. For example, in
+the case where you've executed a clustering example, use the command sketched
+below.
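+
+A sketch of this step (assuming, as in the synthetic control example, that the
+clustering job wrote its results to the `output` directory in HDFS):
+
+    $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
+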
+
+This will create a folder called output inside $MAHOUT_HOME/examples, with
+sub-folders for each cluster output and for ClusteredPoints.
+
+2. Run the clusterdump utility as follows:
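+
+A sketch, using the same arguments given for the Eclipse launch configuration
+below (the `clusters-10` directory name depends on how many iterations were run):
+
+    $MAHOUT_HOME/bin/mahout clusterdump \
+      --seqFileDir $MAHOUT_HOME/examples/output/clusters-10 \
+      --pointsDir $MAHOUT_HOME/examples/output/clusteredPoints \
+      --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
+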
+
+##### Standalone Java Program through Eclipse
+
+If you are using Eclipse, set up mahout-utils as a project as specified in
+[Working with Maven in Eclipse](buildingmahout.html).
+To execute ClusterDumper.java:
+
+* Under mahout-utils, right-click on ClusterDumper.java
+* Choose Run As, Run Configurations
+* On the left menu, click on Java Application
+* On the top bar click on "New Launch Configuration"
+* A new launch configuration should be created automatically with the project set
+to "mahout-utils" and the Main Class set to
+"org.apache.mahout.utils.clustering.ClusterDumper"
+* In the Arguments tab, specify the arguments below, replacing <MAHOUT_HOME>
+with the actual path of your $MAHOUT_HOME:
+
+        --seqFileDir <MAHOUT_HOME>/examples/output/clusters-10 --pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints --output <MAHOUT_HOME>/examples/output/clusteranalyze.txt
+
+* Hit Run to execute the ClusterDumper using Eclipse.
+Setting breakpoints etc. should just work fine.
+    
+### Reading the output file
+
+This will output the clusters into a file called clusteranalyze.txt inside
+$MAHOUT_HOME/examples/output.
+Sample data will look like:
+
+CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 29.177, 29.063, 29.607]
+ r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 3.439, 3.472,
+3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511,
+3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483,
+3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731,
+4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607,
+4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] }
+
+and so on...
+Here CL-0 is cluster 0; n=116 refers to the number of points observed by this
+cluster; c = [29.922 ...] is the center of the cluster as a vector; and
+r = [3.463 ...] is the radius of the cluster as a vector.

Added: mahout/site/trunk/content/clustering-of-synthetic-control-data.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/clustering-of-synthetic-control-data.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/clustering-of-synthetic-control-data.mdtext (added)
+++ mahout/site/trunk/content/clustering-of-synthetic-control-data.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,116 @@
+Title: Clustering of synthetic control data
+* [Introduction](#Clusteringofsyntheticcontroldata-Introduction)
+* [Problem description](#Clusteringofsyntheticcontroldata-Problemdescription)
+* [Pre-Prep](#Clusteringofsyntheticcontroldata-Pre-Prep)
+* [Perform Clustering](#Clusteringofsyntheticcontroldata-PerformClustering)
+* [Read / Analyze Output](#Clusteringofsyntheticcontroldata-Read/AnalyzeOutput)
+
+<a name="Clusteringofsyntheticcontroldata-Introduction"></a>
+# Introduction
+
+This example demonstrates clustering of control charts, which are time
+series. [Control charts ](http://en.wikipedia.org/wiki/Control_chart)
+ are tools used to determine whether or not a manufacturing or business
+process is in a state of statistical control. The control charts used here are
+generated / simulated over equal time intervals and are available in the UCI
+machine learning database. The data is described [here](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html).
+
+<a name="Clusteringofsyntheticcontroldata-Problemdescription"></a>
+# Problem description
+
+The control chart time series need to be clustered into closely knit
+groups. The data set we use is synthetic; it resembles real-world
+information in an anonymized format. It contains six different classes
+(Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward
+shift). Given these trends in the input data set, a Mahout
+clustering algorithm will cluster the data into the corresponding class
+buckets. By the end of this example, you will have learned how to perform
+clustering using Mahout.
+
+<a name="Clusteringofsyntheticcontroldata-Pre-Prep"></a>
+# Pre-Prep
+
+Make sure you have the following covered before you work out the example.
+
+1. Input data set. Download it [here ](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data)
+	1. Sample input data:
+The input consists of 600 rows and 60 columns. Rows 1 - 100 contain
+normal data, rows 101 - 200 contain cyclic data, and so on. More info [here ](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html)
+. A sample of the data is shown below.
+<div class='table-wrap'>
+<table class='confluenceTable'><tbody>
+<tr><th class='confluenceTh'> \_time </th><th class='confluenceTh'> \_time+x </th><th class='confluenceTh'> \_time+2x </th><th class='confluenceTh'> .. </th><th class='confluenceTh'> \_time+60x </th></tr>
+<tr><td class='confluenceTd'> 28.7812 </td> <td class='confluenceTd'> 34.4632 </td> <td class='confluenceTd'> 31.3381 </td> <td class='confluenceTd'> .. </td> <td class='confluenceTd'> 31.2834 </td></tr>
+<tr> <td class='confluenceTd'> 24.8923 </td> <td class='confluenceTd'> 25.741 </td> <td class='confluenceTd'> 27.5532 </td> <td class='confluenceTd'> .. </td> <td class='confluenceTd'> 32.8217 </td></tr>
+</tbody></table>
+<p>..<br/>
+..</p>
+<table class='confluenceTable'><tbody>
+<tr> <td class='confluenceTd'> 35.5351 </td> <td class='confluenceTd'> 41.7067 </td> <td class='confluenceTd'> 39.1705 </td> <td class='confluenceTd'> 48.3964 </td> <td class='confluenceTd'> .. </td> <td class='confluenceTd'> 38.6103 </td></tr>
+<tr> <td class='confluenceTd'> 24.2104 </td> <td class='confluenceTd'> 41.7679 </td> <td class='confluenceTd'> 45.2228 </td> <td class='confluenceTd'> 43.7762 </td> <td class='confluenceTd'> .. </td> <td class='confluenceTd'> 48.8175 </td></tr>
+</tbody></table>
+</div>
+..
+..
+
+2. Set up Hadoop
+	1.  Assuming that you have installed the latest compatible Hadoop, start
+the daemons using <pre><code>$HADOOP_HOME/bin/start-all.sh</code></pre> If you have
+issues starting Hadoop, please reference the [Hadoop quick start guide](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html)
+	2.  Copy the input to HDFS using 
+<pre><code> 
+    $HADOOP_HOME/bin/hadoop fs -mkdir testdata
+    $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata</code></pre> 
+(the HDFS input directory name should be testdata)
+
+3. Mahout Example job
+Mahout's mahout-examples-$MAHOUT_VERSION.job does the actual clustering
+task, so it needs to be built first. This can be done as
+ <pre><code>  cd $MAHOUT_HOME
+    mvn clean install		   // full build including all unit tests
+    mvn clean install -DskipTests=true // fast build without running unit tests</code></pre> 
+
+You will see BUILD SUCCESSFUL once all the corresponding tasks are complete.
+The job will be generated in $MAHOUT_HOME/examples/target/ and its name
+will contain the $MAHOUT_VERSION number. For example, when using the Mahout 0.4
+release, the job will be mahout-examples-0.4.job.jar.
+This completes the prerequisites for performing the clustering process using
+Mahout.
+
+<a name="Clusteringofsyntheticcontroldata-PerformClustering"></a>
+# Perform Clustering
+
+With all the pre-work done, clustering the control data is quite simple.
+
+1. Depending on which clustering technique you want to use, you can invoke the corresponding job as below:
+
+	1. For [canopy ](canopy-clustering.html)<br/> <pre><code> $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job </code></pre>
+	2. For [kmeans ](k-means-clustering.html)<br/> <pre><code> $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job </code></pre>
+	3. For [fuzzykmeans ](fuzzy-k-means.html)<br/> <pre><code> $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job </code></pre>
+	4. For [dirichlet ](dirichlet-process-clustering.html)<br/> <pre><code> $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job </code></pre>
+	5. For [meanshift ](mean-shift-clustering.html) <br/> <pre><code> $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job </code></pre>
+
+2. Get the data out of HDFS [^1] [^2] and have a look [^3] by following the steps in the Read / Analyze Output section below.
+
+
+
+<a name="Clusteringofsyntheticcontroldata-Read/AnalyzeOutput"></a>
+# Read / Analyze Output
+In order to read/analyze the output, you can use the [clusterdump](cluster-dumper.html)
+ utility provided by Mahout. If you just want to read the output, follow
+the steps below.
+
+1. Use <pre><code>$HADOOP_HOME/bin/hadoop fs -lsr output </code></pre>to view all
+outputs.
+1. Use <pre><code>$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
+</code></pre> to copy them all to your local machine; the output data points
+are in vector format. This creates an output folder inside the examples
+directory.
+1. Computed clusters are contained in <i>output/clusters-i</i>
+1. All result clustered points are placed into <i>output/clusteredPoints</i>
+
+[^1]: See [HDFS Shell ](http://hadoop.apache.org/core/docs/current/hdfs_shell.html)
+[^2]: The output directory is cleared when a new run starts so the results must be retrieved before a new run
+[^3]: All jobs run ClusterDump after clustering with output data sent to the console.
+

Added: mahout/site/trunk/content/clustering-seinfeld-episodes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/clustering-seinfeld-episodes.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/clustering-seinfeld-episodes.mdtext (added)
+++ mahout/site/trunk/content/clustering-seinfeld-episodes.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,5 @@
+Title: Clustering Seinfeld Episodes
+Below is a link to a short tutorial on how to cluster Seinfeld episode
+transcripts with Mahout.
+
+http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/

Added: mahout/site/trunk/content/clusteringyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/clusteringyourdata.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/clusteringyourdata.mdtext (added)
+++ mahout/site/trunk/content/clusteringyourdata.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,140 @@
+Title: ClusteringYourData
+*Mahout 0.4*
+
+After you've done the [Quickstart](quickstart.html)
+ and are familiar with the basics of Mahout, it is time to cluster your own
+data. 
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClusteringYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format
+(which has changed since Mahout 0.1)
+
+* See [Creating Vectors](creating-vectors.html)
+
+<a name="ClusteringYourData-TextPreparation"></a>
+## Text Preparation
+
+* See [Creating Vectors from Text](creating-vectors-from-text.html)
+*
+http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
+
+<a name="ClusteringYourData-RunningtheProcess"></a>
+# Running the Process
+
+<a name="ClusteringYourData-Canopy"></a>
+## Canopy
+
+Background: [canopy ](canopy-clustering.html)
+
+Documentation of running canopy from the command line: [canopy-commandline](canopy-commandline.html)
+
+<a name="ClusteringYourData-kMeans"></a>
+## kMeans
+
+Background: [K-Means Clustering](k-means-clustering.html)
+
+Documentation of running kMeans from the command line: [k-means-commandline](k-means-commandline.html)
+
+Documentation of running fuzzy kMeans from the command line: [fuzzy-k-means-commandline](fuzzy-k-means-commandline.html)
+
+<a name="ClusteringYourData-Dirichlet"></a>
+## Dirichlet
+
+Background: [dirichlet ](dirichlet-process-clustering.html)
+
+Documentation of running dirichlet from the command line: [dirichlet-commandline](dirichlet-commandline.html)
+
+<a name="ClusteringYourData-Mean-shift"></a>
+## Mean-shift
+
+Background:  [meanshift ](mean-shift-clustering.html)
+
+Documentation of running mean shift from the command line: [mean-shift-commandline](mean-shift-commandline.html)
+
+<a name="ClusteringYourData-LatentDirichletAllocation"></a>
+## Latent Dirichlet Allocation
+
+Background and documentation: [LDA](latent-dirichlet-allocation.html)
+
+Documentation of running LDA from the command line: [lda-commandline](lda-commandline.html)
+
+<a name="ClusteringYourData-RetrievingtheOutput"></a>
+# Retrieving the Output
+
+Mahout has a cluster dumper utility that can be used to retrieve and
+evaluate your clustering data.
+
+    ./bin/mahout clusterdump <OPTIONS>
+
+
+<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
+## The cluster dumper options are:
+
+      --help (-h)                            Print out help
+      --seqFileDir (-s) seqFileDir           The directory containing Sequence
+                                             Files for the Clusters
+      --output (-o) output                   The output file.  If not specified,
+                                             dumps to the console
+      --substring (-b) substring             The number of chars of the
+                                             asFormatString() to print
+      --pointsDir (-p) pointsDir             The directory containing points
+                                             sequence files mapping input vectors
+                                             to their cluster.  If specified,
+                                             then the program will output the
+                                             points associated with a cluster
+      --dictionary (-d) dictionary           The dictionary file.
+      --dictionaryType (-dt) dictionaryType  The dictionary file type
+                                             (text|sequencefile)
+      --numWords (-n) numWords               The number of top terms to print
+
+
+More information on using the clusterdump utility can be found [here](cluster-dumper.html).
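+
+For example, a sketch of an invocation that dumps clustering output along with
+the points assigned to each cluster (all paths, and the dictionary file name,
+are placeholders for whatever your own run produced):
+
+    ./bin/mahout clusterdump --seqFileDir output/clusters-10 \
+      --pointsDir output/clusteredPoints \
+      --dictionary output/dictionary.file-0 --dictionaryType sequencefile \
+      --numWords 10 --output clusteranalyze.txt
+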
+
+<a name="ClusteringYourData-ValidatingtheOutput"></a>
+# Validating the Output
+
+From Ted Dunning's response at
+http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
+
+> A principled approach to cluster evaluation is to measure how well the
+cluster membership captures the structure of unseen data.  A natural
+measure for this is to measure how much of the entropy of the data is
+captured by cluster membership.  For k-means and its natural L_2 metric,
+the natural cluster quality metric is the squared distance from the nearest
+centroid adjusted by the log_2 of the number of clusters.  This can be
+compared to the squared magnitude of the original data or the squared
+deviation from the centroid for all of the data.  The idea is that you are
+changing the representation of the data by allocating some of the bits in
+your original representation to represent which cluster each point is in. 
+If those bits aren't made up by the residue being small then your
+clustering is making a bad trade-off.
+
+> In the past, I have used other more heuristic measures as well.  One of the
+key characteristics that I would like to see out of a clustering is a
+degree of stability.  Thus, I look at the fractions of points that are
+assigned to each cluster or the distribution of distances from the cluster
+centroid. These values should be relatively stable when applied to held-out
+data.
+
+> For text, you can actually compute perplexity which measures how well
+cluster membership predicts what words are used.  This is nice because you
+don't have to worry about the entropy of real valued numbers.
+
+> Manual inspection and the so-called laugh test is also important.  The idea
+is that the results should not be so ludicrous as to make you laugh.
+Unfortunately, it is pretty easy to kid yourself into thinking your system
+is working using this kind of inspection.  The problem is that we are too
+good at seeing (making up) patterns.
+
+
+<a name="ClusteringYourData-References"></a>
+# References
+
+* [Mahout archive references](http://www.lucidimagination.com/search/p:mahout?q=clustering)

Added: mahout/site/trunk/content/collaborative-filtering-with-als-wr.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/collaborative-filtering-with-als-wr.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/collaborative-filtering-with-als-wr.mdtext (added)
+++ mahout/site/trunk/content/collaborative-filtering-with-als-wr.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,92 @@
+Title: Collaborative Filtering with ALS-WR
+<a name="CollaborativeFilteringwithALS-WR-Problemsetting"></a>
+### Problem setting
+
+In Collaborative Filtering we want to predict which items a user has
+not yet seen but might highly prefer. We assume that users
+give explicit ratings on a fixed scale (say from 1 to 5). We collect all
+these ratings in a matrix A. Only a minority of the cells of A are filled.
+In order to produce recommendations we need to find a way to predict a
+user's rating for an unseen item. This corresponds to filling the blanks
+in A. We will demonstrate the approach on a toy example.
+
+
+*rating matrix A (users x items)*
+
+<table>
+<tr><td> 5.00 </td><td> 5.00 </td><td> 2.00 </td><td> \- </td></tr>
+<tr><td> 2.00 </td><td> \- </td><td> 3.00 </td><td> 5.00 </td></tr>
+<tr><td> \- </td><td> 5.00 </td><td> \- </td><td> 3.00 </td></tr>
+<tr><td> 3.00 </td><td> \- </td><td> \- </td><td> 5.00 </td></tr>
+</table>
+
+
+<a name="CollaborativeFilteringwithALS-WR-Thelatentfactorapproach"></a>
+### The latent factor approach
+
+A very successful approach to this problem is the family of so-called _latent
+factor models_. These model the users and items as points in a k-dimensional
+feature space. An unknown rating can then simply be estimated by taking the
+dot product between the corresponding user and item feature vectors.
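+
+A worked form of this estimate (writing u_u for the feature vector of user u and
+m_i for the feature vector of item i; this notation is chosen here for
+illustration):
+
+    \hat{a}_{ui} = u_u \cdot m_i = \sum_{f=1}^{k} u_{uf} \, m_{if}
+
+which is exactly the (u, i) entry of the product U M'.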
+
+Mathematically speaking, we decompose *A* into two other matrices *U* and *M*
+whose combination is a good approximation of *A*.
+
+*user feature matrix U (users x features)*
+
+<table>
+<tr><td>1.12</td><td>1.49</td><td>0.48</td></tr>
+<tr><td>1.31</td><td>-0.52</td><td>0.59</td></tr>
+<tr><td>1.13</td><td>0.67</td><td>-0.52</td></tr>
+<tr><td>1.39</td><td>0.05</td><td>0.45</td></tr>
+</table>
+
+*item feature matrix M (items x features)*
+
+<table>
+<tr><td>1.81</td><td>1.62</td><td>0.74</td></tr>
+<tr><td>2.66</td><td>1.71</td><td>-1.08</td></tr>
+<tr><td>1.73</td><td>-0.23</td><td>0.78</td></tr>
+<tr><td>3.16</td><td>-0.24</td><td>0.90</td></tr>
+</table>
+
+We can then compute an approximation *A_k* of A which has all cells filled.
+Each previously empty cell now contains a predicted rating.
+
+*prediction matrix A_k (users x items)*
+
+*A_k = UM'*
+
+<table>
+<tr><td>4.78</td><td>4.98</td><td>1.97</td><td>3.61</td></tr>
+<tr><td>1.98</td><td>1.97</td><td>2.85</td><td>4.81</td></tr>
+<tr><td>2.75</td><td>4.71</td><td>1.40</td><td>2.94</td></tr>
+<tr><td>2.94</td><td>3.32</td><td>2.75</td><td>4.79</td></tr>
+</table>
+
+
+<a name="CollaborativeFilteringwithALS-WR-Findingthedecomposition"></a>
+### Finding the decomposition
+
+Mahout uses _Alternating Least Squares with Weighted Lambda-Regularization_
+to find the decomposition. Please refer to [Large-scale Parallel Collaborative Filtering for the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf)
+ (PDF) for details of the algorithm.
+
+<a name="CollaborativeFilteringwithALS-WR-Takingthisapproachintoproduction"></a>
+### Taking this approach into production
+
+In order to successfully apply this algorithm, a regularization parameter
+_lambda_ and the number of iterations to run have to be determined.
+Currently you have to figure these out manually using holdout tests.
+
+Mahout includes a toy example in [examples/bin/factorize-movielens-1M.sh](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/factorize-movielens-1M.sh)
+ that shows how to do a simple holdout test on the Movielens 1 million
+ratings dataset. It also shows how to compute top-N recommendations from
+the resulting factorization:
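+
+A minimal way to try it is to run that script directly; the argument below (the
+path to the downloaded MovieLens ratings.dat file) is an assumption, so check
+the script's usage message first:
+
+    cd $MAHOUT_HOME/examples/bin
+    ./factorize-movielens-1M.sh /path/to/ml-1m/ratings.dat
+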
+
+<a name="CollaborativeFilteringwithALS-WR-Hints"></a>
+### Hints
+
+Please be aware that ALS-WR is an iterative algorithm. Iterative algorithms
+currently show poor performance on Hadoop caused by the enormous overhead
+of scheduling and checkpointing each single iteration.

Added: mahout/site/trunk/content/collections.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/collections.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/collections.mdtext (added)
+++ mahout/site/trunk/content/collections.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,88 @@
+Title: Collections
+TODO: Organize these somehow, add one-line blurbs
+Organize by usage? (classification, recommendation etc.)
+
+<a name="Collections-CollectionsofCollections"></a>
+## Collections of Collections
+
+- [ML Data](http://mldata.org/about/)
+ ... repository supported by Pascal 2.
+- [DBPedia](http://wiki.dbpedia.org/Downloads30)
+- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/)
+- [http://mloss.org/community/blog/2008/sep/19/data-sources/](http://mloss.org/community/blog/2008/sep/19/data-sources/)
+- [Linked Library Data](http://ckan.net/group/lld)
+ via CKAN
+- [InfoChimps](http://infochimps.com/)
+ Free and purchasable datasets
+- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle](http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle)
+ LinkedIn discussion of lots of data sets
+
+<a name="Collections-CategorizationData"></a>
+## Categorization Data
+
+- [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/)
+- [RCV1 data set](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)
+- [10 years of CLEF Data](http://direct.dei.unipd.it/)
+- http://ece.ut.ac.ir/DBRG/Hamshahri/ (Approximately 160k categorized docs).
+There is a newer beta version here:
+http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (Approximately 320k categorized
+docs)
+- Lending Club loan data:
+https://www.lendingclub.com/info/download-data.action
+
+<a name="Collections-RecommendationData"></a>
+## Recommendation Data
+
+- [Netflix Prize/Dataset](http://www.netflixprize.com/download)
+- [Book usage and recommendation data from the University of Huddersfield](http://library.hud.ac.uk/data/usagedata/)
+- [Last.fm](http://denoiserthebetter.posterous.com/music-recommendation-datasets)
+ - Non-commercial use only
+- [Amazon Product Review Data via Jindal and Liu](http://www.cs.uic.edu/~liub/fbs/sentiment-analysis.html)
+ -- Scroll down
+- [GroupLens/MovieLens Movie Review Dataset](http://www.grouplens.org/node/73)
+
+<a name="Collections-MultilingualData"></a>
+## Multilingual Data
+
+- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php](http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php)
+ - 308,000 subtitle files covering about 18,900 movies in 59 languages
+(July 2006 numbers). This is a curated collection of subtitles from an
+aggregation site, http://www.openSubTitles.org.
+The original site, OpenSubtitles.org, is up to 1.6m subtitle files.
+- [Statistical Machine Translation](http://www.statmt.org/)
+ - devoted to all things language translation. Includes multilingual
+corpuses of European and Canadian legal tomes.
+
+<a name="Collections-Geospatial"></a>
+## Geospatial
+
+- [Natural Earth Data](http://www.naturalearthdata.com/)
+- [Open Street Maps](http://wiki.openstreetmap.org/wiki/Main_Page)
+And other crowd-sourced mapping data sites.
+
+<a name="Collections-Airline"></a>
+## Airline
+- [Open Flights](http://openflights.org/)
+ - Crowd-sourced database of airlines, flights, airports, times, etc.
+- [Airline on-time information - 1987-2008](http://stat-computing.org/dataexpo/2009/)
+ - 120m CSV records, 12G uncompressed
+
+<a name="Collections-GeneralResources"></a>
+## General Resources
+
+- [theinfo](http://theinfo.org/)
+- [WordNet](http://wordnet.princeton.edu/obtain)
+- [Common Crawl](http://www.commoncrawl.org/)
+ - freely available web crawl on EC2
+
+<a name="Collections-Stuff"></a>
+## Stuff
+- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html](http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html)
+- [4 Universities Data Set](http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)
+- [Large crawl of Twitter](http://an.kaist.ac.kr/traces/WWW2010.html)
+- [UniProt](http://beta.uniprot.org/)
+- [http://www.icwsm.org/2009/data/](http://www.icwsm.org/2009/data/)
+- http://data.gov
+- http://www.ckan.net/
+- http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world
+- http://data.gov.uk/

Added: mahout/site/trunk/content/collocations.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/collocations.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/collocations.mdtext (added)
+++ mahout/site/trunk/content/collocations.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,390 @@
+Title: Collocations
+<a name="Collocations-CollocationsinMahout"></a>
+# Collocations in Mahout
+
+A collocation is defined as a sequence of words or terms which co-occur
+more often than would be expected by chance. Statistically relevant
+combinations of terms identify additional lexical units which can be
+treated as features in a vector-based representation of a text. A detailed
+discussion of collocations can be found on wikipedia [1](http://en.wikipedia.org/wiki/Collocation)
+.
+ 
+
+<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a>
+## Log-Likelihood based Collocation Identification
+
+Mahout provides an implementation of a collocation identification algorithm
+which scores collocations using log-likelihood ratio. The log-likelihood
+score indicates the relative usefulness of a collocation with regard to other
+term combinations in the text. Collocations with the highest scores in a
+particular corpus will generally be more useful as features.
+
+Calculating the LLR is very straightforward and is described concisely in
+Ted Dunning's blog post [2](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
+. Ted describes the series of counts required to calculate the LLR for two
+events A and B in order to determine if they co-occur more often than pure
+chance. These counts include the number of times the events co-occur (k11),
+the number of times the events occur without each other (k12 and k21), and
+the number of times neither occurs (k22). These counts are summarized in the
+following table:
+
+<table>
+<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr>
+<tr><td> Event B </td><td> A and B together (k11) </td><td>  B but not A (k12) </td></tr>
+<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither B nor A (k22) </td></tr>
+</table>
+
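+A common way to write the resulting log-likelihood ratio statistic in terms of
+these counts is the G-test form, using the row sums, column sums and the total
+count N (this algebraic form follows Dunning's blog post referenced above rather
+than anything specific to Mahout's implementation):
+
+    LLR = 2 \sum_{i,j} k_{ij} \, \ln \frac{k_{ij} \, N}{(\text{row sum}_i)(\text{column sum}_j)},
+    \qquad N = k_{11} + k_{12} + k_{21} + k_{22}
+
+Pairs whose observed co-occurrence count k11 is much larger than the marginal
+counts would predict receive large scores.
+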
+For the purposes of collocation identification, it is useful to begin by
+thinking in word pairs, bigrams. In this case the leading or head term from
+the pair corresponds to A from the table above, B corresponds to the
+trailing or tail term, while neither B nor A is the total number of word
+pairs in the corpus less those containing B, A or both B and A.
+
+Given the word pair 'oscillation overthruster', the log-likelihood ratio
+is computed by looking at the number of occurrences of that word pair in the
+corpus, the number of word pairs that begin with 'oscillation' but end with
+something other than 'overthruster', the number of word pairs that end with
+'overthruster' but begin with something other than 'oscillation', and the number
+of word pairs in the corpus that contain neither 'oscillation' nor
+'overthruster'.
+
+This can be extended from bigrams to trigrams, 4-grams and beyond. In these
+cases, the current algorithm uses the first token of the ngram as the head
+of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it
+were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is
+treated as the head while 'kong cavaliers' is treated as the tail. Future
+versions of this algorithm will allow for variations in which tokens of the
+ngram are treated as the head and tail.
+
+Beyond ngrams, it is often useful to inspect cases where individual words
+occur around other interesting features of the text such as sentence
+boundaries.
+
+<a name="Collocations-GeneratingNGrams"></a>
+## Generating NGrams
+
+The tools in which the collocation identification algorithm is embedded
+either consume tokenized text as input or provide the ability to specify an
+implementation of the Lucene Analyzer class to perform tokenization in order
+to form ngrams. The tokens are passed through a Lucene ShingleFilter to
+produce NGrams of the desired length. 
+
+Given the text "Alice was beginning to get very tired" as an example,
+Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get',
+'very' and 'tired', while the ShingleFilter with a max NGram size set to 3
+produces the shingles 'alice beginning', 'alice beginning get', 'beginning
+get', 'beginning get very', 'get very', 'get very tired' and 'very tired'.
+Note that both bigrams and trigrams are produced here. A future enhancement
+to the existing algorithm would involve limiting the output to a particular
+gram size as opposed to solely specifying a max ngram size.
+
+<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a>
+## Running the Collocation Identification Algorithm.
+
+There are a couple of ways to run the LLR-based collocation algorithm in
+Mahout:
+
+<a name="Collocations-Whencreatingvectorsfromasequencefile"></a>
+### When creating vectors from a sequence file
+
+The llr collocation identifier is integrated into the process that is used
+to create vectors from sequence files of text keys and values. Collocations
+are generated when the --maxNGramSize (-ng) option is not specified and
+defaults to 2 or is set to a number of 2 or greater. The --minLLR option
+can be used to control the cutoff that prevents collocations below the
+specified LLR score from being emitted, and the --minSupport argument can
+be used to filter out collocations that appear below a certain number of
+times. 
+
+
+    bin/mahout seq2sparse
+    
+    Usage:
+     [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
+    <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFPercent
+    <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> --numReducers
+    <numReducers> --maxNGramSize <ngramSize> --overwrite --help
+    --sequentialAccessVector]
+    Options
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default
+                                          Value: 2
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
+      --output (-o) output                The output directory
+      --input (-i) input                  input dir containing the documents in
+                                          sequence file format
+      --minDF (-md) minDF                 The minimum document frequency. Default
+                                          is 1
+      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
+                                          Can be used to remove really high
+                                          frequency terms. Expressed as an integer
+                                          between 0 and 100. Default is 99.
+      --weight (-wt) weight               The kind of weight to use. Currently TF
+                                          or TFIDF
+      --norm (-n) norm                    The norm to use, expressed as either a
+                                          float or "INF" if you want to use the
+                                          Infinite norm.  Must be greater or equal
+                                          to 0.  The default is not to normalize
+      --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
+                                          Ratio(Float)  Default is 1.0
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc)
+                                          Default Value:2
+      --overwrite (-w)                    If set, overwrite the output directory
+      --help (-h)                         Print out help
+      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
+                                          be SequentialAccessVectors If set true
+                                          else false 
+
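+For example, a sketch of an invocation that emits bigram collocations while
+vectorizing (the input and output paths are placeholders, and the threshold
+values are arbitrary examples):
+
+    $MAHOUT_HOME/bin/mahout seq2sparse -i /path/to/text-seqfiles -o /path/to/vectors \
+      -ng 2 --minLLR 50 --minSupport 2 \
+      --analyzerName org.apache.lucene.analysis.standard.StandardAnalyzer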
+
+<a name="Collocations-CollocDriver"></a>
+### CollocDriver
+
+*TODO*
+
+
+    bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
+    
+    Usage:
+     [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite
+    --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers>
+    --analyzerName <analyzerName> --preprocess --unigram --help]
+    Options
+      --input (-i) input                  The Path for input files.
+      --output (-o) output                The Path write output to
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc)
+                                          Default Value:2
+      --overwrite (-w)                    If set, overwrite the output directory
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default
+                                          Value: 2
+      --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
+                                          Ratio(Float)  Default is 1.0
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+      --preprocess (-p)                   If set, input is SequenceFile<Text,Text>
+                                          where the value is the document, which
+                                          will be tokenized using the specified
+                                          analyzer.
+      --unigram (-u)                      If set, unigrams will be emitted in the
+                                          final output alongside collocations
+      --help (-h)                         Print out help
+
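+For example, a sketch of a direct CollocDriver run (the paths are placeholders
+and the threshold is an arbitrary example); -p tells it the input is
+SequenceFile<Text,Text> that should be tokenized, and -u emits unigrams as well:
+
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
+      -i /path/to/text-seqfiles -o /path/to/colloc-output \
+      -ng 3 -p -u --minLLR 30
+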
+
+<a name="Collocations-Algorithmdetails"></a>
+## Algorithm details
+
+This section describes the implementation of the collocation identification
+algorithm in terms of the map-reduce phases that are used to generate
+ngrams and count the frequencies required to perform the log-likelihood
+calculation. Unless otherwise noted, classes that are indicated in
+CamelCase can be found in the mahout-utils module under the package
+org.apache.mahout.utils.nlp.collocations.llr
+
+The algorithm is implemented in two map-reduce passes:
+
+<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a>
+### Pass 1: CollocDriver.generateCollocations(...)
+
+Generates NGrams and counts frequencies for ngrams, head and tail subgrams.
+
+<a name="Collocations-Map:CollocMapper"></a>
+#### Map: CollocMapper
+
+Input k: Text (documentId), v: StringTuple (tokens) 
+
+Each call to the mapper passes in the full set of tokens for the
+corresponding document using a StringTuple. The ShingleFilter is run across
+these tokens to produce ngrams of the desired length. Ngrams and
+frequencies are collected across the entire document.
+
+Once this is done, ngrams are split into head and tail portions. A key of type GramKey is generated which is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string n-gram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key will either be EMPTY in the case where we are collecting either the head or tail of an ngram as the value, or it will contain the byte[]
+ form of the ngram when collecting an ngram as the value.
+
+
+    head_key(EMPTY) -> (head subgram, head frequency)
+    head_key(ngram) -> (ngram, ngram frequency) 
+    tail_key(EMPTY) -> (tail subgram, tail frequency)
+    tail_key(ngram) -> (ngram, ngram frequency)
+
+
+subgram and ngram values are packaged in Gram objects.
+
+For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When
+the pass is complete, this counter will hold the total number of ngrams
+encountered in the input which is used as a part of the LLR calculation.
+
+Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with
+frequency)
+
+<a name="Collocations-Combiner:CollocCombiner"></a>
+#### Combiner: CollocCombiner
+
+Input k: GramKey, v:Gram (as above)
+
+This phase merges the counts for unique ngrams or ngram fragments across
+multiple documents. The combiner treats the entire GramKey as the key and
+as such, identical tuples from separate documents are passed into a single
+call to the combiner's reduce method, their frequencies are summed and a
+single tuple is passed out via the collector.
+
+Output k: GramKey, v:Gram
+
+<a name="Collocations-Reduce:CollocReducer"></a>
+#### Reduce: CollocReducer
+
+Input k: GramKey, v: Gram (as above)
+
+The CollocReducer employs the Hadoop secondary sort strategy to avoid
+caching ngram tuples in memory in order to calculate total ngram and
+subgram frequencies. The GramKeyPartitioner ensures that tuples with the
+same primary key are sent to the same reducer while the
+GramKeyGroupComparator ensures that the iterator provided by the reduce method
+first returns the subgram and then returns ngram values grouped by ngram.
+This eliminates the need to cache the values returned by the iterator in
+order to calculate total frequencies for both subgrams and ngrams. The
+input will consist of multiple frequencies for each (subgram_key, subgram)
+or (subgram_key, ngram) tuple; one from each map task executed in which the
+particular subgram was found.
+The input will be traversed in the following order:
+
+
+    (head subgram, frequency 1)
+    (head subgram, frequency 2)
+    ... 
+    (head subgram, frequency N)
+    (ngram 1, frequency 1)
+    (ngram 1, frequency 2)
+    ...
+    (ngram 1, frequency N)
+    (ngram 2, frequency 1)
+    (ngram 2, frequency 2)
+    ...
+    (ngram 2, frequency N)
+    ...
+    (ngram N, frequency 1)
+    (ngram N, frequency 2)
+    ...
+    (ngram N, frequency N)
+
+
+Where all of the ngrams above share the same head. Data is presented in the
+same manner for the tail subgrams.
+
+As the values for a subgram or ngram are traversed, frequencies are
+accumulated. Once all values for a subgram or ngram are processed the
+resulting key/value pairs are passed to the collector as long as the ngram
+frequency is equal to or greater than the specified minSupport. When an
+ngram is skipped in this way the Skipped.LESS_THAN_MIN_SUPPORT counter is
+incremented.
+
+Pairs are passed to the collector in the following format:
+
+
+    ngram, ngram frequency -> subgram subgram frequency
+
+
+In this manner, the output becomes an unsorted version of the following:
+
+
+    ngram 1, frequency -> ngram 1 head, head frequency
+    ngram 1, frequency -> ngram 1 tail, tail frequency
+    ngram 2, frequency -> ngram 2 head, head frequency
+    ngram 2, frequency -> ngram 2 tail, tail frequency
+    ngram N, frequency -> ngram N head, head frequency
+    ngram N, frequency -> ngram N tail, tail frequency
+
+
+Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
+frequency)
+
+<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a>
+### Pass 2: CollocDriver.computeNGramsPruneByLLR(...)
+
+Pass 1 has calculated full frequencies for ngrams and subgrams, Pass 2
+performs the LLR calculation.
+
+<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a>
+#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)
+
+This phase is a no-op. The data is passed through unchanged. The rest of
+the work for llr calculation is done in the reduce phase.
+
+<a name="Collocations-ReducePhase:LLRReducer"></a>
+#### Reduce Phase: LLRReducer
+
+Input is k:Gram, v:Gram (as above)
+
+This phase receives the head and tail subgrams and their frequencies for
+each ngram (with frequency) produced for the input:
+
+
+    ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
+    ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency
+    ...
+    ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency
+
+
+It also reads the full ngram count obtained from the first pass, passed in
+as a configuration option. The parameters to the llr calculation are
+calculated as follows:
+
+    k11 = f_n
+    k12 = f_h - f_n
+    k21 = f_t - f_n
+    k22 = N - ((f_h + f_t) - f_n)
+
+Where f_n is the ngram frequency, f_h and f_t the frequency of head and
+tail and N is the total number of ngrams.
+
+Ngrams with an LLR below the specified minimum LLR are dropped and
+the Skipped.LESS_THAN_MIN_LLR counter is incremented.
+
+Output is k: Text (ngram), v: DoubleWritable (llr score)
+
+<a name="Collocations-Unigrampass-through."></a>
+### Unigram pass-through.
+
+By default in seq2sparse, or if the -u option is provided to the
+CollocDriver, unigrams (single tokens) will be passed through the job and
+each token's frequency will be calculated. As with ngrams, unigrams are
+subject to filtering with minSupport and minLLR.
+
+<a name="Collocations-References"></a>
+## References
+
+[1] http://en.wikipedia.org/wiki/Collocation
+[2] http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
+
+
+<a name="Collocations-Discussion"></a>
+## Discussion
+
+* http://comments.gmane.org/gmane.comp.apache.mahout.user/5685 - Reuters
+example

Added: mahout/site/trunk/content/complementary-naive-bayes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/complementary-naive-bayes.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/complementary-naive-bayes.mdtext (added)
+++ mahout/site/trunk/content/complementary-naive-bayes.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,16 @@
+Title: Complementary Naive Bayes
+<a name="ComplementaryNaiveBayes-Introduction"></a>
+# Introduction
+
+See ([MAHOUT-60](http://issues.apache.org/jira/browse/MAHOUT-60)
+)
+
+
+
+
+
+<a name="ComplementaryNaiveBayes-OtherResources"></a>
+# Other Resources
+
+See [NaiveBayes](naivebayes.html)
+ ([MAHOUT-9](http://issues.apache.org/jira/browse/MAHOUT-9))

Added: mahout/site/trunk/content/converting-content.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/converting-content.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/converting-content.mdtext (added)
+++ mahout/site/trunk/content/converting-content.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,39 @@
+Title: Converting Content
+* [Intro](#ConvertingContent-Intro)
+* [SequenceFilesFrom*](#ConvertingContent-SequenceFilesFrom*)
+* [RegexConverterDriver](#ConvertingContent-RegexConverterDriver)
+
+<a name="ConvertingContent-Intro"></a>
+# Intro
+
+Mahout has some tools for converting content into formats more consumable
+for Mahout.  While they shouldn't be confused with a full ETL layer, they can
+be useful for things like converting text files and log files.	All of
+these can be accessed via the $MAHOUT_HOME/bin/mahout command line driver.
+
+<a name="ConvertingContent-SequenceFilesFrom*"></a>
+# SequenceFilesFrom*
+
+* SequenceFilesFromDirectory -- Converts a directory of text files to a
+SequenceFile where the key is the name of the file and the value is all of
+the text
+* SequenceFilesFromMailArchives -- Similar to Directory but converts mbox
+files.
+
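+For example, a sketch of converting a directory of plain text files with the
+SequenceFilesFromDirectory tool (the paths are placeholders; see the
+"Creating Vectors from Text" page for the full option list):
+
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+      --input /path/to/text/docs --output /path/to/seqfile-output
+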
+
+<a name="ConvertingContent-RegexConverterDriver"></a>
+# RegexConverterDriver
+
+Useful for converting things like log files from one format to another. 
+For instance, you could convert Solr log files containing query requests to
+a format consumable by [FrequentItemsetMining](frequentitemsetmining.html)
+
+For example, the following will extract queries from HTTP request logs to [Solr](http://lucene.apache.org)
+ and prepare them for use by Frequent Itemset Mining.
+
+    bin/mahout regexconverter --input /Users/grantingersoll/projects/content/lucid/lucidfind/logs \
+      --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite \
+      --transformerClass url --formatterClass fpg
+
+

Added: mahout/site/trunk/content/creating-vectors-from-text.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/creating-vectors-from-text.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/creating-vectors-from-text.mdtext (added)
+++ mahout/site/trunk/content/creating-vectors-from-text.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,162 @@
+Title: Creating Vectors from Text
+*Mahout 0.2*
+
+<a name="CreatingVectorsfromText-Introduction"></a>
+# Introduction
+
+For clustering documents it is usually necessary to convert the raw text
+into vectors that can then be consumed by the clustering [Algorithms](algorithms.html)
+.  These approaches are described below.
+
+<a name="CreatingVectorsfromText-FromLucene"></a>
+# From Lucene
+
+*NOTE: Your Lucene index must be created with the same version of Lucene
+used in Mahout.  Check Mahout's POM file to get the version number,
+otherwise you will likely get "Exception in thread "main"
+org.apache.lucene.index.CorruptIndexException: Unknown format version: -11"
+as an error.*
+
+Mahout has utilities that allow one to easily produce Mahout Vector
+representations from a Lucene (and Solr, since they are the same) index.
+
+For this, we assume you know how to build a Lucene/Solr index.	For those
+who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr)
+ as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
+index.	For those wanting to use just Lucene, see the Lucene [website](http://lucene.apache.org/java)
+ or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
+McCandless.
+
+To get started, make sure you get a fresh copy of Mahout from [SVN](buildingmahout.html)
+ and are comfortable building it. It defines interfaces and implementations
+for efficiently iterating over a Data Source (it only supports Lucene
+currently, but should be extensible to databases, Solr, etc.) and produces
+a Mahout Vector file and term dictionary which can then be used for
+clustering.   The main code for driving this is the Driver program located
+in the org.apache.mahout.utils.vectors package.  The Driver program offers
+several input options, which can be displayed by specifying the --help
+option.  Examples of running the Driver are included below:
+
+<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
+## Generating an output file from a Lucene Index
+
+
+    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
+       --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> \
+       --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
+       <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> \
+       <--idField <Name of the idField in the Lucene index>>
+
+
+<a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a>
+### Create 50 Vectors from an Index 
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
+        --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt --max 50
+
+This uses the index specified by --dir and the body field in it and writes
+out the info to the output dir and the dictionary to dict.txt.	It only
+outputs 50 vectors.  If you don't specify --max, then all the documents in
+the index are output.
+
+<a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe[L_2Norm](http://en.wikipedia.org/wiki/Lp_space)"></a>
+### Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index --field body \
+          --dictOut <PATH>/solr/wikipedia/dict.txt --output <PATH>/solr/wikipedia/out.txt \
+          --max 50 --norm 2
+
+
+<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
+# From Directory of Text documents
+Mahout has utilities to generate Vectors from a directory of text
+documents. Before creating the vectors, you need to convert the documents
+to SequenceFile format. SequenceFile is a Hadoop class which allows us to
+write arbitrary key/value pairs into it. The DocumentVectorizer requires the
+key to be a Text with a unique document id, and the value to be the Text
+content in UTF-8 format.
+
+You may find Tika (http://lucene.apache.org/tika) helpful in converting
+binary documents to text.
+
+<a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a>
+## Converting directory of documents to SequenceFile format
+Mahout has a nifty utility which reads a directory path, including its
+sub-directories, and creates the SequenceFile in a chunked manner for us.
+The document id generated is <PREFIX><RELATIVE PATH FROM
+PARENT>/document.txt
+
+From the examples directory run
+
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
+    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
+    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
+    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
+
+
+<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
+## Creating Vectors from SequenceFile
+
+*Mahout 0.3*
+
+From the sequence file generated from the above step run the following to
+generate vectors. 
+
+    $MAHOUT_HOME/bin/mahout seq2sparse \
+    -i <PATH TO THE SEQUENCEFILES> -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
+    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
+    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
+    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
+    <--minSupport <MINIMUM SUPPORT> 2> \
+    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
+    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
+    <--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \
+    <-seq <Create SequentialAccessVectors> {false|true, required for running some algorithms (LDA, Lanczos)}>
+
+
+--minSupport is the minimum frequency a word must have to be considered as a
+feature. --minDF is the minimum number of documents the word needs to appear in.
+--maxDFPercent is the maximum value of the expression (document frequency of a
+word / total number of documents) for the word to be considered a good feature
+for the document. This helps remove high-frequency features like stop words.
+
+<a name="CreatingVectorsfromText-Background"></a>
+# Background
+
+*
+http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
+*
+http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
+
+<a name="CreatingVectorsfromText-FromaDatabase"></a>
+# From a Database
+
++*TODO:*+
+
+<a name="CreatingVectorsfromText-Other"></a>
+# Other
+
+<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
+## Converting existing vectors to Mahout's format
+
+If you are in the happy position of already owning a document (as in: texts,
+images or whatever items you wish to treat) processing pipeline, the
+question arises of how to convert the vectors into the Mahout vector
+format. Probably the easiest way to go is to implement your own
+Iterable<Vector> (called VectorIterable in the example below) and then
+reuse the existing VectorWriter classes:
+
+
+    // SequenceFile.createWriter returns a SequenceFile.Writer; wrap it in a
+    // VectorWriter implementation (the exact class name may vary across versions).
+    SequenceFile.Writer writer = SequenceFile.createWriter(filesystem,
+        configuration, outfile, LongWritable.class, SparseVector.class);
+    VectorWriter vectorWriter = new SequenceFileVectorWriter(writer);
+    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+

Added: mahout/site/trunk/content/creating-vectors.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/creating-vectors.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/creating-vectors.mdtext (added)
+++ mahout/site/trunk/content/creating-vectors.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,6 @@
+Title: Creating Vectors
+<a name="CreatingVectors-UtilitiesforCreatingVectors"></a>
+# Utilities for Creating Vectors
+
+1. [Text](creating-vectors-from-text.html)
+1. [ARFF](creating-vectors-from-weka's-arff-format.html)

Added: mahout/site/trunk/content/data-formats.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/data-formats.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/data-formats.mdtext (added)
+++ mahout/site/trunk/content/data-formats.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,84 @@
+Title: Data Formats
+Mahout uses a few file formats quite a bit in its various job
+implementations.
+   * [File formats](#DataFormats-Fileformats)
+      * [Raw formats for import](#DataFormats-Rawformatsforimport)
+      * [Raw formats for export](#DataFormats-Rawformatsforexport)
+   * [Who Stores What in a SequenceFile?](#DataFormats-WhoStoresWhatinaSequenceFile?)
+      * ["Simple" Text Vectors](#DataFormats-"Simple"TextVectors)
+      * [Encoded Text Vectors](#DataFormats-EncodedTextVectors)
+      * [Directories](#DataFormats-Directories)
+      * [Matrices](#DataFormats-Matrices)
+      * [Clusters](#DataFormats-Clusters)
+      * [FPGrowth Clusters](#DataFormats-FPGrowthClusters)
+   * [Life cycle](#DataFormats-Lifecycle)
+<a name="DataFormats-Fileformats"></a>
+## File formats
+<a name="DataFormats-Rawformatsforimport"></a>
+### Raw formats for import
+* Text files
+** can be parsed into SequenceFiles of:
+*** (line number, text of line)
+*** (file name, full contents of file)
+*** (line number, parts of line extracted with regex patterns)
+** can also be parsed into Lucene indexes:
+*** _precise index design ???_
+* ARFF files
+** Weka project text data format
+** Parsed into SequenceFile of <Int,Vector>
+* Mailbox files
+** can be parsed into SequenceFiles of:
+*** (mail message id, text body of mail message)
+*** no html or attachment support
+* CSV files
+** generally without column or row headers
+** no "multiple values per column" options
+* Hadoop SequenceFile
+** canonical, no variations. Currently no use of metadata.
+* Lucene indexes
+** translated into SequenceFiles
+*** _precise index design ???_
+
+<a name="DataFormats-Rawformatsforexport"></a>
+### Raw formats for export
+* SequenceFiles
+* Text lines, mostly of the toString() variety
+* MatrixWritable for ConfusionMatrix
+* CSV for MatrixWritable
+* A special CSV format for Clusters
+* [GraphML XML](http://graphml.graphdrawing.org/) for Clusters
+
+<a name="DataFormats-WhoStoresWhatinaSequenceFile?"></a>
+## Who Stores What in a SequenceFile? 
+<a name="DataFormats-"Simple"TextVectors"></a>
+### "Simple" Text Vectors
+Simple text vectors represent documents. The dimensions are the set of
+terms in the entire document set. Each document vector stores a number in
+the position of each term it contains. This number may be derived from the
+count of the term inside the document.
+<a name="DataFormats-EncodedTextVectors"></a>
+### Encoded Text Vectors
+Each vector again represents a document. However, the term dimensions are
+"collapsed" stochastically: each term in the full term set is hashed to a few
+positions in a much smaller, fixed-size index space.
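+
+A hand-rolled sketch of the idea, for illustration only (Mahout's own encoders
+live in the org.apache.mahout.vectorizer.encoders package and differ in
+detail; the cardinality and probe count here are arbitrary):
+
+    // Each term is hashed to a few positions ("probes") of a small fixed-size
+    // vector; hash collisions are tolerated rather than avoided.
+    // Needs: org.apache.mahout.math.Vector, org.apache.mahout.math.RandomAccessSparseVector
+    Vector encode(Iterable<String> terms, int cardinality, int probes) {
+      Vector v = new RandomAccessSparseVector(cardinality);
+      for (String term : terms) {
+        for (int p = 0; p < probes; p++) {
+          int hash = (term + "#" + p).hashCode() & Integer.MAX_VALUE;
+          int index = hash % cardinality;
+          v.set(index, v.get(index) + 1.0);   // accumulate the term weight
+        }
+      }
+      return v;
+    }
+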
+<a name="DataFormats-Directories"></a>
+### Directories
+<Integer,Text> pairs which match matrix rows to input text keys like movie
+names, document file names etc. These are made by RowIdJob.
+<a name="DataFormats-Matrices"></a>
+### Matrices
+Matrices are almost universally stored as LongWritable/VectorWritable
+pairs, where VectorWritable can be sparse or dense.
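+
+A small sketch for inspecting such a file with the plain Hadoop API (the key
+class varies by job, so it is instantiated reflectively; fs, path and conf are
+assumed to be an already-configured FileSystem, Path and Configuration):
+
+    // Dump each row of a Mahout matrix SequenceFile to stdout.
+    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
+    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
+    VectorWritable value = new VectorWritable();
+    while (reader.next(key, value)) {
+      System.out.println(key + " => " + value.get().asFormatString());
+    }
+    reader.close();
+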
+<a name="DataFormats-Clusters"></a>
+### Clusters
+Clusters are stored in complex data structures.
+<a name="DataFormats-FPGrowthClusters"></a>
+### FPGrowth Clusters
+These are stored in a custom data structure.
+
+<a name="DataFormats-Lifecycle"></a>
+## Life cycle
+Mahout jobs generally treat the files they generate as transient. All
+Writable formats may change, and some may disappear. There are no file
+compatibility requirements.

Added: mahout/site/trunk/content/data-processing.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/data-processing.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/data-processing.mdtext (added)
+++ mahout/site/trunk/content/data-processing.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,25 @@
+Title: Data Processing
+Mahout has several programs which perform useful operations on the formats
+used in Mahout. These are Mahout jobs, mostly map/reduce capable.
+<a name="DataProcessing-Mathoperations"></a>
+### Math operations
+* bin/mahout transpose
+** transpose a matrix
+* bin/mahout matrixmult
+** multiply two or more matrices
+* bin/mahout svd
+** Lanczos algorithm for SVD
+* bin/mahout ssvd
+** Stochastic SVD
+* bin/mahout cleansvd
+** Validate and clean the raw svd output (runs the EigenVerificationJob)
+
+*Note:* many of these have options to only create some outputs, or to limit
+the accuracy of outputs in the name of efficiency and tractability.
+<a name="DataProcessing-Otheroperations"></a>
+### Other operations
+* bin/mahout split
+** Split an input set into training and test parts
+** Can take text inputs or SequenceFiles
+* bin/mahout splitDataset
+** Split a recommender rating dataset into training and test parts

Added: mahout/site/trunk/content/database-integrations.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/database-integrations.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/database-integrations.mdtext (added)
+++ mahout/site/trunk/content/database-integrations.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,6 @@
+Title: Database Integrations
+The Recommender suite has a few database integrations for reading
+preference data: JDBC, MongoDB and Cassandra. There are also integrations
+for SlopeOne delta data, including subclasses for various databases.
+Otherwise there are no explicit JDBC or NoSQL integrations in Mahout. File
+format drivers for Hadoop should work with Mahout.

Added: mahout/site/trunk/content/developer-resources.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/developer-resources.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/developer-resources.mdtext (added)
+++ mahout/site/trunk/content/developer-resources.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,42 @@
+Title: Developer Resources
+<a name="DeveloperResources-MakingaContribution"></a>
+## Making a Contribution
+
+Mahout is always looking for contributions, especially in the area of
+documentation. See this wiki page for details on contributing: [How to contribute](how-to-contribute.html)
+
+We provide a list of JIRA issues that are aimed at new contributors and
+should be fairly easy to start on; these are labeled with
+[MAHOUT_INTRO_CONTRIBUTE](https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=labels%20%3D%20MAHOUT_INTRO_CONTRIBUTE).
+
+<a name="DeveloperResources-SourceCode"></a>
+## Source Code
+
+The source files are stored in Subversion (see
+[http://subversion.tigris.org/](http://subversion.tigris.org/) and
+[http://svnbook.red-bean.com/](http://svnbook.red-bean.com/)).
+
+To check out the code:
+
+    svn checkout http://svn.apache.org/repos/asf/mahout/trunk mahout/trunk
+
+<a name="DeveloperResources-Documentation"></a>
+## Documentation
+
+Javadoc and other reference documentation is available online:
+
+* [Latest Javadoc](https://builds.apache.org/hudson/job/Mahout-Quality/javadoc/)
+* [Data Formats](https://cwiki.apache.org/confluence/display/MAHOUT/Data+Formats)
+* [File Format Integrations](https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations)
+* [Database Integrations](https://cwiki.apache.org/confluence/display/MAHOUT/Database+Integrations)
+
+<a name="DeveloperResources-Issues"></a>
+## Issues
+
+All bugs, improvements, patches, etc. should be logged in [JIRA](https://issues.apache.org/jira/browse/MAHOUT).
+
+<a name="DeveloperResources-ContinuousIntegration"></a>
+## Continuous Integration
+
+Mahout is continuously built on an hourly basis by the Apache [Jenkins](https://builds.apache.org/job/Mahout-Quality/) build system.

Added: mahout/site/trunk/content/dimensional-reduction.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/dimensional-reduction.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/dimensional-reduction.mdtext (added)
+++ mahout/site/trunk/content/dimensional-reduction.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,457 @@
+Title: Dimensional Reduction
+Matrix algebra underpins the way many Big Data algorithms and data
+structures are composed: full-text search can be viewed as doing matrix
+multiplication of the term-document matrix by the query vector (giving a
+vector over documents where the components are the relevance scores);
+computing co-occurrences in a collaborative filtering context (people who
+viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest)
+amounts to squaring the user-item interaction matrix; finding users who are
+k degrees separated from each other in a social network or web graph comes
+down to looking at the k-fold product of the graph adjacency matrix; and the
+list goes on (and these are all cases where the linear structure of the
+matrix is preserved!).
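+
+As a concrete sketch of the first example (the orientation of rows and
+columns is a convention, not something fixed above): with a term-document
+matrix *D* (terms as rows, documents as columns) and a query vector *q* over
+terms,
+
+    scores = D^t * q
+
+is a vector with one relevance score per document, up to whatever term
+weighting scheme is applied.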
+
+Each of these examples deals with matrices which tend to be tremendously
+large (often millions to tens of millions to hundreds of millions of rows or
+more, by sometimes a comparable number of columns), but also rather sparse.
+Sparse matrices are nice in some respects: dense matrices which are 10^7 on a
+side would have 100 trillion non-zero entries! But the sparsity is often
+problematic, because any given two rows (or columns) of the matrix may have
+zero overlap. Additionally, any machine-learning work done on the data which
+comprises the rows has to deal with what is known as "the curse of
+dimensionality": for example, there are too many columns to train most
+regression or classification problems on them independently.
+
+One of the more useful approaches to dealing with such huge sparse data sets
+is dimensionality reduction, where a lower-dimensional space of the original
+column (feature) space of your data is found or constructed, and your rows
+are mapped into that subspace (or sub-manifold).  In this reduced-dimensional
+space, the components "important" to the distance between points are
+exaggerated and the unimportant ones washed away, and additionally, the
+sparsity of your rows is traded for drastically lower-dimensional but dense
+"signatures". While this loss of sparsity can lead to its own complications,
+a proper dimensionality reduction can help reveal the most important features
+of your data, expose correlations among your supposedly independent original
+variables, and smooth over the zeroes in your correlation matrix.
+
+One of the most straightforward techniques for dimensionality reduction is
+the matrix decomposition: singular value decomposition, eigen
+decomposition, non-negative matrix factorization, etc. In their truncated
+form these decompositions are an excellent first approach toward linearity
+preserving unsupervised feature selection and dimensional reduction. Of
+course, sparse matrices which don't fit in RAM need special treatment as
+far as decomposition is concerned. Parallelizable and/or stream-oriented
+algorithms are needed.
+
+<a name="DimensionalReduction-SingularValueDecomposition"></a>
+# Singular Value Decomposition
+
+Mahout currently provides (as of 0.3, the first release with MAHOUT-180
+applied) two scalable implementations of SVD: a stream-oriented
+implementation using the Asymmetric Generalized Hebbian Algorithm outlined in
+Genevieve Gorrell & Brandyn Webb's paper ([Gorrell and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf.html)),
+and a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm)
+implementation, available both single-threaded in the
+o.a.m.math.decomposer.lanczos package (math module) and as a Hadoop
+map-reduce (series of) job(s) in the o.a.m.math.hadoop.decomposer package
+(core module). Coming soon: stochastic decomposition.
+
+See also:
+
+* https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
+
+<a name="DimensionalReduction-Lanczos"></a>
+## Lanczos
+
+The Lanczos algorithm is designed for eigen-decomposition, but like any
+such algorithm, getting singular vectors out of it is immediate (singular
+vectors of matrix A are just the eigenvectors of A^t * A or A * A^t). 
+Lanczos works by taking a starting seed vector *v* (with cardinality equal
+to the number of columns of the matrix A), and repeatedly multiplying A by
+the result: *v'* = A.times(*v*) (and then subtracting off what is
+proportional to previous *v'*'s, and building up an auxiliary matrix of
+projections).  In the case where A is not square (in general: not
+symmetric), then you actually want to repeatedly multiply A*A^t by *v*:
+*v'* = (A * A^t).times(*v*), or equivalently, in Mahout,
+A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
+the order of summation in A*A^t.times(*v*), you can do the same computation
+as one pass over the rows of A instead of two).
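+
+A single-machine sketch of what timesSquared computes, using only the
+org.apache.mahout.math Vector API (the distributed job does the same work
+row-by-row across mappers):
+
+    // Computes (A^t * A).times(v) in one pass over the rows of A:
+    // each row a_i contributes (a_i . v) * a_i to the result.
+    Vector timesSquared(Iterable<Vector> rowsOfA, Vector v, int numCols) {
+      Vector result = new DenseVector(numCols);
+      for (Vector row : rowsOfA) {
+        result = result.plus(row.times(row.dot(v)));
+      }
+      return result;
+    }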
+
+After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*- by -*k*
+tridiagonal matrix has been created (the auxiliary matrix mentioned above),
+out of which a good (often extremely good) approximation to *k* of the
+singular values (and with the basis spanned by the *v_i*, the *k* singular
+*vectors* may also be extracted) of A may be efficiently extracted.  Which
+*k*?  It's actually a spread across the entire spectrum: the first few will
+most certainly be the largest singular values, and the bottom few will be
+the smallest, but you have no guarantee that just because you have the n'th
+largest singular value of A, that you also have the (n-1)'st as well.  A
+good rule of thumb is to try and extract out the top 3k singular vectors
+via Lanczos, and then discard the bottom two thirds, if you want primarily
+the largest singular values (which is the case for using Lanczos for
+dimensional reduction).
+
+<a name="DimensionalReduction-ParallelizationStragegy"></a>
+### Parallelization Strategy
+
+Lanczos is "embarassingly parallelizable": matrix multiplication of a
+matrix by a vector may be carried out row-at-a-time without communication
+until at the end, the results of the intermediate matrix-by-vector outputs
+are accumulated on one final vector.  When it's truly A.times(*v*), the
+final accumulation doesn't even have collision / synchronization issues
+(the outputs are individual separate entries on a single vector), and
+multicore approaches can be very fast, and there should also be tricks to
+speed things up on Hadoop.  In the asymmetric case, where the operation is
+A.timesSquared(*v*), the accumulation does require synchronization (the
+vectors to be summed have nonzero elements all across their range), but
+delaying writing to disk until Mapper close(), and remembering that having
+a Combiner be the same as the Reducer, the bottleneck in accumulation is
+nowhere near a single point.
+
+<a name="DimensionalReduction-Mahoutusage"></a>
+### Mahout usage
+
+The Mahout DistributedLanczosSolver is invoked by the
+<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
+arguments (which can be reproduced by just entering the command with no
+arguments):
+
+
+    Job-Specific Options:
+      --input (-i) input                          Path to job input directory.
+      --output (-o) output                        The directory pathname for output.
+      --numRows (-nr) numRows                     Number of rows of the input matrix
+      --numCols (-nc) numCols                     Number of columns of the input matrix
+      --rank (-r) rank                            Desired decomposition rank (note:
+                                                  only roughly 1/4 to 1/3 of these
+                                                  will have the top portion of the
+                                                  spectrum)
+      --symmetric (-sym) symmetric                Is the input matrix square and
+                                                  symmetric?
+      --cleansvd (-cl) cleansvd                   Run the EigenVerificationJob to
+                                                  clean the eigenvectors after SVD
+      --maxError (-err) maxError                  Maximum acceptable error
+      --minEigenvalue (-mev) minEigenvalue        Minimum eigenvalue to keep the
+                                                  vector for
+      --inMemory (-mem) inMemory                  Buffer eigen matrix into memory
+                                                  (if you have enough!)
+      --help (-h)                                 Print out help
+      --tempDir tempDir                           Intermediate output directory
+      --startPhase startPhase                     First phase to run
+      --endPhase endPhase                         Last phase to run
+
+
+The short form invocation may be used to perform the SVD on the input data: 
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \   
+      --output (-o) <The directory pathname for output> \	
+      --numRows (-nr) <Number of rows of the input matrix> \   
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric>    
+
+
+The --input argument is the location on HDFS of the
+SequenceFile<Writable,VectorWritable> (preferably containing
+SequentialAccessSparseVector instances) which you wish to decompose. Each of
+its vectors has --numCols entries. --numRows is the number of input rows and
+is used to properly size the matrix data structures.
+
+After execution, the --output directory will contain a file named
+"rawEigenvectors" holding the raw eigenvectors. Because the
+DistributedLanczosSolver sometimes produces "extra" eigenvectors whose
+eigenvalues aren't valid, and also scales all of the eigenvalues down by the
+max eigenvalue (to avoid floating point overflow), there is an additional
+step which emits the correctly scaled (and non-spurious) eigenvector/value
+pairs. This is done by the "cleansvd" shell script step (c.f.
+EigenVerificationJob).
+
+If you have run the short form svd invocation above and require this
+"cleaning" of the eigen/singular output, you can run "cleansvd" as a
+separate command:
+
+      <MAHOUT_HOME>/bin/mahout cleansvd \
+      --eigenInput <path to raw eigenvectors> \
+      --corpusInput <path to corpus> \
+      --output <path to output directory> \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default
+false>
+
+
+The --corpusInput is the input path from the previous step, --eigenInput is
+the output from the previous step (<output>/rawEigenvectors), and --output is
+the desired output path (same as the svd argument). The two "cleaning"
+parameters are --maxError, the maximum allowed 1-cosAngle(v,
+A.timesSquared(v)), and --minEigenvalue. Eigenvectors which have too large an
+error, or too small an eigenvalue, are discarded. Optional argument:
+--inMemory; if you have enough memory on your local machine (not on the
+hadoop cluster nodes!) to load all eigenvectors into memory at once (at least
+8 bytes/double * rank * numCols), then you will see some speedups on this
+cleaning process.
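+
+As a rough sizing example using the numbers from the EMR walkthrough below
+(rank 100, numCols 20444):
+
+    8 bytes/double * 100 * 20444  =  16,355,200 bytes  (about 16 MB of heap)
+
+so --inMemory is usually cheap at modest ranks, but the requirement grows
+linearly with both the rank and the number of columns.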
+
+After execution, the --output directory will have a file named
+"cleanEigenvectors" containing the clean eigenvectors. 
+
+These two steps can also be invoked together by the svd command by using
+the long form svd invocation:
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \   
+      --output (-o) <The directory pathname for output> \	
+      --numRows (-nr) <Number of rows of the input matrix> \   
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric> \  
+      --cleansvd "true"   \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default
+false>
+
+
+After execution, the --output directory will contain two files: the
+"rawEigenvectors" and the "cleanEigenvectors".
+
+TODO: also allow exclusion based on improper orthogonality (currently
+computed, but not checked against constraints).
+
+<a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a>
+#### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce
+
+This section walks you through a complete example of running the Mahout SVD
+job on an Amazon Elastic MapReduce cluster and then preparing the output to
+be used for clustering. This example was developed as part of the effort to
+benchmark Mahout's clustering algorithms using a large document set (see
+[MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)).
+Specifically, we use the ASF mail archives located at
+http://aws.amazon.com/datasets/7791434387204566.  You will likely need to
+run seq2sparse on these first.  See
+$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of
+processing this data.
+
+At a high level, the steps we're going to perform are:
+
+    bin/mahout svd (original -> svdOut)
+    bin/mahout cleansvd ...
+    bin/mahout transpose svdOut -> svdT
+    bin/mahout transpose original -> originalT
+    bin/mahout matrixmult originalT svdT -> newMatrix
+    bin/mahout kmeans newMatrix
+
+The bulk of the content for this section was extracted from the Mahout user
+mailing list; see [Using SVD with Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8)
+and [Need a little help with using SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928).
+
+Note: Some of this work is due in part to credits donated by the Amazon
+Elastic MapReduce team.
+
+<a name="DimensionalReduction-1.LaunchEMRCluster"></a>
+##### 1. Launch EMR Cluster
+
+For a detailed explanation of the steps involved in launching an Amazon
+Elastic MapReduce cluster for running Mahout jobs, please read the
+"Building Vectors for Large Document Sets" section of [Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce)
+.
+
+In the remaining steps below, remember to replace JOB_ID with the Job ID of
+your EMR cluster.
+
+<a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a>
+##### 2. Load Mahout 0.5+ JAR into S3
+
+These steps were created with mahout-0.5-SNAPSHOT because they rely on the
+patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
+
+<a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a>
+##### 3. Copy TFIDF Vectors into HDFS
+
+Before running your SVD job on the vectors, you need to copy them from S3
+to your EMR cluster's HDFS.
+
+
+    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
+      --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
+      --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      -j JOB_ID
+
+
+<a name="DimensionalReduction-4.RuntheSVDJob"></a>
+##### 4. Run the SVD Job
+
+Now you're ready to run the SVD job on the vectors stored in HDFS:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg svd \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      --arg -o --arg /asf-mail-archives/mahout/svd \
+      --arg --rank --arg 100 \
+      --arg --numCols --arg 20444 \
+      --arg --numRows --arg 6076937 \
+      --arg --cleansvd --arg "true" \
+      -j JOB_ID
+
+
+This will run 100 iterations of the LanczosSolver SVD job to produce 87
+eigenvectors in:
+
+
+    /asf-mail-archives/mahout/svd/cleanEigenvectors
+
+
+Only 87 eigenvectors were produced because of the cleanup step, which removes
+any duplicate eigenvectors caused by convergence issues and numeric overflow,
+as well as any that don't appear to be "eigen" enough (i.e., they don't
+satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix
+
+<a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a>
+##### 5. Transform your TFIDF Vectors into Mahout Matrix
+
+The tfidf vectors created by the seq2sparse job are
+SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
+vectors into a matrix form that is a
+SequenceFile<IntWritable,VectorWritable> and a
+SequenceFile<IntWritable,Text> (where the original one is the join of these
+new ones, on the new int key).
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg rowid \
+      --arg -Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      --arg -Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix \
+      -j JOB_ID
+
+
+This is not a distributed job and will only run on the master server in
+your EMR cluster. The job produces the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
+
+
+where docIndex is the SequenceFile<IntWritable,Text> and matrix is
+SequenceFile<IntWritable,VectorWritable>.
+
+<a name="DimensionalReduction-6.TransposetheMatrix"></a>
+##### 6. Transpose the Matrix
+
+Our ultimate goal is to multiply the TFIDF vector matrix by our SVD
+eigenvectors. For the mathematically inclined: from the rowid job, we now
+have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E is
+p x n (p=87, n=20444). To multiply these two matrices, we need to transpose E
+so that the number of columns in T equals the number of rows in E^T (i.e. E^T
+is n x p); the result of the matrixmult is then an m x p matrix (m=6076937,
+p=87).
+
+However, in practice, computing the matrix product of two matrices as a
+map-reduce job is efficiently done as a map-side join on two row-based
+matrices with the same number of rows, where only the columns differ.	In
+particular, if you take a matrix X which is represented as a set of numRowsX
+rows, each of which has numColsX entries, and another matrix Y with numRowsY
+== numRowsX rows, each of which has numColsY (!= numColsX) entries, then by
+summing the outer products of each of the numRowsX pairs of vectors, you get
+a matrix with numRowsZ == numColsX and numColsZ == numColsY (if you instead
+take the reverse outer product of the vector pairs, you end up with the
+transpose of this result, with numRowsZ == numColsY and numColsZ ==
+numColsX). - Jake Mannix
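+
+In matrix form, the map-side join described above is just (a sketch of the
+algebra, with dimensions taken from this example):
+
+    sum over the shared rows i of outer(x_i, y_i)  =  X^t * Y
+
+    here X = T^t (20444 x 6076937) and Y = E^t (20444 x 87), so the matrixmult
+    job in step 8 yields T * E^t, a 6076937 x 87 matrix of reduced-dimension
+    document vectors.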
+
+Thus, we need to transpose the matrix using Mahout's Transpose Job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \
+      --arg --numRows --arg 6076937 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      -j JOB_ID
+
+
+This job requires the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose
+
+
+<a name="DimensionalReduction-7.TransposeEigenvectors"></a>
+##### 7. Transpose Eigenvectors
+
+If you followed Jake's explanation in step 6 above, then you know that we
+also need to transpose the eigenvectors:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \
+      --arg --numRows --arg 87 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+Note: You need to use the same number of reducers that was used for
+transposing the matrix you are multiplying the vectors with.
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/svd/transpose
+
+
+<a name="DimensionalReduction-8.MatrixMultiplication"></a>
+##### 8. Matrix Multiplication
+
+Lastly, we need to multiply the transposed vectors using Mahout's
+matrixmult job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg matrixmult \
+      --arg --numRowsA --arg 20444 \
+      --arg --numColsA --arg 6076937 \
+      --arg --numRowsB --arg 20444 \
+      --arg --numColsB --arg 87 \
+      --arg --inputPathA --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+This job produces output such as:
+
+
+    /user/hadoop/productWith-189
+
+
+<a name="DimensionalReduction-Resources"></a>
+# Resources
+
+* http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm
+* http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

Added: mahout/site/trunk/content/dirichlet-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/dirichlet-commandline.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/dirichlet-commandline.mdtext (added)
+++ mahout/site/trunk/content/dirichlet-commandline.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,91 @@
+Title: dirichlet-commandline
+<a name="dirichlet-commandline-RunningDirichletProcessClusteringfromtheCommandLine"></a>
+# Running Dirichlet Process Clustering from the Command Line
+Mahout's Dirichlet clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both point to an
+operating Hadoop cluster on the target machine, then the invocation will run
+Dirichlet on that cluster. If either of the environment variables is missing,
+then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout dirichlet <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain the
+Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="dirichlet-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on a single machine without a cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
+
+
+<a name="dirichlet-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="dirichlet-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                          Path to job input directory.
+                                                  Must be a SequenceFile of
+                                                  VectorWritable
+      --output (-o) output                        The directory pathname for
+                                                  output.
+      --overwrite (-ow)                           If present, overwrite the
+                                                  output directory before
+                                                  running job
+      --modelDistClass (-md) modelDistClass       The ModelDistribution class
+                                                  name. Defaults to
+                                                  NormalModelDistribution
+      --modelPrototypeClass (-mp) prototypeClass  The ModelDistribution
+                                                  prototype Vector class name.
+                                                  Defaults to
+                                                  RandomAccessSparseVector
+      --maxIter (-x) maxIter                      The maximum number of
+                                                  iterations.
+      --alpha (-m) alpha                          The alpha0 value for the
+                                                  DirichletDistribution.
+                                                  Defaults to 1.0
+      --k (-k) k                                  The number of clusters to
+                                                  create
+      --help (-h)                                 Print out help
+      --maxRed (-r) maxRed                        The number of reduce tasks.
+                                                  Defaults to 2
+      --clustering (-cl)                          If present, run clustering
+                                                  after the iterations have
+                                                  taken place
+      --emitMostLikely (-e) emitMostLikely        True if clustering should
+                                                  emit the most likely point
+                                                  only, false for threshold
+                                                  clustering. Default is true
+      --threshold (-t) threshold                  The pdf threshold used for
+                                                  cluster determination.
+                                                  Default is 0