Posted to commits@mahout.apache.org by sr...@apache.org on 2012/07/12 11:26:03 UTC

svn commit: r1360593 [13/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...

Added: mahout/site/trunk/content/itembased-collaborative-filtering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/itembased-collaborative-filtering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/itembased-collaborative-filtering.mdtext (added)
+++ mahout/site/trunk/content/itembased-collaborative-filtering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,122 @@
+Title: Itembased Collaborative Filtering
+Itembased Collaborative Filtering is a popular way of doing Recommendation
+Mining.
+
+<a name="ItembasedCollaborativeFiltering-Terminology"></a>
+### Terminology
+
+We have *users* that interact with *items* (which can be pretty much
+anything like books, videos, news, other users,...). Those users express
+*preferences* towards the items which can either be boolean (just modelling
+that a user likes an item) or numeric (by having a rating value assigned to
+the preference). Typically only a small number of preferences is known for
+each single user.
+
+<a name="ItembasedCollaborativeFiltering-Algorithmicproblems"></a>
+### Algorithmic problems
+
+Collaborative Filtering algorithms aim to solve the *prediction* problem
+where the task is to estimate the preference of a user towards an item
+which he/she has not yet seen.
+
+Once an algorithm can predict preferences it can also be used to do
+*Top-N-Recommendation* where the task is to find the N items a given user
+might like best. This is usually done by isolating a set of candidate
+items, computing the predicted preferences of the given user towards them
+and returning the highest scoring ones.
+
+If we look at the problem from a mathematical perspective, a
+*user-item-matrix* is created from the preference data and the task is to
+predict the missing entries by finding patterns in the known entries.
+
+<a name="ItembasedCollaborativeFiltering-ItembasedCollaborativeFiltering"></a>
+### Itembased Collaborative Filtering
+
+A popular approach called "Itembased Collaborative Filtering" estimates a
+user's preference towards an item by looking at his/her preferences towards
+similar items. Note that in this context similarity means similarity of
+rating behaviour, not similarity of content.
+
+The standard procedure is to compare the columns of the user-item-matrix
+(the item-vectors) pairwise using a similarity measure such as Pearson
+correlation, cosine similarity or the log-likelihood ratio to find similar
+items, and to use those together with the user's ratings to predict his/her
+preference towards unknown items.
+
+
+<a name="ItembasedCollaborativeFiltering-Map/Reduceimplementations"></a>
+### Map/Reduce implementations
+
+Mahout offers two Map/Reduce jobs aimed to support Itembased Collaborative
+Filtering.
+
+*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob*
+computes all similar items. It expects a .csv file with the preference data
+as input, where each line represents a single preference in the form
+_userID,itemID,value_ and outputs pairs of itemIDs with their associated
+similarity value.
+
+_job specific options_
+
+<table>
+<tr><td>input</td><td>path to input directory</td></tr>
+<tr><td>output</td><td>path to output directory</td></tr>
+<tr><td>similarityClassname</td><td>name of the distributed similarity class to instantiate,
+or one of the predefined similarities: SIMILARITY_COOCCURRENCE,
+SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD,
+SIMILARITY_PEARSON_CORRELATION, SIMILARITY_TANIMOTO_COEFFICIENT,
+SIMILARITY_UNCENTERED_COSINE, SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE</td></tr>
+<tr><td>maxSimilaritiesPerItem</td><td>try to cap the number of similar items per item to
+this number</td></tr>
+<tr><td>maxPrefsPerUser</td><td>maximum number of preferences to consider per user; users with
+more preferences will be sampled down</td></tr>
+<tr><td>minPrefsPerUser</td><td>ignore users with fewer preferences than this</td></tr>
+<tr><td>booleanData</td><td>treat input as having no preference values</td></tr>
+<tr><td>threshold</td><td>discard item pairs with a similarity value below this</td></tr>
+</table>
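+
+A typical invocation might look like the following (a sketch: the jar
+location, the paths and the choice of SIMILARITY_LOGLIKELIHOOD are
+placeholders; depending on your release the job may also be exposed through
+the bin/mahout driver script):
+
+    hadoop jar <path to the mahout core job jar> \
+        org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
+        --input <preference .csv file in HDFS> \
+        --output <output directory> \
+        --similarityClassname SIMILARITY_LOGLIKELIHOOD \
+        --maxSimilaritiesPerItem 50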
+
+*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely
+distributed itembased recommender. It expects a .csv file with the
+preference data as input, where each line represents a single preference in
+the form _userID,itemID,value_ and outputs userIDs with associated
+recommended itemIDs and their scores.
+
+_job specific options_
+
+<table>
+<tr><td>input</td><td>path to input directory</td></tr>
+<tr><td>output</td><td>path to output directory</td></tr>
+<tr><td>numRecommendations</td><td>number of recommendations per user</td></tr>
+<tr><td>usersFile</td><td>file of users to recommend for</td></tr>
+<tr><td>itemsFile</td><td>file of items to recommend for</td></tr>
+<tr><td>filterFile</td><td>file containing comma-separated userID,itemID pairs. Used to
+exclude the given item from the recommendations for the corresponding user (optional)</td></tr>
+<tr><td>maxPrefsPerUser</td><td>maximum number of preferences considered per user in final
+recommendation phase</td></tr>
+<tr><td>similarityClassname</td><td>name of the distributed similarity class to instantiate,
+or one of the predefined similarities: SIMILARITY_COOCCURRENCE,
+SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD,
+SIMILARITY_PEARSON_CORRELATION, SIMILARITY_TANIMOTO_COEFFICIENT,
+SIMILARITY_UNCENTERED_COSINE, SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE</td></tr>
+<tr><td>maxSimilaritiesPerItem</td><td>try to cap the number of similar items per item to
+this number</td></tr>
+<tr><td>maxPrefsPerUserInItemSimilarity</td><td>maximum number of preferences to consider per
+user in the similarity computation; users with more preferences will be sampled down</td></tr>
+<tr><td>minPrefsPerUser</td><td>ignore users with fewer preferences than this</td></tr>
+<tr><td>booleanData</td><td>treat input as having no preference values</td></tr>
+<tr><td>threshold</td><td>discard item pairs with a similarity value below this</td></tr>
+</table>
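+
+A typical invocation might look like the following (a sketch: the jar
+location, the paths and the similarity choice are again placeholders):
+
+    hadoop jar <path to the mahout core job jar> \
+        org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
+        --input <preference .csv file in HDFS> \
+        --output <output directory> \
+        --similarityClassname SIMILARITY_LOGLIKELIHOOD \
+        --numRecommendations 10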
+
+<a name="ItembasedCollaborativeFiltering-Resources"></a>
+### Resources
+
+* [Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf)
+* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout](http://www.slideshare.net/sscdotopen/mahoutcf)

Added: mahout/site/trunk/content/k-means-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/k-means-clustering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/k-means-clustering.mdtext (added)
+++ mahout/site/trunk/content/k-means-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,208 @@
+Title: K-Means Clustering
+k-Means is a rather <b>simple</b> but well-known algorithm
+for grouping objects (clustering). As with the other clustering algorithms,
+all objects need to be represented as a set of numerical features. In
+addition the user has to specify the number of groups (referred to as _k_)
+to identify.
+
+Each object can be thought of as being represented by some feature vector
+in an _n_-dimensional space, _n_ being the number of features used to
+describe the objects to cluster. The algorithm randomly chooses _k_ points
+in that vector space; these points serve as the initial centers of the
+clusters. Afterwards each object is assigned to the center it is closest
+to. Usually the distance measure is chosen by the user and determined by
+the learning task.
+
+After that, for each cluster a new center is computed by averaging the
+feature vectors of all objects assigned to it. Assigning objects and
+recomputing centers is repeated until the process converges.
+The algorithm can be proven to converge after a finite number of
+iterations.
+
+Several tweaks concerning distance measure, initial center choice and
+computation of new average centers have been explored, as well as the
+estimation of the number of clusters _k_. Yet the main principle always
+remains the same.
+
+
+
+<a name="K-MeansClustering-Quickstart"></a>
+## Quickstart
+
+[Here](k-means-clustering^quickstart-kmeans.sh.html)
+ is a short shell script outline that will get you started quickly with
+k-Means. This does the following:
+
+* Get the Reuters dataset
+* Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
+reuters-out from reuters-sgm (the downloaded archive)
+* Run seqdirectory to convert reuters-out to SequenceFile format
+* Run seq2sparse to convert SequenceFiles to sparse vector format
+* Finally, run kMeans with 20 clusters.
+
+After working through the output that scrolls past, reading the code of the
+script will give you a better understanding of the individual steps.
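+
+In outline, the script amounts to something like the following (a sketch:
+directory names are placeholders and the exact options of seqdirectory,
+seq2sparse and kmeans may differ between Mahout releases):
+
+    # convert the extracted Reuters articles to SequenceFiles
+    bin/mahout seqdirectory -i reuters-out -o reuters-seqfiles
+    # convert the SequenceFiles to sparse vectors
+    bin/mahout seq2sparse -i reuters-seqfiles -o reuters-vectors
+    # run k-Means with 20 clusters
+    bin/mahout kmeans -i reuters-vectors/tfidf-vectors -c reuters-initial-clusters \
+        -o reuters-kmeans -k 20 -x 10 -ow -cl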
+
+
+<a name="K-MeansClustering-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+Some ideas can be found in the [Cluster computing and MapReduce](http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html)
+lecture video series by Google; k-Means clustering is discussed in
+[lecture #4](http://www.youtube.com/watch?v=1ZDybXl212Q). Slides can be
+found [here](http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt).
+
+Interestingly, a Hadoop-based implementation using [Canopy clustering](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm)
+can be found at [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/)
+(GPL 3 licence).
+
+Another useful resource is [http://www2.chass.ncsu.edu/garson/PA765/cluster.htm](http://www2.chass.ncsu.edu/garson/PA765/cluster.htm).
+
+<a name="K-MeansClustering-Designofimplementation"></a>
+## Design of implementation
+
+The implementation accepts two input directories: one for the data points
+and one for the initial clusters. The data directory contains multiple
+input files of SequenceFile(key, VectorWritable), while the clusters
+directory contains one or more SequenceFiles(Text, Cluster \| Canopy)
+containing _k_ initial clusters or canopies. None of the input directories
+are modified by the implementation, allowing experimentation with initial
+clustering and convergence values.
+
+The program iterates over the input points and clusters, outputting a new
+directory "clusters-N" containing SequenceFile(Text, Cluster) files for
+each iteration N. This process uses a mapper/combiner/reducer/driver as
+follows:
+* KMeansMapper - reads the input clusters during its setup() method, then
+assigns and outputs each input point to its nearest cluster as defined by
+the user-supplied distance measure. Output key is: cluster identifier.
+Output value is: ClusterObservation.
+* KMeansCombiner - receives all key:value pairs from the mapper and
+produces partial sums of the input vectors for each cluster. Output key is:
+cluster identifier. Output value is ClusterObservation.
+* KMeansReducer - a single reducer receives all key:value pairs from all
+combiners and sums them to produce a new centroid for the cluster which is
+output. Output key is: encoded cluster identifier. Output value is:
+Cluster. The reducer encodes unconverged clusters with a 'Cn' cluster Id
+and converged clusters with 'Vn' clusterId.
+* KMeansDriver - iterates over the points and clusters until all output
+clusters have converged (Vn clusterIds) or until a maximum number of
+iterations has been reached. During iterations, a new clusters directory
+"clusters-N" is produced with the output clusters from the previous
+iteration used for input to the next. A final optional pass over the data
+using the KMeansClusterMapper clusters all points to an output directory
+"clusteredPoints" and has no combiner or reducer steps.
+
+Canopy clustering can be used to compute the initial clusters for k-Means:
+
+    // run the CanopyDriver job
+    CanopyDriver.runJob("testdata", "output",
+        ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);
+
+    // now run the KMeansDriver job
+    KMeansDriver.runJob("testdata", "output/clusters-0", "output",
+        EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
+
+In the above example, the input data points are stored in 'testdata' and
+the CanopyDriver is configured to output to the 'output/clusters-0'
+directory. Once the driver has executed, that directory will contain the
+canopy definition files. Upon running the KMeansDriver the output directory
+will have two or more new directories: 'clusters-N', containing the
+clusters for each iteration, and 'clusteredPoints', containing the
+clustered data points.
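+
+The same two-stage setup can be sketched on the command line as well
+(assuming the canopy driver's -t1/-t2 threshold options; the paths, distance
+measures and thresholds simply mirror the Java example above):
+
+    # compute initial clusters with Canopy
+    bin/mahout canopy -i testdata -o output \
+        -dm org.apache.mahout.common.distance.ManhattanDistanceMeasure -t1 3.1 -t2 2.1
+    # run k-Means, seeded with the canopies from clusters-0
+    bin/mahout kmeans -i testdata -c output/clusters-0 -o output \
+        -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.001 -x 10 -cl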
+
+<a name="K-MeansClustering-Runningk-MeansClustering"></a>
+## Running k-Means Clustering
+
+The k-Means clustering algorithm may be run using a command-line invocation
+on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout kmeans \
+        -i <input vectors directory> \
+        -c <input clusters directory> \
+        -o <output working directory> \
+        -k <optional number of initial clusters to sample from input vectors> \
+        -dm <DistanceMeasure> \
+        -x <maximum number of iterations> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present> \
+        -cl <run input vector clustering after the iterations have taken place> \
+        -xm <execution method: sequential or mapreduce>
+
+
+Note: if the \-k argument is supplied, any clusters in the \-c directory
+will be overwritten and \-k random points will be sampled from the input
+vectors to become the initial cluster centers.
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set,
+a SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. clusters: a file path string to a directory containing the initial
+clusters, a SequenceFile(key, Cluster \| Canopy). Both KMeans clusters and
+Canopy canopies may be used for the initial clusters.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. distanceMeasure: the fully-qualified class name of an instance of
+DistanceMeasure which will be used for the clustering.
+1. convergenceDelta: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. maxIter: the maximum number of iterations to run, independent of the
+convergence specified
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the k-means sequential
+implementation is to be used to process the input data.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, Cluster) produced
+by the algorithm for each iteration. The Text _key_ is a cluster identifier
+string.
+1. clusteredPoints: (if \--clustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster. For k-Means
+clustering, the weights are computed as 1/(1+distance) where the distance
+is between the cluster center and the vector using the chosen
+DistanceMeasure.
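+
+To take a quick look at these outputs you can, for example, use the
+clusterdump utility (a sketch: the exact option names have varied between
+Mahout releases, so consult bin/mahout clusterdump --help; 'clusters-N'
+stands for the directory written by the final iteration):
+
+    bin/mahout clusterdump \
+        -i output/clusters-N \
+        -p output/clusteredPoints \
+        -o clusteranalysis.txt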
+
+<a name="K-MeansClustering-Examples"></a>
+# Examples
+
+The following images illustrate k-Means clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0]
+ sd=3.0
+* 300 samples m=[1.0, 0.0]
+ sd=0.5
+* 300 samples m=[0.0, 2.0]
+ sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed.
+
+![SampleData](attachments/75159/23527474.png)
+
+In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As k-Means is an iterative algorithm, the centers of the clusters in each iteration are shown using different colors. Bold red is the final clustering; previous iterations are shown in orange, yellow, green, blue, violet and gray.
+Although it misses a lot of the points and cannot capture the original,
+superimposed cluster centers, it does a decent job of clustering this data.
+
+![KMeans](attachments/75159/23527477.png)
+
+The third image shows the results of running k-Means on a different data
+set (see [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ for details) which is generated using asymmetrical standard deviations.
+K-Means does a fair job handling this data set as well.
+
+![2dKMeans](attachments/75159/23527478.png)

Added: mahout/site/trunk/content/k-means-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/k-means-commandline.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/k-means-commandline.mdtext (added)
+++ mahout/site/trunk/content/k-means-commandline.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,99 @@
+Title: k-means-commandline
+<a name="k-means-commandline-Introduction"></a>
+# Introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run k-Means on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
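+
+For example (a sketch; the local target directory is a placeholder):
+
+    $HADOOP_HOME/bin/hadoop fs -lsr output
+    $HADOOP_HOME/bin/hadoop fs -get output ./kmeans-output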
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                          Path to job input directory.
+                                                  Must be a SequenceFile of
+                                                  VectorWritable
+      --clusters (-c) clusters                    The input centroids, as Vectors.
+                                                  Must be a SequenceFile of
+                                                  Writable, Cluster/Canopy. If k
+                                                  is also specified, then a random
+                                                  set of vectors will be selected
+                                                  and written out to this path
+                                                  first
+      --output (-o) output                        The directory pathname for
+                                                  output.
+      --distanceMeasure (-dm) distanceMeasure     The classname of the
+                                                  DistanceMeasure. Default is
+                                                  SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta   The convergence delta value.
+                                                  Default is 0.5
+      --maxIter (-x) maxIter                      The maximum number of
+                                                  iterations.
+      --maxRed (-r) maxRed                        The number of reduce tasks.
+                                                  Defaults to 2
+      --k (-k) k                                  The k in k-Means. If specified,
+                                                  then a random selection of k
+                                                  Vectors will be chosen as the
+                                                  Centroid and written to the
+                                                  clusters input path.
+      --overwrite (-ow)                           If present, overwrite the
+                                                  output directory before
+                                                  running job
+      --help (-h)                                 Print out help
+      --clustering (-cl)                          If present, run clustering
+                                                  after the iterations have
+                                                  taken place
+

Added: mahout/site/trunk/content/latent-dirichlet-allocation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/latent-dirichlet-allocation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/latent-dirichlet-allocation.mdtext (added)
+++ mahout/site/trunk/content/latent-dirichlet-allocation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,115 @@
+Title: Latent Dirichlet Allocation
+<a name="LatentDirichletAllocation-Overview"></a>
+# Overview
+
+Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
+algorithm for automatically and jointly clustering words into "topics" and
+documents into mixtures of topics. It has been successfully applied to
+model change in scientific fields over time (Griffiths and Steyvers, 2004;
+Hall, et al. 2008). 
+
+A topic model is, roughly, a hierarchical Bayesian model that associates
+with each document a probability distribution over "topics", which are in
+turn distributions over words. For instance, a topic in a collection of
+newswire might include words about "sports", such as "baseball", "home
+run", "player", and a document about steroid use in baseball might include
+"sports", "drugs", and "politics". Note that the labels "sports", "drugs",
+and "politics", are post-hoc labels assigned by a human, and that the
+algorithm itself only assigns associate words with probabilities. The task
+of parameter estimation in these models is to learn both what the topics
+are, and which documents employ them in what proportions.
+
+Another way to view a topic model is as a generalization of a mixture model
+like [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+. Starting from a normal mixture model, in which we have a single global
+mixture of several distributions, we instead say that _each_ document has
+its own mixture distribution over the globally shared mixture components.
+Operationally in Dirichlet Process Clustering, each document has its own
+latent variable drawn from a global mixture that specifies which model it
+belongs to, while in LDA each word in each document has its own parameter
+drawn from a document-wide mixture.
+
+The idea is that we use a probabilistic mixture of a number of models that
+we use to explain some observed data. Each observed data point is assumed
+to have come from one of the models in the mixture, but we don't know
+which.	The way we deal with that is to use a so-called latent parameter
+which specifies which model each data point came from.
+
+<a name="LatentDirichletAllocation-InvocationandUsage"></a>
+# Invocation and Usage
+
+Mahout's implementation of LDA operates on a collection of SparseVectors of
+word counts. These word counts should be non-negative integers, though
+things will probably work fine if you use non-negative reals. (Note
+that the probabilistic model doesn't make sense if you do!) To create these
+vectors, it's recommended that you follow the instructions in [Creating Vectors From Text](creating-vectors-from-text.html)
+, making sure to use TF and not TFIDF as the scorer.
+
+Invocation takes the form:
+
+
+    bin/mahout lda \
+        -i <input vectors directory> \
+        -o <output working directory> \
+        -k <numTopics> \
+        -v <number of words> \
+        -a <optional topic smoothing. Default: 50/numTopics> \
+        -x <optional number of iterations. Default is -1 (until convergence)> \
+
+
+Topic smoothing should generally be about 50/K, where K is the number of
+topics. The number of words in the vocabulary can be an upper bound, though
+it shouldn't be too high (for memory concerns). 
+
+Choosing the number of topics is more art than science, and it's
+recommended that you try several values.
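+
+As a concrete starting point, a run with 20 topics over a corpus with
+roughly 50,000 distinct terms might look like this (a sketch: paths and
+numbers are placeholders, and the smoothing value simply follows the 50/K
+rule of thumb above):
+
+    bin/mahout lda \
+        -i reuters-vectors/tf-vectors \
+        -o reuters-lda \
+        -k 20 \
+        -v 50000 \
+        -a 2.5 \
+        -x 40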
+
+After running LDA you can obtain an output of the computed topics using the
+LDAPrintTopics utility:
+
+
+    bin/mahout ldatopics \
+        -i <input vectors directory> \
+        -d <input dictionary file> \
+        -o <optional output working directory. Default is to console> \
+        -dt <optional dictionary type (text|sequencefile). Default is text>
+
+
+
+<a name="LatentDirichletAllocation-Example"></a>
+# Example
+
+An example is located in mahout/examples/bin/build-reuters.sh. The script
+automatically downloads the Reuters-21578 corpus, builds a Lucene index and
+converts the Lucene index to vectors. By uncommenting the last two lines
+you can then cause it to run LDA on the vectors and finally print the
+resultant topics to the console. 
+
+To adapt the example yourself, you should note that Lucene has specialized
+support for Reuters, and that building your own index will require some
+adaptation. The rest should hopefully not differ too much.
+
+<a name="LatentDirichletAllocation-ParameterEstimation"></a>
+# Parameter Estimation
+
+We use mean field variational inference to estimate the models. Variational
+inference can be thought of as a generalization of [EM](expectation-maximization.html)
+ for hierarchical Bayesian models. The E-Step takes the form of, for each
+document, inferring the posterior probability of each topic for each word
+in each document. We then take the sufficient statistics and emit them in
+the form of (log) pseudo-counts for each word in each topic. The M-Step is
+simply to sum these together and (log) normalize them so that we have a
+distribution over the entire vocabulary of the corpus for each topic. 
+
+In implementation, the E-Step is implemented in the Map, and the M-Step is
+executed in the reduce step, with the final normalization happening as a
+post-processing step.
+
+<a name="LatentDirichletAllocation-References"></a>
+# References
+
+[David M. Blei, Andrew Y. Ng, Michael I. Jordan, John Lafferty. 2003. Latent Dirichlet Allocation. JMLR.](http://www.cs.princeton.edu/~blei/papers/bleingjordan2003.pdf.html)
+
+[Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS.  ](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)
+
+[David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models ](http://www.aclweb.org/anthology-new/d/d08/d08-1038.pdf.html)

Added: mahout/site/trunk/content/lda-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/lda-commandline.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/lda-commandline.mdtext (added)
+++ mahout/site/trunk/content/lda-commandline.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,69 @@
+Title: lda-commandline
+<a name="lda-commandline-RunningLatentDirichletAllocationfromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation from the Command Line
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run LDA on
+that cluster. If either of the environment variables are missing then the
+stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout lda <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout lda -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout lda -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                      Path to job input directory. Must
+                                              be a SequenceFile of VectorWritable
+      --output (-o) output                    The directory pathname for output.
+      --numTopics (-k) numTopics              The total number of topics in the
+                                              corpus
+      --numWords (-v) numWords                The total number of words in the
+                                              corpus (can be approximate, needs
+                                              to exceed the actual value)
+      --topicSmoothing (-a) topicSmoothing    Topic smoothing parameter. Default
+                                              is 50/numTopics.
+      --maxIter (-x) maxIter                  The maximum number of iterations.
+      --maxRed (-r) maxRed                    The number of reduce tasks.
+                                              Defaults to 2
+      --overwrite (-ow)                       If present, overwrite the output
+                                              directory before running job
+      --help (-h)                             Print out help
+

Added: mahout/site/trunk/content/llr---log-likelihood-ratio.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/llr---log-likelihood-ratio.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/llr---log-likelihood-ratio.mdtext (added)
+++ mahout/site/trunk/content/llr---log-likelihood-ratio.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,37 @@
+Title: LLR - Log-likelihood Ratio
+The likelihood ratio test is used to compare the fit of two models, one
+of which is nested within the other.
+
+In the context of machine learning and the Mahout project in particular,
+the term LLR is usually meant to refer to a test of significance for two
+binomial distributions, also known as the G squared statistic.	This is a
+special case of the multinomial test and is closely related to mutual
+information.  The value of this statistic is not normally used in this
+context as a true frequentist test of significance since there would be
+obvious and dreadful problems to do with multiple comparisons, but rather
+as a heuristic score to order pairs of items with the most interestingly
+connected items having higher scores.  In this usage, the LLR has proven
+very useful for discriminating between pairs of features that have
+interesting degrees of cooccurrence and those that do not, with usefully
+small false positive and false negative rates.  The LLR is typically far
+more suitable in the case of small counts than many other measures such as
+Pearson's correlation, Pearson's chi squared statistic or z statistics.  The LLR as
+stated does not, however, make any use of rating data which can limit its
+applicability in problems such as the Netflix competition. 
+
+The actual value of the LLR is not usually very helpful other than as a way
+of ordering pairs of items.  As such, it is often used to determine a
+sparse set of coefficients to be estimated by other means such as TF-IDF. 
+Since the actual estimation of these coefficients can be done in a way that
+is independent of the training data such as by general corpus statistics,
+and since the ordering imposed by the LLR is relatively robust to counting
+fluctuation, this technique can provide very strong results in very sparse
+problems where the potential number of features vastly out-numbers the
+number of training examples and where features are highly interdependent.
+
+ See Also: 
+ * http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
+ * http://en.wikipedia.org/wiki/G-test
+ * http://en.wikipedia.org/wiki/Likelihood-ratio_test
+
+      
\ No newline at end of file

Added: mahout/site/trunk/content/locally-weighted-linear-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/locally-weighted-linear-regression.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/locally-weighted-linear-regression.mdtext (added)
+++ mahout/site/trunk/content/locally-weighted-linear-regression.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,20 @@
+Title: Locally Weighted Linear Regression
+
+<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a>
+# Locally Weighted Linear Regression
+
+Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
+use the data to build a parameterized model. After training, the model is
+used for predictions and the data are generally discarded. In contrast,
+"memory-based" methods are non-parametric approaches that explicitly retain
+the training data, and use it each time a prediction needs to be made.
+Locally weighted regression (LWR) is a memory-based method that performs a
+regression around a point of interest using only training data that are
+"local" to that point. Source:
+http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html
+
+<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a>
+## Strategy for parallel regression
+
+<a name="LocallyWeightedLinearRegression-Designofpackages"></a>
+## Design of packages

Added: mahout/site/trunk/content/logistic-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/logistic-regression.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/logistic-regression.mdtext (added)
+++ mahout/site/trunk/content/logistic-regression.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,108 @@
+Title: Logistic Regression
+<a name="LogisticRegression-LogisticRegression(SGD)"></a>
+# Logistic Regression (SGD)
+
+Logistic regression is a model used for prediction of the probability of
+occurrence of an event. It makes use of several predictor variables that
+may be either numerical or categorical.
+
+Logistic regression is the standard industry workhorse that underlies many
+production fraud detection and advertising quality and targeting products. 
+The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow
+large training sets to be used.
+
+For a more detailed analysis of the approach, have a look at the thesis of
+Paul Komarek:
+
+http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en
+
+See MAHOUT-228 for the main JIRA issue for SGD.
+
+
+<a name="LogisticRegression-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+The bad news is that SGD is an inherently sequential algorithm.  The good
+news is that it is blazingly fast and thus it is not a problem for Mahout's
+implementation to handle training sets of tens of millions of examples. 
+With the down-sampling typical in many data-sets, this is equivalent to a
+dataset with billions of raw training examples.
+
+The SGD system in Mahout is an online learning algorithm which means that
+you can learn models in an incremental fashion and that you can do
+performance testing as your system runs.  Often this means that you can
+stop training when a model reaches a target level of performance.  The SGD
+framework includes classes to do on-line evaluation using cross validation
+(the CrossFoldLearner) and an evolutionary system to do learning
+hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). 
+The AdaptiveLogisticRegression system makes heavy use of threads to
+increase machine utilization.  The way it works is that it runs 20
+CrossFoldLearners in separate threads, each with slightly different
+learning parameters.  As better settings are found, these new settings are
+propagated to the other learners.
+
+<a name="LogisticRegression-Designofpackages"></a>
+## Design of packages
+
+There are three packages that are used in Mahout's SGD system.	These
+include
+
+* The vector encoding package (found in
+org.apache.mahout.vectorizer.encoders)
+
+* The SGD learning package (found in org.apache.mahout.classifier.sgd)
+
+* The evolutionary optimization system (found in org.apache.mahout.ep)
+
+<a name="LogisticRegression-Featurevectorencoding"></a>
+### Feature vector encoding
+
+Because the SGD algorithms need to have fixed length feature vectors and
+because it is a pain to build a dictionary ahead of time, most SGD
+applications use the hashed feature vector encoding system that is rooted
+at FeatureVectorEncoder.
+
+The basic idea is that you create a vector, typically a
+RandomAccessSparseVector, and then you use various feature encoders to
+progressively add features to that vector.  The size of the vector should
+be large enough to avoid feature collisions as features are hashed.
+
+There are specialized encoders for a variety of data types.  You can
+normally encode either a string representation of the value you want to
+encode or you can encode a byte level representation to avoid string
+conversion.  In the case of ContinuousValueEncoder and
+ConstantValueEncoder, it is also possible to encode a null value and pass
+the real value in as a weight.	This avoids numerical parsing entirely in
+case you are getting your training data from a system like Avro.
+
+Here is a class diagram for the encoders package:
+
+![vector-class-hierarchy](attachments/75687/24346644.png)
+
+<a name="LogisticRegression-SGDLearning"></a>
+### SGD Learning
+
+For the simplest applications, you can construct an
+OnlineLogisticRegression and be off and running.  Typically, though, it is
+nice to have running estimates of performance on held out data.  To do
+that, you should use a CrossFoldLearner which keeps a stable of five (by
+default) OnlineLogisticRegression objects.  Each time you pass a training
+example to a CrossFoldLearner, it passes this example to all but one of its
+children as training and passes the example to the last child to evaluate
+current performance.  The children are used for evaluation in a round-robin
+fashion so, if you are using the default 5 way split, all of the children
+get 80% of the training data for training and get 20% of the data for
+evaluation.
+
+To avoid the pesky need to configure learning rates, regularization
+parameters and annealing schedules, you can use the
+AdaptiveLogisticRegression.  This class maintains a pool of
+CrossFoldLearners and adapts learning rates and regularization on the fly
+so that you don't have to.
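+
+For quick command-line experiments there are also the small
+trainlogistic/runlogistic example programs (a sketch based on the donut
+example shipped with the examples module; file names, predictor names and
+parameter values are illustrative and may differ between releases):
+
+    bin/mahout trainlogistic \
+        --input donut.csv --output ./donut-model \
+        --target color --categories 2 \
+        --predictors x y --types numeric \
+        --features 20 --passes 100 --rate 50
+
+    bin/mahout runlogistic --input donut.csv --model ./donut-model --auc --confusion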
+
+Here is a class diagram for the classifiers.sgd package.  As you can see,
+the number of twiddlable knobs is pretty large.  For some examples, see the
+TrainNewsGroups example code.
+
+![sgd-class-hierarchy](attachments/75687/24346645.png)
+

Added: mahout/site/trunk/content/machine-learning-resources.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/machine-learning-resources.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/machine-learning-resources.mdtext (added)
+++ mahout/site/trunk/content/machine-learning-resources.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,10 @@
+Title: Machine Learning Resources
+<a name="MachineLearningResources-MachineLearningingeneral"></a>
+## Machine Learning in general
+* [Machine Learning Videos](http://www.ml-class.org)
+ by Andrew Ng
+
+<a name="MachineLearningResources-AboutMahout"></a>
+## About Mahout
+* [Mahout in Action](http://www.manning.com/owen)
+ by Sean Owen et al.

Added: mahout/site/trunk/content/mahout-benchmarks.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-benchmarks.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-benchmarks.mdtext (added)
+++ mahout/site/trunk/content/mahout-benchmarks.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,152 @@
+Title: Mahout Benchmarks
+<a name="MahoutBenchmarks-Introduction"></a>
+# Introduction
+
+TODO:  YMMV
+
+<a name="MahoutBenchmarks-Recommenders"></a>
+# Recommenders
+
+<a name="MahoutBenchmarks-ARuleofThumb"></a>
+## A Rule of Thumb
+
+100M preferences are about the data set size where non-distributed
+recommenders will outgrow a normal-sized machine (32-bit, <= 4GB RAM). Your
+mileage will vary significantly with the nature of the data.
+
+<a name="MahoutBenchmarks-Distributedrecommendervs.Wikipedialinks(May272010)"></a>
+## Distributed recommender vs. Wikipedia links (May 27 2010)
+
+From the mailing list:
+
+I just finished running a set of recommendations based on the Wikipedia
+link graph, for book purposes (yeah, it's unconventional). I ran on my
+laptop, but it ought to be crudely representative of how it runs in a real
+cluster.
+
+The input is 1058MB as a text file and contains 130M article-article
+associations, from 5.7M articles to 3.8M distinct articles ("users" and
+"items", respectively). I estimate cost based on Amazon's North
+American small Linux-based instance pricing of $0.085/hour. I ran on a
+dual-core laptop with plenty of RAM, allowing 1GB per worker, so this is
+valid.
+
+In this run, I run recommendations for all 5.7M "users". You can certainly
+run for any subset of all users of course.
+
+Phase 1 (Item ID to item index mapping)
+29 minutes CPU time
+$0.05
+60MB output
+
+Phase 2 (Create user vectors)
+88 minutes CPU time
+$0.13
+Output: 1159MB
+
+Phase 3 (Count co-occurrence)
+77 hours CPU time
+$6.54
+Output: 23.6GB
+
+Phase 4 (Partial multiply prep)
+10.5 hours CPU time
+$0.90
+Output: 24.6GB
+
+Phase 5 (Aggregate and recommend)
+about 600 hours
+about $51.00
+about 10GB
+(I estimated these rather than let it run at home for days!)
+
+
+Note that phases 1 and 3 may be run less frequently, and need not be run
+every time. But the cost is dominated by the last step, which is most of
+the work. I've ignored storage costs.
+
+This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000 user
+recommendations. That's not bad if, say, you want to update recs for you
+site's 100,000 daily active users for a dollar.
+
+There are several levers one could pull internally to sacrifice accuracy
+for speed, but it's currently set to pretty normal values. So this is just
+one possibility.
+
+Now that's not terrible, but it is about 8x more computing than would be
+needed by a non-distributed implementation *if* you could fit the whole
+data set into a very large instance's memory, which is still possible at
+this scale but needs a pretty big instance. That's a very apples-to-oranges
+comparison of course; different algorithms, entirely different
+environments. This is about the amount of overhead I'd expect from
+distributing -- interesting to note how non-trivial it is.
+
+<a name="MahoutBenchmarks-Non-distributedrecommendervs.KDDCupdataset(March2011)"></a>
+## Non-distributed recommender vs. KDD Cup data set (March 2011)
+
+(From the user@mahout.apache.org mailing list)
+
+I've been test-driving a simple application of Mahout recommenders (the
+non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data set
+(kddcup.yahoo.com).
+
+In the spirit of open-source, like I mentioned, I'm committing the extra
+code to mahout-examples that can be used to run a Recommender on the input
+and output the right format. And, I'd like to publish the rough timings
+too. Find all the source in org.apache.mahout.cf.taste.example.kddcup
+
+<a name="MahoutBenchmarks-Track1"></a>
+### Track 1
+
+* m2.2xlarge instance, 34.2GB RAM / 4 cores
+* Steady state memory consumption: ~19GB
+* Computation time: 30 hours (wall clock-time)
+* CPU time per user: ~0.43 sec
+* Cost on EC2: $34.20 (!)
+
+(Helpful hint on cost I realized after the fact: you can almost surely get
+spot instances for cheaper. The maximum price this sort of instance has
+gone for as a spot instance is about $0.60/hour, vs "retail price" of
+$1.14/hour.)
+
+Resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is only
+good enough for 29th place at the moment. Not terrible for "out of the box"
+performance -- it's just using an item-based recommender with uncentered
+cosine similarity. But not really good in absolute terms. A winning
+solution is going to try to factor in time, and apply more sophisticated
+techniques. The best RMSE so far is about 23.
+
+<a name="MahoutBenchmarks-Track2"></a>
+### Track 2
+
+* c1.xlarge instance: 7GB RAM / 8 cores
+* Steady state memory consumption: ~3.8GB
+* Computation time: 4.1 hours (wall clock-time)
+* CPU time per user: ~1.1 sec
+* Cost on EC2: $3.20
+
+For this I bothered to write a simplistic item-item similarity metric to
+take into account the additional info that is available: track, artist,
+album, genre. The result was comparatively better: 17.92% error rate, good
+enough for 4th place at the moment.
+
+Of course, the next task is to put this through the actual distributed
+processing -- that's really the appropriate solution.
+
+This shows you can still tackle fairly impressive scale with a
+non-distributed solution. These results suggest that the largest instances
+available from EC2 would accommodate almost 1 billion ratings in memory.
+However at that scale running a user's full recommendations would easily be
+measured in seconds, not milliseconds.
+
+<a name="MahoutBenchmarks-Clustering"></a>
+# Clustering
+
+See [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)
+
+<a name="MahoutBenchmarks-Classification"></a>
+# Classification
+
+<a name="MahoutBenchmarks-FrequentPatternsetMining"></a>
+# Frequent Patternset Mining
+

Added: mahout/site/trunk/content/mahout-collections.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-collections.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-collections.mdtext (added)
+++ mahout/site/trunk/content/mahout-collections.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,52 @@
+Title: mahout-collections
+<a name="mahout-collections-Introduction"></a>
+# Introduction
+
+The Mahout Collections library is a set of container classes that address
+some limitations of the standard collections in Java. [This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf)
+ describes a number of performance problems with the standard collections. 
+
+Mahout collections addresses two of the more glaring: the lack of support
+for primitive types and the lack of open hashing.
+
+<a name="mahout-collections-PrimitiveTypes"></a>
+# Primitive Types
+
+The most visible feature of Mahout Collections is the large collection of
+primitive type collections. Given Java's asymmetrical support for the
+primitive types, the only efficient way to handle them is with many
+classes. So, there are ArrayList-like containers for all of the primitive
+types, and hash maps for all the useful combinations of primitive type and
+object keys and values.
+
+These classes do not, in general, implement interfaces from *java.util*.
+Even when the *java.util* interfaces could be type-compatible, they tend
+to include requirements that are not consistent with efficient use of
+primitive types.
+
+<a name="mahout-collections-OpenAddressing"></a>
+# Open Addressing
+
+All of the sets and maps in Mahout Collections are open-addressed hash
+tables. Open addressing has a much smaller memory footprint than chaining.
+Since the purpose of these collections is to avoid the memory cost of
+autoboxing, open addressing is a consistent design choice.
+
+<a name="mahout-collections-Sets"></a>
+# Sets
+
+Mahout Collections includes open hash sets. Unlike *java.util*, a set is
+not a recycled hash table; the sets are separately implemented and do not
+have any additional storage usage for unused keys.
+
+<a name="mahout-collections-CreditwhereCreditisdue"></a>
+# Credit where Credit is due
+
+The implementation of Mahout Collections is derived from [Cern Colt](http://acs.lbl.gov/~hoschek/colt/)
+.
+
+
+
+
+
+

Added: mahout/site/trunk/content/mahout-on-amazon-ec2.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-on-amazon-ec2.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-on-amazon-ec2.mdtext (added)
+++ mahout/site/trunk/content/mahout-on-amazon-ec2.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,226 @@
+Title: Mahout on Amazon EC2
+Amazon EC2 is a compute-on-demand platform sold by Amazon.com that allows
+users to purchase one or more host machines on an hourly basis and execute
+applications.  Since Hadoop can run on EC2, it is also possible to run
+Mahout on EC2.	The following sections will detail how to create a Hadoop
+cluster from the ground up. Alternatively, you can use an existing Hadoop
+AMI, in which case, please see [Use an Existing Hadoop AMI](use-an-existing-hadoop-ami.html)
+.
+
+  
+<a name="MahoutonAmazonEC2-Prerequisites"></a>
+# Prerequisites
+
+To run Mahout on EC2 you need to start up a Hadoop cluster on one or more
+instances of a Hadoop-0.20.2 compatible Amazon Machine Image (AMI).
+Unfortunately, there do not currently exist any public AMIs that support
+Hadoop-0.20.2; you will have to create one. The following steps begin with
+a public Cloudera Ubuntu AMI that comes with Java installed on it. You
+could use any other AMI with Java installed or you could use a clean AMI
+and install Java yourself. These instructions assume some familiarity with
+Amazon EC2 concepts and terminology. See the Amazon EC2 User Guide, in
+References below.
+
+1. From the [AWS Management Console](https://console.aws.amazon.com/ec2/home#c=EC2&s=Home)
+/AMIs, start the following AMI (_ami-8759bfee_)
+
+    cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-x86_64.manifest.xml 
+
+1. From the AWS Console/Instances, select the instance and right-click
+'Connect" to get the connect string which contains your <instance public
+DNS name>
+
+    > ssh -i <gsg-keypair.pem> root@<instance public DNS name>
+
+1. In the root home directory evaluate:
+
+    # apt-get update
+    # apt-get upgrade   // optional, but probably advisable since the AMI is over a year old
+    # apt-get install python-setuptools
+    # easy_install "simplejson==2.0.9"
+    # easy_install "boto==1.8d"
+    # apt-get install ant
+    # apt-get install subversion
+    # apt-get install maven2
+
+1. Add the following to your .profile
+
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun
+    export HADOOP_HOME=/usr/local/hadoop-0.20.2
+    export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
+    export MAHOUT_HOME=/usr/local/mahout-0.4
+    export MAHOUT_VERSION=0.4-SNAPSHOT
+    export MAVEN_OPTS=-Xmx1024m
+
+1. Upload the Hadoop distribution and configure it. This distribution is not
+available on the Hadoop site. You can download a beta version from [Cloudera's CH3 distribution](http://archive.cloudera.com/cdh/3/)
+
+    > scp -i <gsg-keypair.pem> <where>/hadoop-0.20.2.tar.gz root@<instance public DNS name>:.
+    
+    # tar -xzf hadoop-0.20.2.tar.gz
+    # mv hadoop-0.20.2 /usr/local/.
+
+1. Configure Hadoop for temporary single node operation
+1. # add the following to $HADOOP_HOME/conf/hadoop-env.sh
+
+    # The java implementation to use.  Required.
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun
+    
+    # The maximum amount of heap to use, in MB. Default is 1000.
+    export HADOOP_HEAPSIZE=2000
+
+1. # add the following to $HADOOP_HOME/conf/core-site.xml and also
+$HADOOP_HOME/conf/mapred-site.xml
+
+    <configuration>
+      <property>
+        <name>fs.default.name</name>
+        <value>hdfs://localhost:9000</value>
+      </property>
+    
+      <property>
+        <name>mapred.job.tracker</name>
+        <value>localhost:9001</value>
+      </property>
+    
+      <property>
+        <name>dfs.replication</name>
+        <value>1</value>
+    	<!-- set to 1 to reduce warnings when 
+    	running on a single node -->
+      </property>
+    </configuration>
+
+1. # set up authorized keys for localhost login w/o passwords and format your
+name node
+
+    # ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
+    # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
+    # $HADOOP_HOME/bin/hadoop namenode -format
+
+1. Checkout and build Mahout from trunk. Alternatively, you can upload a
+Mahout release tarball and install it as we did with the Hadoop tarball
+(Don't forget to update your .profile accordingly).
+
+    # svn co http://svn.apache.org/repos/asf/mahout/trunk mahout 
+    # cd mahout
+    # mvn clean install
+    # cd ..
+    # mv mahout /usr/local/mahout-0.4
+
+1. Run Hadoop, just to prove you can, and test Mahout by building the
+Reuters dataset on it. Finally, delete the files and shut it down.
+
+    # $HADOOP_HOME/bin/hadoop namenode -format
+    # $HADOOP_HOME/bin/start-all.sh
+    # jps    // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    
+    # $HADOOP_HOME/bin/stop-all.sh
+    # rm -rf /tmp/* 		  // delete the Hadoop files
+
+1. Remove the single-host stuff you added to $HADOOP_HOME/conf/core-site.xml
+and $HADOOP_HOME/conf/mapred-site.xml in step #6b and verify you are happy
+with the other conf file settings. The Hadoop startup scripts will not make
+any changes to them. In particular, upping the Java heap size is required
+for many of the Mahout jobs.
+
+       // $HADOOP_HOME/conf/mapred-site.xml
+       <property>
+         <name>mapred.child.java.opts</name>
+         <value>-Xmx2000m</value>
+       </property>
+
+1. Bundle your image into a new AMI, upload it to S3 and register it so it
+can be launched multiple times to construct a Mahout-ready Hadoop cluster.
+(See Amazon's [Preparing And Creating AMIs](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?PreparingAndCreatingAMIs.html)
+ for details). 
+
+    // copy your AWS private key file and certificate file to /mnt on your
+    // instance (you don't want to leave these around in the AMI).
+    > scp -i <gsg-keypair.pem> <your AWS cert directory>/*.pem root@<instance public DNS name>:/mnt/.
+    
+    # Note that ec2-bundle-vol may fail if EC2_HOME is set.  So you may want to
+    # temporarily unset EC2_HOME before running the bundle command.  However the
+    # shell will need to have the correct value of EC2_HOME set before running
+    # the ec2-register step.
+    
+    # ec2-bundle-vol -k /mnt/pk*.pem -c /mnt/cert*.pem -u <your-AWS-user_id> -d /mnt -p mahout
+    # ec2-upload-bundle -b <your-s3-bucket> -m /mnt/mahout.manifest.xml -a <your-AWS-access_key> -s <your-AWS-secret_key>
+    # ec2-register -K /mnt/pk-*.pem -C /mnt/cert-*.pem <your-s3-bucket>/mahout.manifest.xml
+
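+As a rough sketch of the EC2_HOME note above (OLD_EC2_HOME is just a scratch
+variable invented for this illustration; adapt it to your own shell setup):
+
+    # OLD_EC2_HOME=$EC2_HOME; unset EC2_HOME
+    // ... run ec2-bundle-vol and ec2-upload-bundle as shown above ...
+    # export EC2_HOME=$OLD_EC2_HOME
+    // ... then run ec2-register ...
+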
+<a name="MahoutonAmazonEC2-GettingStarted"></a>
+# Getting Started
+
+1. Now you can go back to your AWS Management Console and try launching a
+single instance of your image. Once this launches, make sure you can
+connect to it and test it by re-running the test code.	If you removed the
+single host configuration added in step 6(b) above, you will need to re-add
+it before you can run this test. To test, run (again):
+
+    # $HADOOP_HOME/bin/hadoop namenode -format
+    # $HADOOP_HOME/bin/start-all.sh
+    # jps	  // you should see all 5 Hadoop processes (NameNode,
+SecondaryNameNode, DataNode, JobTracker, TaskTracker)
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    
+    # $HADOOP_HOME/bin/stop-all.sh
+    # rm -rf /tmp/* 		  // delete the Hadoop files
+
+
+1. Now that you have a working Mahout-ready AMI, follow [Hadoop's instructions](http://wiki.apache.org/hadoop/AmazonEC2)
+ to configure their scripts for your environment.
+1. # edit bin/hadoop-ec2-env.sh, setting the following environment variables:
+
+    AWS_ACCOUNT_ID
+    AWS_ACCESS_KEY_ID
+    AWS_SECRET_ACCESS_KEY
+    S3_BUCKET
+    (and perhaps others depending upon your environment)
+
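+For illustration only, those settings in bin/hadoop-ec2-env.sh might end up
+looking roughly like the sketch below; every value is a placeholder to be
+replaced with your own account details:
+
+    AWS_ACCOUNT_ID=<your-AWS-account_id>
+    AWS_ACCESS_KEY_ID=<your-AWS-access_key>
+    AWS_SECRET_ACCESS_KEY=<your-AWS-secret_key>
+    S3_BUCKET=<your-s3-bucket>
+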
+1. # edit bin/launch-hadoop-master and bin/launch-hadoop-slaves, setting:
+
+    AMI_IMAGE
+
+1. # finally, launch your cluster and log in
+
+    > bin/hadoop-ec2 launch-cluster test-cluster 2
+    > bin/hadoop-ec2 login test-cluster
+    # ...  
+    # exit
+    > bin/hadoop-ec2 terminate-cluster test-cluster     // when you are done
+with it
+
+
+<a name="MahoutonAmazonEC2-RunningtheExamples"></a>
+# Running the Examples
+1. Submit the Reuters test job
+
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    // the warnings about configuration files do not seem to matter
+
+1. See the Mahout [Quickstart](quickstart.html)
+ page for more examples.
+
+<a name="MahoutonAmazonEC2-References"></a>
+# References
+
+[Amazon EC2 User Guide](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html)
+[Hadoop's instructions](http://wiki.apache.org/hadoop/AmazonEC2)
+
+
+
+<a name="MahoutonAmazonEC2-Recognition"></a>
+# Recognition
+
+Some of the information available here was made possible through the "Amazon
+Services Apache Projects Testing Program".

Added: mahout/site/trunk/content/mahout-on-elastic-mapreduce.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-on-elastic-mapreduce.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-on-elastic-mapreduce.mdtext (added)
+++ mahout/site/trunk/content/mahout-on-elastic-mapreduce.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,522 @@
+Title: Mahout on Elastic MapReduce
+<a name="MahoutonElasticMapReduce-Introduction"></a>
+# Introduction
+
+This page details the set of steps that was necessary to get an example of
+k-Means clustering running on Amazon's [Elastic MapReduce](http://aws.amazon.com/elasticmapreduce/)
+ (EMR). 
+
+Note: Some of this work is due in part to credits donated by the Amazon Web
+Services Apache Projects Testing Program.
+
+<a name="MahoutonElasticMapReduce-GettingStarted"></a>
+# Getting Started
+
+   * Get yourself an EMR account.  If you're already using EC2, then you
+can do this from [Amazon's AWS Management Console](https://console.aws.amazon.com/)
+, which has a tab for running EMR.
+   * Get the [ElasticFox](https://addons.mozilla.org/en-US/firefox/addon/11626)
+ and [S3Fox](https://addons.mozilla.org/en-US/firefox/search?q=s3fox&cat=all)
+ Firefox extensions.  These will make it easy to monitor running EMR
+instances, upload code and data, and download results.
+   * Download the [Ruby command line client for EMR](http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262)
+.  You can do things from the GUI, but when you're in the midst of trying
+to get something running, the CLI client will make life a lot easier.
+   * Have a look at [Common Problems Running Job Flows](http://developer.amazonwebservices.com/connect/thread.jspa?messageID=124694&#124694)
+ and [Developing and Debugging Job Flows](http://developer.amazonwebservices.com/connect/message.jspa?messageID=124695#124695)
+ in the EMR forum at Amazon.  They were tremendously useful.
+   * Make sure that you're up to date with the Mahout source.  The fix for [Issue 118](http://issues.apache.org/jira/browse/MAHOUT-118)
+ is required to get things running when you're sending output to an S3
+bucket.
+   * Build the Mahout core and examples.
+
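+For reference, one minimal way to check out and build the core and examples
+(assuming Maven and Subversion are installed; adjust paths for your
+environment) is:
+
+    svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
+    cd mahout
+    mvn clean install
+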
+Note that the Hadoop running on EMR is version 0.20.0.
+The EMR GUI in the AWS Management Console provides a number of examples of
+using EMR, and you might want to try running one of these to get started.
+
+One big gotcha I discovered is that Hadoop's S3N file system has a couple of
+weird cases that boil down to the following advice: if you're naming a
+directory in an s3n URI, make sure it ends in a slash, and don't use a
+top-level S3 bucket name as the place where your Mahout output will go;
+always include a subdirectory.
+
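+For example (the bucket and path names here are made up purely for
+illustration):
+
+    s3n://my-bucket/mahout/output/    (good: subdirectory with trailing slash)
+    s3n://my-bucket                   (bad: top-level bucket, no subdirectory)
+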
+<a name="MahoutonElasticMapReduce-UploadingCodeandData"></a>
+# Uploading Code and Data
+
+I decided that I would use separate S3 buckets for the Mahout code, the
+input for the clustering (I used the synthetic control data; you can find
+it easily from the [Quickstart](quickstart.html)
+ page), and the output of the clustering.  
+
+You will need to upload:
+1. The Mahout Job jar.  For the example here, we are using
+*mahout-core-0.4-SNAPSHOT.job*
+1. The data.  In this example, we uploaded two files: dictionary.txt and
+part-out.vec.  The latter is the main vector file and the former is the
+dictionary that maps words to columns.	It was created by converting a
+Lucene index to Mahout vectors.
+
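+If you prefer the command line to S3Fox for these uploads, one possible
+approach (assuming s3cmd is installed and configured as described later on
+this page, and using made-up bucket names) is:
+
+    s3cmd put mahout-core-0.4-SNAPSHOT.job s3://my-mahout-code/
+    s3cmd put part-out.vec dictionary.txt s3://my-mahout-input/
+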
+
+<a name="MahoutonElasticMapReduce-Runningk-meansClustering"></a>
+# Running k-means Clustering
+
+EMR offers two modes for running MapReduce jobs.  The first is a
+"streaming" mode where you provide the source for single-step mapper and
+reducer functions (you can use languages other than Java for this).  The
+second mode is called "Custom Jar" and it gives you full control over the
+job steps that will run.  This is the mode that we need to use to run
+Mahout.  
+
+In order to run in Custom Jar mode, you need to look at the example that
+you want to run and figure out the arguments that you need to provide to
+the job.  Essentially, you need to know the command line that you would
+give to bin/hadoop in order to run the job, including whatever parameters
+the job needs to run.  
+
+<a name="MahoutonElasticMapReduce-UsingtheGUI"></a>
+## Using the GUI
+
+The EMR GUI is an easy way to start up a Custom Jar run, but it doesn't
+have the full functionality of the CLI.  Basically, you tell the GUI where
+in S3 the jar file is using a Hadoop s3n URI like
+*s3n://PATH/mahout-core-0.4-SNAPSHOT.job*.  The GUI will check and make
+sure that the given file exists, which is a nice sanity check.	You can
+then provide the arguments for the job just as you would on the command
+line.  The arguments for the k-means job were as follows:
+
+
+    org.apache.mahout.clustering.kmeans.KMeansDriver --input
+s3n://news-vecs/part-out.vec --clusters
+s3n://news-vecs/kmeans/clusters-9-11/ -k 10 --output
+s3n://news-vecs/out-9-11/ --distanceMeasure
+org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta
+0.001 --overwrite --maxIter 50 --clustering
+
+
+TODO: Screenshot
+
+The main failing with the GUI mode is that you can only specify a single
+job to run, and you can't run another job in the same set of instances. 
+Recall that on AWS you pay for partial hours at the hourly rate, so if your
+job fails in the first 10 seconds, you pay for the full hour, and if you try
+again, you'll pay for another hour.
+
+Because of this, using a command line interface (CLI) is strongly
+recommended.
+
+<a name="MahoutonElasticMapReduce-UsingtheCLI"></a>
+## Using the CLI
+
+If you're in development mode, and trying things out, EMR allows you to set
+up a set of instances and leave them running.  Once you've done this, you
+can add job steps to the set of instances as you like.	This solves the "10
+second failure" problem that I described above and lets you get full value
+for your EMR dollar.  Amazon has pretty good [documentation for the CLI](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?CHAP_UsingEMR.html)
+, which you'll need to read to figure out how to do things like set up your
+AWS credentials for the EMR CLI.
+
+You can start up a job flow that will keep running using an invocation like
+the following:
+
+
+    ./elastic-mapreduce --create --alive \
+       --log-uri s3n://PATH_FOR_LOGS/ --key_pair YOUR_KEY \
+       --num-instances 2 --name NAME_HERE
+
+
+Fill in the name, key pair and path for logs as appropriate. This call
+returns the name of the job flow, and you'll need that for subsequent calls
+to add steps to the job flow. You can, however, retrieve it at any time by
+calling:
+
+    ./elastic-mapreduce --list
+
+
+Let's list our job flows:
+
+
+    [stgreen@dhcp-ubur02-74-153 14:16:15 emr]
+$ ./elastic-mapreduce --list
+    j-3JB4UF7CQQ025     WAITING	  
+ec2-174-129-90-97.compute-1.amazonaws.com    kmeans
+
+
+At this point, everything's started up, and it's waiting for us to add a
+step to the job.  When we started the job flow, we specified a key pair
+that we created earlier so that we can log into the master while the job
+flow is running:
+
+
+     elastic-mapreduce --ssh -j j-3JB4UF7CQQ025
+
+
+Let's add a step to run a job:
+
+
+     elastic-mapreduce -j j-3JB4UF7CQQ025  --jar
+s3n://PATH/mahout-core-0.4-SNAPSHOT.job  --main-class
+org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg
+s3n://PATH/part-out.vec --arg --clusters --arg s3n://PATH/kmeans/clusters/
+--arg -k --arg 10 --arg --output --arg s3n://PATH/out-9-11/ --arg
+--distanceMeasure --arg 
+org.apache.mahout.common.distance.CosineDistanceMeasure --arg
+--convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50
+--arg --clustering
+
+
+When you do this, the job flow goes into the *RUNNING* state for a while
+and then returns to *WAITING* once the step has finished.  You can use
+the CLI or the GUI to monitor the step while it runs.  Once you've finished
+with your job flow, you can shut it down the following way:
+
+
+    ./elastic-mapreduce -j j-3JB4UF7CQQ025 --terminate
+
+
+and go look in your S3 buckets to find your output and logs.
+
+
+<a name="MahoutonElasticMapReduce-Troubleshooting"></a>
+# Troubleshooting
+
+The primary means for understanding what went wrong is via the logs and
+stderr/stdout.	When running on EMR, stderr and stdout are captured to
+files in your log directories. Additionally, logging is set up to write out
+to a file called syslog.  To view these in the AWS Console, go to your logs
+directory, then the folder with the same JobFlow id as above
+(j-3JB4UF7CQQ025), then the steps folder and then the appropriate step
+number (usually 1 for this case).
+
+That is, go to the folder s3n://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1.  In
+this directory, you will find stdout, stderr, syslog and potentially a few
+other logs. 
+
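+For example, assuming you have s3cmd installed and configured (see the setup
+steps later on this page), you could pull down a step's stderr for closer
+inspection; the job flow id below is the one used earlier:
+
+    s3cmd get s3://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1/stderr ./stderr
+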
+
+See [resulting thread](http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945&tstart=15)
+ for some early user experiences with Mahout on EMR.
+
+<a name="MahoutonElasticMapReduce-BuildingVectorsforLargeDocumentSets"></a>
+## Building Vectors for Large Document Sets
+
+Use the following steps as a guide to using Elastic MapReduce (EMR) to
+create sparse vectors needed for running Mahout clustering algorithms on
+large document sets. This section evolved from benchmarking Mahout's
+clustering algorithms using a large document set. Specifically, we used the
+ASF mail archives that have been parsed and converted to the Hadoop
+SequenceFile format (block-compressed) and saved to a public S3 folder:
+*s3://asf-mail-archives/mahout-0.4/sequence-files*. Overall, there are
+6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.
+
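+If you want a quick look at those files before launching anything, and
+provided the bucket permits listing, one option (assuming s3cmd is set up as
+described in the next step) is:
+
+    s3cmd ls s3://asf-mail-archives/mahout-0.4/sequence-files/
+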
+<a name="MahoutonElasticMapReduce-1.Setupelastic-mapreduce-ruby"></a>
+#### 1. Setup elastic-mapreduce-ruby
+
+As discussed previously, make sure you install the *elastic-mapreduce-ruby*
+tool. On Debian-based Linux like Ubuntu, use the following commands to
+install elastic-mapreduce-ruby's dependencies:
+
+
+    apt-get install ruby1.8
+    apt-get install libopenssl-ruby1.8
+    apt-get install libruby1.8-extras
+
+
+Once these dependencies are installed, download and extract the
+elastic-mapreduce-ruby application. We use */mnt/dev* as the base working
+directory because this process was originally conducted on an EC2 instance;
+be sure to replace this path with the correct path for your environment as
+you work through these steps.
+
+
+    mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
+    cd /mnt/dev/downloads
+    wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
+    cd /mnt/dev/elastic-mapreduce
+    unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip
+
+
+Please refer to [Amazon Elastic MapReduce Ruby Client](http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1)
+ for a detailed explanation, but to get running quickly, all you need to do
+is create a file named *credentials.json* in the elastic-mapreduce
+directory, such as */mnt/dev/elastic-mapreduce/credentials.json*. The
+credentials.json should contain the following information (change to match
+your environment):
+
+
+    { 
+      "access-id": "YOUR_ACCESS_KEY",
+      "private-key": "YOUR_SECRET_KEY", 
+      "key-pair": "gsg-keypair", 
+      "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem", 
+      "region": "us-east-1", 
+      "log-uri": "s3n://BUCKET/asf-mail-archives/logs/"
+    }
+
+  
+If you are confused about any of these parameters, please read: [Understanding Access Credentials for AWS/EC2](http://alestic.com/2009/11/ec2-credentials)
+. Also, it's a good idea to add the elastic-mapreduce directory to your
+PATH. To verify it is working correctly, simply do:
+
+
+    elastic-mapreduce --list
+
+
+<a name="MahoutonElasticMapReduce-2.Setups3cmdandCreateaBucket"></a>
+#### 2. Setup s3cmd and Create a Bucket
+
+It's also beneficial when working with EMR and S3 to install [s3cmd](http://s3tools.org/s3cmd)
+, which helps you interact with S3 using easy to understand command-line
+options. To install on Ubuntu, simply do:
+
+
+    sudo apt-get install s3cmd
+
+
+Once installed, configure s3cmd by doing:
+
+
+    s3cmd --configure
+
+
+If you don't have an S3 bucket to work with, then please create one using:
+
+
+    s3cmd mb s3://BUCKET
+
+
+Substitute your own bucket name whenever you see *s3://BUCKET* in the
+steps below.
+
+<a name="MahoutonElasticMapReduce-3.LaunchEMRCluster"></a>
+#### 3. Launch EMR Cluster
+
+Once elastic-mapreduce is installed, start a cluster with no jobflow steps:
+
+
+    elastic-mapreduce --create --alive \
+      --log-uri s3n://BUCKET/emr/logs/ \
+      --key-pair gsg-keypair \
+      --slave-instance-type m1.xlarge \
+      --master-instance-type m1.xlarge \
+      --num-instances # \
+      --name mahout-0.4-vectorize
+
+
+This will create an EMR Job Flow named "mahout-0.4-vectorize" in the
+US-East region using EC2 xlarge instances. Take note of the Job ID returned
+as you will need it to add the "seq2sparse" step to the Job Flow. It can
+take a few minutes for the cluster to start; the job flow enters a
+"waiting" status when it is ready. We launch the EMR instances in the
+*us-east-1* region so that we don't incur data transfer charges to/from
+US-Standard S3 buckets (credentials.json => "region":"us-east-1").
+
+When vectorizing large document sets, you need to distribute processing
+across as many reducers as possible. This also helps keep the size of the
+vector files more manageable. I'll leave it to you to decide how many
+instances to allocate, but keep in mind that one will be dedicated as the
+master (Hadoop NameNode). Also, it took about 75 minutes to run the
+seq2sparse job on 19 xlarge instances when using *maxNGramSize=2* (~190
+normalized instance hours – not cheap). I think you'll be safe to use
+about 10-13 instances and still finish in under 2 hours. Also, if you are
+not creating bi-grams, then you won't need as much horse-power; a four node
+cluster with 3 reducers per node is sufficient for generating vectors with
+*maxNGramSize = 1* in less than 30 minutes.
+
+_Tip: Amazon provides a bootstrap action to configure the cluster for
+running memory intensive jobs. For more information about this, see: [http://buyitnw.appspot.com/forums.aws.amazon.com/ann.jspa?annID=834](http://buyitnw.appspot.com/forums.aws.amazon.com/ann.jspa?annID=834)
+_
+
+<a name="MahoutonElasticMapReduce-4.CopyMahoutJARtoS3"></a>
+#### 4. Copy Mahout JAR to S3
+
+The Mahout 0.4 JAR containing a custom Lucene Analyzer
+(*org.apache.mahout.text.MailArchivesClusteringAnalyzer*) is available
+at:
+
+
+    s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar 
+
+
+The source code is available at [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)
+.
+
+If you need to use your own Mahout JAR, use s3cmd to copy it to your S3
+bucket:
+
+
+    s3cmd put JAR_FILE s3://BUCKET/
+
+
+<a name="MahoutonElasticMapReduce-5.Vectorize"></a>
+#### 5. Vectorize
+
+Schedule a jobflow step to vectorize (1-grams only) using Mahout's
+seq2sparse job:
+
+
+    elastic-mapreduce --jar
+s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg seq2sparse \
+      --arg -i --arg s3n://asf-mail-archives/mahout-0.4/sequence-files/ \
+      --arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
+      --arg --weight --arg tfidf \
+      --arg --minSupport --arg 500 \
+      --arg --maxDFPercent --arg 70 \
+      --arg --norm --arg 2 \
+      --arg --numReducers --arg # \
+      --arg --analyzerName --arg
+org.apache.mahout.text.MailArchivesClusteringAnalyzer \
+      --arg --maxNGramSize --arg 1 \
+      -j JOB_ID
+
+
+You need to determine the correct number of reducers based on the EC2
+instance type and size of your cluster. For xlarge nodes, set the number of
+reducers to 3 x N (where N is the size of your EMR cluster not counting the
+master node). For large instances, 2 reducers per node is probably safe
+unless your job is extremely CPU intensive, in which case use only 1
+reducer per node.
+
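+As a concrete illustration of that rule of thumb: for a cluster of one
+master plus four xlarge slaves (the "4+1 node" setup mentioned below),
+N = 4, so you would pass 3 x 4 = 12 reducers to the job:
+
+    --arg --numReducers --arg 12
+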
+Be sure to use Hadoop's *s3n* protocol for the input parameter
+(*-i s3n://asf-mail-archives/mahout-0.4/sequence-files/*) so that Mahout/Hadoop
+can find the SequenceFiles in S3. Also, notice that we've configured the
+job to send output to HDFS instead of S3. This is needed to work-around an
+issue with multi-step jobs and EMR (see [MAHOUT-598](https://issues.apache.org/jira/browse/MAHOUT-598)
+). Once the job completes, you can copy the results to S3 from the EMR
+cluster's HDFS using distcp.
+
+The job shown above created 6,076,937 vectors with 20,444 dimensions in
+around 28 minutes on a 4+1 node cluster of EC2 xlarge instances. Depending
+on the number of unique terms, setting maxNGramSize greater than 1 has a
+major impact on the execution time of the seq2sparse job. For example, the
+same job with maxNGramSize=2 can take up to 2 hours with the bulk of the
+time spent creating collocations, see [Collocations](collocations.html)
+.
+
+To monitor the status of the job, use:
+
+
+    elastic-mapreduce --logs -j JOB_ID
+
+
+<a name="MahoutonElasticMapReduce-6.CopyoutputfromHDFStoS3(optional)"></a>
+#### 6. Copy output from HDFS to S3 (optional)
+
+It's a good idea to save the vectors for running future jobs. Of course, if
+you don't save the vectors to S3, then they will be lost when you terminate
+the EMR cluster. There are two approaches to moving data out of HDFS to S3:
+
+1. SSH into the master node to run distcp, or
+1. Add a jobflow step to run distcp
+
+To login to the master node, use:
+
+
+    elastic-mapreduce --ssh -j JOB_ID
+
+
+Once logged in, do:
+
+
+    hadoop distcp /asf-mail-archives/mahout-0.4/vectors/
+s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/ &
+
+
+Or, you can just add another job flow step to do it:
+
+
+    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
+      --arg hdfs:///asf-mail-archives/mahout-0.4/vectors/ \
+      --arg
+s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/ \
+      -j JOB_ID
+
+
+_Note: You will need all the output from the vectorize step in order to run
+Mahout's clusterdump._
+
+Once copied, if you would like to share your results with the Mahout
+community, make the vectors public in S3 using the Amazon console or s3cmd:
+
+
+    s3cmd setacl --acl-public --recursive
+s3://BUCKET/asf-mail-archives/mahout-0.4/vectors/
+
+
+Dump out the size of the vectors:
+
+
+    bin/mahout vectordump --seqFile
+s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/tfidf-vectors/part-r-00000
+--sizeOnly | more
+
+
+<a name="MahoutonElasticMapReduce-7.k-MeansClustering"></a>
+#### 7. k-Means Clustering
+
+Now that you have vectors, you can do some clustering! The following
+command will create a new jobflow step to run the k-Means job using the
+TFIDF vectors produced by seq2sparse:
+
+
+    elastic-mapreduce --jar
+s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg kmeans \
+      --arg -i --arg /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors/ \
+      --arg -c --arg /asf-mail-archives/mahout-0.4/initial-clusters/ \
+      --arg -o --arg /asf-mail-archives/mahout-0.4/kmeans-clusters \
+      --arg -x --arg 10 \
+      --arg -cd --arg 0.01 \
+      --arg -k --arg 60 \
+      --arg --distanceMeasure --arg
+org.apache.mahout.common.distance.CosineDistanceMeasure \
+      -j JOB_ID
+
+
+Depending on the EC2 instance type and size of your cluster, the k-Means
+job can take a couple of hours to complete. The input is the HDFS location
+of the vectors created by the seq2sparse job. If you copied the vectors to
+S3, then you could also use the s3n protocol. However, since I'm using the
+same EMR job flow, the vectors are already in HDFS, so there is no need to
+pull them from S3.
+
+_Tip: use a convergenceDelta of 0.01 to ensure the clustering job performs
+more than one iteration._
+
+<a name="MahoutonElasticMapReduce-UselynxtoViewtheJobTrackerWebUI"></a>
+##### Use lynx to View the JobTracker Web UI
+
+A somewhat subtle feature of EMR is that you can use lynx to access the
+JobTracker UI from the master node. Login to the master node using:
+
+
+    elastic-mapreduce --ssh -j JOB_ID
+
+
+Once logged in, open the JobTracker UI in lynx using:
+
+
+    lynx http://localhost:9100/
+
+
+Now you can easily monitor the state of running jobs. Or, better yet, you
+can set up an SSH tunnel to port 9100 on the master server using:
+
+
+    ssh -i PATH_TO_KEYPAIR/gsg-keypair.pem \
+      -L 9100:ec2-???-???-???-???.compute-1.amazonaws.com:9100 \
+      hadoop@ec2-???-???-???-???.compute-1.amazonaws.com
+
+
+With this command, you can point your browser to http://localhost:9100 to
+access the JobTracker UI.
+
+<a name="MahoutonElasticMapReduce-8.Shutdownyourcluster"></a>
+#### 8. Shut down your cluster
+
+
+    elastic-mapreduce --terminate -j JOB_ID
+
+
+Verify the cluster is terminated in your Amazon console.

Added: mahout/site/trunk/content/mahout-project.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-project.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-project.mdtext (added)
+++ mahout/site/trunk/content/mahout-project.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,2 @@
+Title: Mahout Project
+Details later.

Added: mahout/site/trunk/content/mahout-wiki.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout-wiki.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout-wiki.mdtext (added)
+++ mahout/site/trunk/content/mahout-wiki.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,184 @@
+Title: Mahout Wiki
+Apache Mahout is a new Apache TLP project to create scalable machine
+learning algorithms under the Apache license.
+
+[TOC]
+
+## General
+[Overview](overview.html)
+ -- Mahout? What's that supposed to be?
+
+[Quickstart](quickstart.html)
+ -- learn how to quickly set up Apache Mahout for your project.
+
+[FAQ](faq.html)
+ -- Frequent questions encountered on the mailing lists.
+
+[Developer Resources](developer-resources.html)
+ -- overview of the Mahout development infrastructure.
+
+[How To Contribute](how-to-contribute.html)
+ -- get involved with the Mahout community.
+
+[How To Become A Committer](how-to-become-a-committer.html)
+ -- become a member of the Mahout development community.
+
+[Hadoop](http://hadoop.apache.org)
+ -- several of our implementations depend on Hadoop.
+
+[Machine Learning Open Source Software](http://mloss.org/software/)
+ -- other projects implementing Open Source Machine Learning libraries.
+
+[Mahout -- The name, history and its pronunciation](mahoutname.html)
+
+## Community
+
+[Who we are](who-we-are.html)
+ -- who are the developers behind Apache Mahout?
+
+[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on Mahout](books-tutorials-and-talks.html)
+
+[Issue Tracker](issue-tracker.html)
+ -- see what features people are working on, submit patches and file bugs.
+
+[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/)
+ -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout)
+ -- download the Mahout source code from svn.
+
+[Mailing lists and IRC](mailing-lists,-irc-and-archives.html)
+ -- links to our mailing lists, IRC channel and archived design and
+algorithm discussions; maybe your question was answered there already.
+
+[Version Control](version-control.html)
+ -- where we track our code.
+
+[Powered By Mahout](powered-by-mahout.html)
+ -- who is using Mahout in production?
+
+[Professional Support](professional-support.html)
+ -- who is offering professional support for Mahout?
+
+[Mahout and Google Summer of Code](gsoc.html)
+  -- All you need to know about Mahout and GSoC.
+
+
+[Glossary of commonly used terms and abbreviations](glossary.html)
+
+## Installation/Setup
+
+[System Requirements](system-requirements.html)
+ -- what do you need to run Mahout?
+
+[Quickstart](quickstart.html)
+ -- get started with Mahout, run the examples and get pointers to further
+resources.
+
+[Downloads](downloads.html)
+ -- a list of Mahout releases.
+
+[Download and installation](buildingmahout.html)
+ -- build Mahout from the sources.
+
+[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html)
+ -- run Mahout on Amazon's EC2.
+
+[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html)
+ -- Run Mahout on Amazon's Elastic Map Reduce
+
+[Integrating Mahout into an Application](mahoutintegration.html)
+ -- integrate Mahout's capabilities in your application.
+
+## Examples
+
+1. [ASF Email Examples](asfemail.html)
+ -- Examples of recommenders, clustering and classification all using a
+public domain collection of 7 million emails.
+
+<a name="MahoutWiki-ImplementationBackground"></a>
+## Implementation Background
+
+<a name="MahoutWiki-RequirementsandDesign"></a>
+### Requirements and Design
+
+[Matrix and Vector Needs](matrix-and-vector-needs.html)
+ -- requirements for Mahout vectors.
+
+[Collection(De-)Serialization](collection(de-)serialization.html)
+
+<a name="MahoutWiki-CollectionsandAlgorithms"></a>
+### Collections and Algorithms
+
+Learn more about [mahout-collections](mahout-collections.html)
+, containers for efficient storage of primitive-type data and open hash
+tables.
+
+Learn more about the [Algorithms](algorithms.html)
+ discussed and employed by Mahout.
+
+Learn more about the [Mahout recommender implementation](recommender-documentation.html)
+.
+
+### Utilities
+
+This section describes tools that might be useful for working with Mahout.
+
+[Converting Content](converting-content.html)
+ -- Mahout has some utilities for converting content such as logs to
+formats more amenable for consumption by Mahout.
+
+[Creating Vectors](creating-vectors.html)
+ -- Mahout's algorithms operate on vectors. Learn more on how to generate
+these from raw data.
+
+[Viewing Result](viewing-result.html)
+ -- How to visualize the result of your trained algorithms.
+
+<a name="MahoutWiki-Data"></a>
+### Data
+
+[Collections](collections.html)
+ -- To try out and test Mahout's algorithms you need training data. We are
+always looking for new training data collections.
+
+<a name="MahoutWiki-Benchmarks"></a>
+### Benchmarks
+
+[Mahout Benchmarks](mahout-benchmarks.html)
+
+## Committer's Resources
+
+* [Testing](testing.html)
+ -- Information on test plans and ideas for testing
+
+### Project Resources
+
+* [Dealing with Third Party Dependencies not in Maven](thirdparty-dependencies.html)
+* [How To Update The Website](how-to-update-the-website.html)
+* [Patch Check List](patch-check-list.html)
+* [How To Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release)
+* [Sonar Code Quality Analysis](https://analysis.apache.org/dashboard/index/63921)
+
+### Additional Resources
+
+* [Apache Machine Status](http://monitoring.apache.org/status/)
+ -- Check to see if SVN and other resources are available.
+* [Committer's FAQ](http://www.apache.org/dev/committers.html)
+* [Apache Dev](http://www.apache.org/dev/)
+
+
+## How To Edit This Wiki
+
+This Wiki is a collaborative site; anyone can contribute and share:
+
+* Create an account by clicking the "Login" link at the top of any page,
+and picking a username and password.
+* Edit any page by pressing Edit at the top of the page.
+
+There are some conventions used on the Mahout wiki:
+
+* *TODO:* is used to denote sections that definitely need to be cleaned up.
+* *Mahout_(version)* (for example, *Mahout_0.2*) is used to draw attention
+to which version of Mahout a feature was (or will be) added to Mahout.
+

Added: mahout/site/trunk/content/mahout.ga.tutorial.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahout.ga.tutorial.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahout.ga.tutorial.mdtext (added)
+++ mahout/site/trunk/content/mahout.ga.tutorial.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,36 @@
+Title: Mahout.GA.Tutorial
+<a name="Mahout.GA.Tutorial-HowtodistributethefitnessevaluationusingMahout.GA"></a>
+# How to distribute the fitness evaluation using Mahout.GA
+
+In any Watchmaker program, you'll have to create an instance of a
+StandaloneEvolutionEngine. For the TSP example this is done in the
+EvolutionaryTravellingSalesman class:
+
+<pre><code>
+private EvolutionEngine<List<String>> getEngine(
+    CandidateFactory<List<String>> candidateFactory,
+    EvolutionaryOperator<List<?>> pipeline, Random rng) {
+  return new StandaloneEvolutionEngine<List<String>>(candidateFactory,
+      pipeline, new RouteEvaluator(distances), selectionStrategy, rng);
+}
+</code></pre>
+
+The RouteEvaluator class is where the fitness of each individual is
+evaluated. If we want to distribute the evaluation over a Hadoop cluster,
+all we have to do is wrap the evaluator in a MahoutFitnessEvaluator and,
+instead of a StandaloneEvolutionEngine, use an STEvolutionEngine:
+
+<pre><code>
+private EvolutionEngine<List<String>> getEngine(
+    CandidateFactory<List<String>> candidateFactory,
+    EvolutionaryOperator<List<?>> pipeline, Random rng) {
+  MahoutFitnessEvaluator<List<String>> evaluator =
+      new MahoutFitnessEvaluator<List<String>>(new RouteEvaluator(distances));
+  return new STEvolutionEngine<List<String>>(candidateFactory, pipeline,
+      evaluator, selectionStrategy, rng);
+}
+</code></pre>
+
+And voila! Your code is ready to run on Hadoop. The complete running
+example is available with the examples in the
+org/apache/mahout/ga/watchmaker/travellingsalesman directory.

Added: mahout/site/trunk/content/mahoutintegration.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahoutintegration.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahoutintegration.mdtext (added)
+++ mahout/site/trunk/content/mahoutintegration.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1 @@
+Title: MahoutIntegration