You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/03 22:36:27 UTC

svn commit: r1538467 [19/20] - in /mahout/site/mahout_cms: ./ cgi-bin/ content/ content/css/ content/developers/ content/general/ content/images/ content/js/ content/users/ content/users/basics/ content/users/classification/ content/users/clustering/ c...

Added: mahout/site/mahout_cms/content/users/clustering/twenty-newsgroups.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/twenty-newsgroups.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/twenty-newsgroups.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/twenty-newsgroups.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,135 @@
+Title: Twenty Newsgroups
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 Newsgroups data set is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the Mahout Bayes
+classifier to create a model that classifies a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+## Prerequisites
+
+* Mahout has been downloaded ([instructions here](http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup))
+* Maven is available
+* Your environment has the following variables:
+<table>
+<tr><td> *HADOOP_HOME* </td><td> Environment variable pointing to where Hadoop lives </td></tr>
+<tr><td> *MAHOUT_HOME* </td><td> Environment variable pointing to where Mahout lives </td></tr>
+</table>
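+
+For example, assuming Hadoop and Mahout are unpacked under /usr/local
+(adjust the paths to match your installation):
+
+    export HADOOP_HOME=/usr/local/hadoop
+    export MAHOUT_HOME=/usr/local/mahout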
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+## Instructions for running the example
+
+1. Start the Hadoop daemons by executing the following commands:
+
+    $ cd $HADOOP_HOME/bin
+    $ ./start-all.sh
+
+1. In the Mahout trunk directory, compile everything and create the
+Mahout job file:
+
+    $ cd $MAHOUT_HOME
+    $ mvn install
+
+1. Run the 20 newsgroups example by executing the script shown below:
+
+    $ ./examples/bin/build-20news-bayes.sh
+
+After MAHOUT-857 is committed (available when 0.6 is released), the command
+will be:
+
+    $ ./examples/bin/classify-20newsgroups.sh
+
+This latter version also allows you to try out running Stochastic Gradient
+Descent (SGD) on the same data.
+
+The script performs the following steps:
+1. Downloads *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz)
+1. Extracts the dataset
+1. Generates the input dataset for training the classifier
+1. Generates the input dataset for testing the classifier
+1. Trains the classifier
+1. Tests the classifier
+
+Output might look like:
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   <--Classified as
+    381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3   0   0    |  398  a     = rec.motorcycles
+    1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4   9   0    |  395  b     = comp.windows.x
+    2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2   0   0    |  376  c     = talk.politics.mideast
+    4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2   0   0    |  364  d     = talk.politics.guns
+    7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0   0   0    |  251  e     = talk.religion.misc
+    10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11  0   0    |  396  f     = rec.autos
+    0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3   0   0    |  397  g     = rec.sport.baseball
+    1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2   0   0    |  399  h     = rec.sport.hockey
+    2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12  7   0    |  385  i     = comp.sys.mac.hardware
+    0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2   0   0    |  394  j     = sci.space
+    0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13  18  0    |  392  k     = comp.sys.ibm.pc.hardware
+    8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1   0   0    |  310  l     = talk.politics.misc
+    0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7   8   0    |  389  m     = comp.graphics
+    6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8   6   0    |  393  n     = sci.electronics
+    2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1   2   0    |  398  o     = soc.religion.christian
+    4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9   0   0    |  396  p     = sci.med
+    0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1   1   0    |  396  q     = sci.crypt
+    10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0   1   0    |  319  r     = alt.atheism
+    4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341 2   0    |  390  s     = misc.forsale
+    8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3   256 0    |  394  t     = comp.os.ms-windows.misc
+    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    |  0    u     = unknown
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## Complementary Naive Bayes
+
+To train a CBayes classifier using bi-grams:
+
+    $> $MAHOUT_HOME/bin/mahout trainclassifier \
+      -i 20news-input \
+      -o newsmodel \
+      -type cbayes \
+      -ng 2 \
+      -source hdfs
+
+
+To test a CBayes classifier using bi-grams:
+
+    $> $MAHOUT_HOME/bin/mahout testclassifier \
+      -m newsmodel \
+      -d 20news-input \
+      -type cbayes \
+      -ng 2 \
+      -source hdfs \
+      -method mapreduce
+

Added: mahout/site/mahout_cms/content/users/clustering/viewing-result.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/viewing-result.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/viewing-result.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/viewing-result.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,10 @@
+Title: Viewing Result
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages

Added: mahout/site/mahout_cms/content/users/clustering/viewing-results.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/viewing-results.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/viewing-results.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/viewing-results.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,44 @@
+Title: Viewing Results
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of these jobs.
+The page is organized by algorithm.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
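+Many Mahout jobs write their results as Hadoop sequence files, and the
+*org.apache.mahout.utils.SequenceFileDumper* utility in the Mahout utils
+module can print them as text. As a minimal sketch (the class location and
+options can vary between releases, so check the one you are running),
+print its options with:
+
+    java -cp "*" org.apache.mahout.utils.SequenceFileDumper --help
+
+Depending on your release, the same utility may also be available through
+the bin/mahout driver script as the *seqdumper* command.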
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
+      --seqFileDir ./solr-clust-n2/out/clusters-2 \
+      --dictionary ./solr-clust-n2/dictionary.txt \
+      --substring 100 --pointsDir ./solr-clust-n2/out/points/
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

Added: mahout/site/mahout_cms/content/users/clustering/visualizing-sample-clusters.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/visualizing-sample-clusters.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/visualizing-sample-clusters.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/visualizing-sample-clusters.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,76 @@
+Title: Visualizing Sample Clusters
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples to visualize the sample clusters created by
+various clustering algorithms, such as:
+* Canopy Clustering
+* Dirichlet Process
+* KMeans
+* Fuzzy KMeans
+* MeanShift Canopy
+* Spectral KMeans
+* MinHash
+
+<a name="VisualizingSampleClusters-Note"></a>
+##### Note
+These are Swing programs. You have to be in a window system on the same
+machine you run these on, or be logged in via a "remote desktop" or VNC program.
+
+<a name="VisualizingSampleClusters-Pre-Prep"></a>
+# Pre - Prep
+
+For visualizing the clusters, you just have to execute the Java
+classes in the org.apache.mahout.clustering.display package of the
+mahout-examples module. If you are using Eclipse, set up mahout-examples as
+a project as specified in [Working with Maven in Eclipse](buildingmahout#mahout_maven_eclipse.html).
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in org.apache.mahout.clustering.display can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over them:
+1. DisplayClustering - generates 1000 samples from three symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the generateSamples() method to change the sample points used by these
+programs.
+1. DisplayClustering - displays initial areas of generated points
+1. DisplayDirichlet - uses Dirichlet Process clustering
+1. DisplayCanopy - uses Canopy clustering
+1. DisplayKMeans - uses k-Means clustering
+1. DisplayFuzzyKMeans - uses Fuzzy k-Means clustering
+1. DisplayMeanShift - uses MeanShift clustering
+1. DisplaySpectralKMeans - uses Spectral KMeans via map-reduce algorithm
+
+If you are using Eclipse and have set it up as specified in Pre-Prep, just
+right-click on each of the classes mentioned above and choose "Run As -
+Java Application". To run these directly from the command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+    # substitute the other class names above for DisplayClustering
+    # Note: the DisplaySpectralKMeans program runs a Hadoop job that takes about
+    # 3 minutes on a laptop. Set MAVEN_OPTS (e.g. MAVEN_OPTS=-Xmx300m) to give the
+    # program enough memory. You may find that some of the other programs also
+    # need more memory.
+
+
+Note:
+* Some of these programs display the sample points and then superimpose all
+of the clusters from each iteration. The last iteration's clusters are in
+bold red and the previous several are colored (orange, yellow, green, blue,
+magenta) in order after which all earlier clusters are in light grey. This
+helps to visualize how the clusters converge upon a solution over multiple
+iterations.
+
+* By changing the parameter values (k, ALPHA_0, numIterations) and the
+display SIGNIFICANCE you can obtain different results.
+
+<a name="VisualizingSampleClusters-ScreenCaptureAnimation"></a>
+# Screen Capture Animation
+See [Sample Clusters Animation](sample-clusters-animation.html)
+for screen captures of all the above programs and an animated gif.

Added: mahout/site/mahout_cms/content/users/emr/mahout-on-amazon-ec2.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/emr/mahout-on-amazon-ec2.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/emr/mahout-on-amazon-ec2.mdtext (added)
+++ mahout/site/mahout_cms/content/users/emr/mahout-on-amazon-ec2.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,226 @@
+Title: Mahout on Amazon EC2
+Amazon EC2 is a compute-on-demand platform sold by Amazon.com that allows
+users to purchase one or more host machines on an hourly basis and execute
+applications.  Since Hadoop can run on EC2, it is also possible to run
+Mahout on EC2.	The following sections will detail how to create a Hadoop
+cluster from the ground up. Alternatively, you can use an existing Hadoop
+AMI, in which case, please see [Use an Existing Hadoop AMI](use-an-existing-hadoop-ami.html).
+
+  
+<a name="MahoutonAmazonEC2-Prerequisites"></a>
+# Prerequisites
+
+To run Mahout on EC2 you need to start up a Hadoop cluster on one or more
+instances of a Hadoop-0.20.2 compatible Amazon Machine Image (AMI).
+Unfortunately, there do not currently exist any public AMIs that support
+Hadoop-0.20.2; you will have to create one. The following steps begin with
+a public Cloudera Ubuntu AMI that comes with Java installed on it. You
+could use any other AMI with Java installed or you could use a clean AMI
+and install Java yourself. These instructions assume some familiarity with
+Amazon EC2 concepts and terminology. See the Amazon EC2 User Guide, in
+References below.
+
+1. From the [AWS Management Console](https://console.aws.amazon.com/ec2/home#c=EC2&s=Home)
+/AMIs, start the following AMI (_ami-8759bfee_)
+
+    cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-x86_64.manifest.xml 
+
+1. From the AWS Console/Instances, select the instance and right-click
+"Connect" to get the connect string, which contains your <instance public
+DNS name>
+
+    > ssh -i <gsg-keypair.pem> root@<instance public DNS name>
+
+1. In the root home directory evaluate:
+
+    # apt-get update
+    # apt-get upgrade  // This is optional, but probably advisable since the AMI is over a year old.
+    # apt-get install python-setuptools
+    # easy_install "simplejson==2.0.9"
+    # easy_install "boto==1.8d"
+    # apt-get install ant
+    # apt-get install subversion
+    # apt-get install maven2
+
+1. Add the following to your .profile
+
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun
+    export HADOOP_HOME=/usr/local/hadoop-0.20.2
+    export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
+    export MAHOUT_HOME=/usr/local/mahout-0.4
+    export MAHOUT_VERSION=0.4-SNAPSHOT
+    export MAVEN_OPTS=-Xmx1024m
+
+1. Upload the Hadoop distribution and configure it. This distribution is not
+available on the Hadoop site. You can download a beta version from [Cloudera's CDH3 distribution](http://archive.cloudera.com/cdh/3/)
+
+    > scp -i <gsg-keypair.pem> <where>/hadoop-0.20.2.tar.gz root@<instance public DNS name>:.
+    
+    # tar -xzf hadoop-0.20.2.tar.gz
+    # mv hadoop-0.20.2 /usr/local/.
+
+1. Configure Hadoop for temporary single node operation
+1. # add the following to $HADOOP_HOME/conf/hadoop-env.sh
+
+    # The java implementation to use.  Required.
+    export JAVA_HOME=/usr/lib/jvm/java-6-sun
+    
+    # The maximum amount of heap to use, in MB. Default is 1000.
+    export HADOOP_HEAPSIZE=2000
+
+1. # add the following to $HADOOP_HOME/conf/core-site.xml and also
+$HADOOP_HOME/conf/mapred-site.xml
+
+    <configuration>
+      <property>
+        <name>fs.default.name</name>
+        <value>hdfs://localhost:9000</value>
+      </property>
+    
+      <property>
+        <name>mapred.job.tracker</name>
+        <value>localhost:9001</value>
+      </property>
+    
+      <property>
+        <name>dfs.replication</name>
+        <value>1</value>
+    	<!-- set to 1 to reduce warnings when 
+    	running on a single node -->
+      </property>
+    </configuration>
+
+1. # set up authorized keys for localhost login w/o passwords and format your
+name node
+
+    # ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
+    # cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
+    # $HADOOP_HOME/bin/hadoop namenode -format
+
+1. Checkout and build Mahout from trunk. Alternatively, you can upload a
+Mahout release tarball and install it as we did with the Hadoop tarball
+(Don't forget to update your .profile accordingly).
+
+    # svn co http://svn.apache.org/repos/asf/mahout/trunk mahout 
+    # cd mahout
+    # mvn clean install
+    # cd ..
+    # mv mahout /usr/local/mahout-0.4
+
+1. Run Hadoop, just to prove you can, and test Mahout by building the
+Reuters dataset on it. Finally, delete the files and shut it down.
+
+    # $HADOOP_HOME/bin/hadoop namenode -format
+    # $HADOOP_HOME/bin/start-all.sh
+    # jps	  // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    
+    # $HADOOP_HOME/bin/stop-all.sh
+    # rm -rf /tmp/* 		  // delete the Hadoop files
+
+1. Remove the single-host stuff you added to $HADOOP_HOME/conf/core-site.xml
+and $HADOOP_HOME/conf/mapred-site.xml in step #6b and verify you are happy
+with the other conf file settings. The Hadoop startup scripts will not make
+any changes to them. In particular, upping the Java heap size is required
+for many of the Mahout jobs.
+
+       // $HADOOP_HOME/conf/mapred-site.xml
+       <property>
+         <name>mapred.child.java.opts</name>
+         <value>-Xmx2000m</value>
+       </property>
+
+1. Bundle your image into a new AMI, upload it to S3 and register it so it
+can be launched multiple times to construct a Mahout-ready Hadoop cluster.
+(See Amazon's [Preparing And Creating AMIs](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?PreparingAndCreatingAMIs.html) for details).
+
+    // copy your AWS private key file and certificate file to /mnt on your instance (you don't want to leave these around in the AMI).
+    > scp -i <gsg-keypair.pem> <your AWS cert directory>/*.pem root@<instance public DNS name>:/mnt/.
+    
+    # Note that ec2-bundle-vol may fail if EC2_HOME is set.  So you may want to temporarily unset EC2_HOME before running the bundle command.  However the shell will need to have the correct value of EC2_HOME set before running the ec2-register step.
+    
+    # ec2-bundle-vol -k /mnt/pk*.pem -c /mnt/cert*.pem -u <your-AWS-user_id> -d /mnt -p mahout
+    # ec2-upload-bundle -b <your-s3-bucket> -m /mnt/mahout.manifest.xml -a <your-AWS-access_key> -s <your-AWS-secret_key>
+    # ec2-register -K /mnt/pk-*.pem -C /mnt/cert-*.pem <your-s3-bucket>/mahout.manifest.xml
+
+<a name="MahoutonAmazonEC2-GettingStarted"></a>
+# Getting Started
+
+1. Now you can go back to your AWS Management Console and try launching a
+single instance of your image. Once this launches, make sure you can
+connect to it and test it by re-running the test code.	If you removed the
+single host configuration added in step 6(b) above, you will need to re-add
+it before you can run this test.  To test run (again):
+
+    # $HADOOP_HOME/bin/hadoop namenode -format
+    # $HADOOP_HOME/bin/start-all.sh
+    # jps	  // you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    
+    # $HADOOP_HOME/bin/stop-all.sh
+    # rm -rf /tmp/* 		  // delete the Hadoop files
+
+
+1. Now that you have a working Mahout-ready AMI, follow [Hadoop's instructions](http://wiki.apache.org/hadoop/AmazonEC2) to configure their scripts for your environment.
+1. # edit bin/hadoop-ec2-env.sh, setting the following environment variables:
+
+    AWS_ACCOUNT_ID
+    AWS_ACCESS_KEY_ID
+    AWS_SECRET_ACCESS_KEY
+    S3_BUCKET
+    (and perhaps others depending upon your environment)
+
+1. # edit bin/launch-hadoop-master and bin/launch-hadoop-slaves, setting:
+
+    AMI_IMAGE
+
+1. # finally, launch your cluster and log in
+
+    > bin/hadoop-ec2 launch-cluster test-cluster 2
+    > bin/hadoop-ec2 login test-cluster
+    # ...  
+    # exit
+    > bin/hadoop-ec2 terminate-cluster test-cluster     // when you are done with it
+
+
+<a name="MahoutonAmazonEC2-RunningtheExamples"></a>
+# Running the Examples
+1. Submit the Reuters test job
+
+    # cd $MAHOUT_HOME
+    # ./examples/bin/build-reuters.sh
+    // the warnings about configuration files do not seem to matter
+
+1. See the Mahout [Quickstart](quickstart.html) page for more examples
+<a name="MahoutonAmazonEC2-References"></a>
+# References
+
+* [Amazon EC2 User Guide](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html)
+* [Hadoop's instructions](http://wiki.apache.org/hadoop/AmazonEC2)
+
+
+
+<a name="MahoutonAmazonEC2-Recognition"></a>
+# Recognition
+
+Some of the information available here was made possible through the "Amazon Web
+Services Apache Projects Testing Program".

Added: mahout/site/mahout_cms/content/users/emr/mahout-on-elastic-mapreduce.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/emr/mahout-on-elastic-mapreduce.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/emr/mahout-on-elastic-mapreduce.mdtext (added)
+++ mahout/site/mahout_cms/content/users/emr/mahout-on-elastic-mapreduce.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,522 @@
+Title: Mahout on Elastic MapReduce
+<a name="MahoutonElasticMapReduce-Introduction"></a>
+# Introduction
+
+This page details the set of steps that was necessary to get an example of
+k-Means clustering running on Amazon's [Elastic MapReduce](http://aws.amazon.com/elasticmapreduce/) (EMR).
+
+Note: Some of this work is due in part to credits donated by the Amazon Web
+Services Apache Projects Testing Program.
+
+<a name="MahoutonElasticMapReduce-GettingStarted"></a>
+# Getting Started
+
+   * Get yourself an EMR account.  If you're already using EC2, then you
+can do this from [Amazon's AWS Management Console](https://console.aws.amazon.com/),
+which has a tab for running EMR.
+   * Get the [ElasticFox](https://addons.mozilla.org/en-US/firefox/addon/11626)
+and [S3Fox](https://addons.mozilla.org/en-US/firefox/search?q=s3fox&cat=all)
+Firefox extensions.  These will make it easy to monitor running EMR
+instances, upload code and data, and download results.
+   * Download the [Ruby command line client for EMR](http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262).
+You can do things from the GUI, but when you're in the midst of trying
+to get something running, the CLI client will make life a lot easier.
+   * Have a look at [Common Problems Running Job Flows](http://developer.amazonwebservices.com/connect/thread.jspa?messageID=124694&#124694)
+and [Developing and Debugging Job Flows](http://developer.amazonwebservices.com/connect/message.jspa?messageID=124695#124695)
+in the EMR forum at Amazon.  They were tremendously useful.
+   * Make sure that you're up to date with the Mahout source.  The fix for [Issue 118](http://issues.apache.org/jira/browse/MAHOUT-118)
+is required to get things running when you're sending output to an S3
+bucket.
+   * Build the Mahout core and examples.
+
+Note that the Hadoop running on EMR is version 0.20.0.
+The EMR GUI in the AWS Management Console provides a number of examples of
+using EMR, and you might want to try running one of these to get started.
+
+One big gotcha that I discovered is that the S3N file system for Hadoop has
+a couple of weird cases that boil down to the following advice: if you're
+naming a directory in an s3n URI, make sure that it ends in a slash, and do
+not try to use a top-level S3 bucket name as the place where your
+Mahout output will go; always include a subdirectory.
+
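+For example (the bucket and path names here are illustrative only):
+
+    # fine: the directory URI ends in a slash and output goes to a sub-directory
+    s3n://my-bucket/mahout/output/
+    # asking for trouble: a bare top-level bucket as the output location
+    s3n://my-bucket
+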
+<a name="MahoutonElasticMapReduce-UploadingCodeandData"></a>
+# Uploading Code and Data
+
+I decided that I would use separate S3 buckets for the Mahout code, the
+input for the clustering (I used the synthetic control data, you can find
+it easily from the [Quickstart](quickstart.html) page), and the output of the clustering.
+
+You will need to upload:
+1. The Mahout Job jar.  For the example here, we are using
+*mahout-core-0.4-SNAPSHOT.job*
+1. The data.  In this example, we uploaded two files: dictionary.txt and
+part-out.vec.  The latter is the main vector file and the former is the
+dictionary that maps words to columns.	It was created by converting a
+Lucene index to Mahout vectors.
+
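+One way to get these artifacts into S3 is with the s3cmd tool described
+later on this page (the S3Fox extension mentioned above works just as
+well); the bucket names below are illustrative only:
+
+    s3cmd put mahout-core-0.4-SNAPSHOT.job s3://my-mahout-code/
+    s3cmd put dictionary.txt part-out.vec s3://my-mahout-input/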
+
+<a name="MahoutonElasticMapReduce-Runningk-meansClustering"></a>
+# Running k-means Clustering
+
+EMR offers two modes for running MapReduce jobs.  The first is a
+"streaming" mode where you provide the source for single-step mapper and
+reducer functions (you can use languages other than Java for this).  The
+second mode is called "Custom Jar" and it gives you full control over the
+job steps that will run.  This is the mode that we need to use to run
+Mahout.  
+
+In order to run in Custom Jar mode, you need to look at the example that
+you want to run and figure out the arguments that you need to provide to
+the job.  Essentially, you need to know the command line that you would
+give to bin/hadoop in order to run the job, including whatever parameters
+the job needs to run.  
+
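+For orientation, the arguments used in the following sections correspond
+roughly to a local invocation of the form below (a sketch only; the jar
+name and s3n paths are the same placeholders used later on this page):
+
+    bin/hadoop jar mahout-core-0.4-SNAPSHOT.job \
+      org.apache.mahout.clustering.kmeans.KMeansDriver \
+      --input s3n://PATH/part-out.vec --clusters s3n://PATH/kmeans/clusters/ \
+      -k 10 --output s3n://PATH/out-9-11/ \
+      --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
+      --convergenceDelta 0.001 --overwrite --maxIter 50 --clustering
+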
+<a name="MahoutonElasticMapReduce-UsingtheGUI"></a>
+## Using the GUI
+
+The EMR GUI is an easy way to start up a Custom Jar run, but it doesn't
+have the full functionality of the CLI.  Basically, you tell the GUI where
+in S3 the jar file is using a Hadoop s3n URI like
+*s3n://PATH/mahout-core-0.4-SNAPSHOT.job*.  The GUI will check and make
+sure that the given file exists, which is a nice sanity check.	You can
+then provide the arguments for the job just as you would on the command
+line.  The arguments for the k-means job were as follows:
+
+
+    org.apache.mahout.clustering.kmeans.KMeansDriver --input s3n://news-vecs/part-out.vec --clusters s3n://news-vecs/kmeans/clusters-9-11/ -k 10 --output s3n://news-vecs/out-9-11/ --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure --convergenceDelta 0.001 --overwrite --maxIter 50 --clustering
+
+
+TODO: Screenshot
+
+The main failing with the GUI mode is that you can only specify a single
+job to run, and you can't run another job in the same set of instances. 
+Recall that on AWS you pay for partial hours at the hourly rate, so if your
+job fails in the first 10 seconds, you pay for the full hour, and if you try
+again, you're going to pay for another hour.
+
+Because of this, using a command line interface (CLI) is strongly
+recommended.
+
+<a name="MahoutonElasticMapReduce-UsingtheCLI"></a>
+## Using the CLI
+
+If you're in development mode, and trying things out, EMR allows you to set
+up a set of instances and leave them running.  Once you've done this, you
+can add job steps to the set of instances as you like.	This solves the "10
+second failure" problem that I described above and lets you get full value
+for your EMR dollar.  Amazon has pretty good [documentation for the CLI](http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?CHAP_UsingEMR.html),
+which you'll need to read to figure out how to do things like set up your
+AWS credentials for the EMR CLI.
+
+You can start up a job flow that will keep running using an invocation like
+the following:
+
+
+    ./elastic-mapreduce --create --alive \
+       --log-uri s3n://PATH_FOR_LOGS/ --key-pair YOUR_KEY \
+       --num-instances 2 --name NAME_HERE
+
+
+Fill in the name, key pair and path for logs as appropriate. This call
+returns the name of the job flow, and you'll need that for subsequent calls
+to add steps to the job flow. You can, however, retrieve it at any time by
+calling:
+
+    ./elastic-mapreduce --list
+
+
+Let's list our job flows:
+
+
+    [stgreen@dhcp-ubur02-74-153 14:16:15 emr]$ ./elastic-mapreduce --list
+    j-3JB4UF7CQQ025     WAITING     ec2-174-129-90-97.compute-1.amazonaws.com    kmeans
+
+
+At this point, everything's started up, and it's waiting for us to add a
+step to the job.  When we started the job flow, we specified a key pair
+that we created earlier so that we can log into the master while the job
+flow is running:
+
+
+     elastic-mapreduce --ssh -j j-3JB4UF7CQQ025
+
+
+Let's add a step to run a job:
+
+
+     elastic-mapreduce -j j-3JB4UF7CQQ025 \
+       --jar s3n://PATH/mahout-core-0.4-SNAPSHOT-job.jar \
+       --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
+       --arg --input --arg s3n://PATH/part-out.vec \
+       --arg --clusters --arg s3n://PATH/kmeans/clusters/ \
+       --arg -k --arg 10 \
+       --arg --output --arg s3n://PATH/out-9-11/ \
+       --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
+       --arg --convergenceDelta --arg 0.001 \
+       --arg --overwrite \
+       --arg --maxIter --arg 50 \
+       --arg --clustering
+
+
+When you do this, the job flow goes into the *RUNNING* state for a while
+and then returns to *WAITING* once the step has finished.  You can use
+the CLI or the GUI to monitor the step while it runs.  Once you've finished
+with your job flow, you can shut it down the following way:
+
+
+    ./elastic-mapreduce -j j-3JB4UF7CQQ025 --terminate
+
+
+and go look in your S3 buckets to find your output and logs.
+
+
+<a name="MahoutonElasticMapReduce-Troubleshooting"></a>
+# Troubleshooting
+
+The primary means for understanding what went wrong is via the logs and
+stderr/stdout.	When running on EMR, stderr and stdout are captured to
+files in your log directories.  Additionally, logging is set up to write out
+to a file called syslog.  To view these in the AWS Console, go to your logs
+directory, then the folder with the same JobFlow id as above
+(j-3JB4UF7CQQ025), then the steps folder and then the appropriate step
+number (usually 1 for this case).
+
+That is, go to the folder s3n://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1.  In
+this directory, you will find stdout, stderr, syslog and potentially a few
+other logs. 
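+
+If you prefer to read the logs locally, one option is to pull them down
+with the s3cmd tool introduced later on this page (the path reuses the
+placeholders above):
+
+    s3cmd get --recursive s3://PATH_TO_LOGS/j-3JB4UF7CQQ025/steps/1/ ./emr-logs/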
+
+
+See [this thread](http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945&tstart=15) for some early user experience with Mahout on EMR.
+
+<a name="MahoutonElasticMapReduce-BuildingVectorsforLargeDocumentSets"></a>
+## Building Vectors for Large Document Sets
+
+Use the following steps as a guide to using Elastic MapReduce (EMR) to
+create sparse vectors needed for running Mahout clustering algorithms on
+large document sets. This section evolved from benchmarking Mahout's
+clustering algorithms using a large document set. Specifically, we used the
+ASF mail archives that have been parsed and converted to the Hadoop
+SequenceFile format (block-compressed) and saved to a public S3 folder:
+*s3://asf-mail-archives/mahout-0.4/sequence-files*. Overall, there are
+6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.
+
+<a name="MahoutonElasticMapReduce-1.Setupelastic-mapreduce-ruby"></a>
+#### 1. Setup elastic-mapreduce-ruby
+
+As discussed previously, make sure you install the *elastic-mapreduce-ruby*
+tool. On Debian-based Linux like Ubuntu, use the following commands to
+install elastic-mapreduce-ruby's dependencies:
+
+
+    apt-get install ruby1.8
+    apt-get install libopenssl-ruby1.8
+    apt-get install libruby1.8-extras
+
+
+Once these dependencies are installed, download and extract the
+elastic-mapreduce-ruby application. We use */mnt/dev* as the base working
+directory because this process was originally conducted on an EC2 instance;
+be sure to replace this path with the correct path for your environment as
+you work through these steps.
+
+
+    mkdir -p /mnt/dev/elastic-mapreduce /mnt/dev/downloads
+    cd /mnt/dev/downloads
+    wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
+    cd /mnt/dev/elastic-mapreduce
+    unzip /mnt/dev/downloads/elastic-mapreduce-ruby.zip
+
+
+Please refer to [Amazon Elastic MapReduce Ruby Client](http://aws.amazon.com/developertools/2264?_encoding=UTF8&jiveRedirect=1) for a
+detailed explanation, but to get running quickly, all you need to do
+is create a file named *credentials.json* in the elastic-mapreduce
+directory, such as */mnt/dev/elastic-mapreduce/credentials.json*. The
+credentials.json should contain the following information (change to match
+your environment):
+
+
+    { 
+      "access-id": "YOUR_ACCESS_KEY",
+      "private-key": "YOUR_SECRET_KEY", 
+      "key-pair": "gsg-keypair", 
+      "key-pair-file": "/mnt/dev/aws/gsg-keypair.pem", 
+      "region": "us-east-1", 
+      "log-uri": "s3n://BUCKET/asf-mail-archives/logs/"
+    }
+
+  
+If you are confused about any of these parameters, please read [Understanding Access Credentials for AWS/EC2](http://alestic.com/2009/11/ec2-credentials).
+Also, it's a good idea to add the elastic-mapreduce directory to your
+PATH. To verify it is working correctly, simply do:
+
+
+    elastic-mapreduce --list
+
+
+<a name="MahoutonElasticMapReduce-2.Setups3cmdandCreateaBucket"></a>
+#### 2. Setup s3cmd and Create a Bucket
+
+It's also beneficial when working with EMR and S3 to install [s3cmd](http://s3tools.org/s3cmd),
+which helps you interact with S3 using easy to understand command-line
+options. To install on Ubuntu, simply do:
+
+
+    sudo apt-get install s3cmd
+
+
+Once installed, configure s3cmd by doing:
+
+
+    s3cmd --configure
+
+
+If you don't have an S3 bucket to work with, then please create one using:
+
+
+    s3cmd mb s3://BUCKET
+
+
+Replace this bucket name in the remaining steps whenever you see
+*s3://BUCKET* in the steps below.
+
+<a name="MahoutonElasticMapReduce-3.LaunchEMRCluster"></a>
+#### 3. Launch EMR Cluster
+
+Once elastic-mapreduce is installed, start a cluster with no jobflow steps:
+
+
+    elastic-mapreduce --create --alive \
+      --log-uri s3n://BUCKET/emr/logs/ \
+      --key-pair gsg-keypair \
+      --slave-instance-type m1.xlarge \
+      --master-instance-type m1.xlarge \
+      --num-instances # \
+      --name mahout-0.4-vectorize
+
+
+This will create an EMR Job Flow named "mahout-0.4-vectorize" in the
+US-East region using EC2 xlarge instances. Take note of the Job ID returned
+as you will need it to add the "seq2sparse" step to the Job Flow. It can
+take a few minutes for the cluster to start; the job flow enters a
+"waiting" status when it is ready. We launch the EMR instances in the
+*us-east-1* region so that we don't incur data transfer charges to/from
+US-Standard S3 buckets (credentials.json => "region":"us-east-1").
+
+When vectorizing large document sets, you need to distribute processing
+across as many reducers as possible. This also helps keep the size of the
+vector files more manageable. I'll leave it to you to decide how many
+instances to allocate, but keep in mind that one will be dedicated as the
+master (Hadoop NameNode). Also, it took about 75 minutes to run the
+seq2sparse job on 19 xlarge instances when using *maxNGramSize=2* (~190
+normalized instance hours – not cheap). I think you'll be safe to use
+about 10-13 instances and still finish in under 2 hours. Also, if you are
+not creating bi-grams, then you won't need as much horse-power; a four node
+cluster with 3 reducers per node is sufficient for generating vectors with
+*maxNGramSize = 1* in less than 30 minutes.
+
+_Tip: Amazon provides a bootstrap action to configure the cluster for
+running memory intensive jobs. For more information about this, see: [http://forums.aws.amazon.com/ann.jspa?annID=834](http://forums.aws.amazon.com/ann.jspa?annID=834)_
+
+<a name="MahoutonElasticMapReduce-4.CopyMahoutJARtoS3"></a>
+#### 4. Copy Mahout JAR to S3
+
+The Mahout 0.4 JAR containing a custom Lucene Analyzer
+(*org.apache.mahout.text.MailArchivesClusteringAnalyzer*) is available
+at:
+
+
+    s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar 
+
+
+The source code is available at [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588).
+
+If you need to use your own Mahout JAR, use s3cmd to copy it to your S3
+bucket:
+
+
+    s3cmd put JAR_FILE s3://BUCKET/
+
+
+<a name="MahoutonElasticMapReduce-5.Vectorize"></a>
+#### 5. Vectorize
+
+Schedule a jobflow step to vectorize (1-grams only) using Mahout's
+seq2sparse job:
+
+
+    elastic-mapreduce --jar s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg seq2sparse \
+      --arg -i --arg s3n://asf-mail-archives/mahout-0.4/sequence-files/ \
+      --arg -o --arg /asf-mail-archives/mahout-0.4/vectors/ \
+      --arg --weight --arg tfidf \
+      --arg --minSupport --arg 500 \
+      --arg --maxDFPercent --arg 70 \
+      --arg --norm --arg 2 \
+      --arg --numReducers --arg # \
+      --arg --analyzerName --arg org.apache.mahout.text.MailArchivesClusteringAnalyzer \
+      --arg --maxNGramSize --arg 1 \
+      -j JOB_ID
+
+
+You need to determine the correct number of reducers based on the EC2
+instance type and size of your cluster. For xlarge nodes, set the number of
+reducers to 3 x N (where N is the size of your EMR cluster not counting the
+master node). For large instances, 2 reducers per node is probably safe
+unless your job is extremely CPU intensive, in which case use only 1
+reducer per node.
+
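+For example, on the 4+1 node cluster of xlarge instances mentioned below,
+3 reducers per worker node works out to --numReducers 12.
+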
+Be sure to use Hadoop's *s3n* protocol for the input parameter
+(-i s3n://asf-mail-archives/mahout-0.4/sequence-files/) so that Mahout/Hadoop
+can find the SequenceFiles in S3. Also, notice that we've configured the
+job to send output to HDFS instead of S3. This is needed to work around an
+issue with multi-step jobs and EMR (see [MAHOUT-598](https://issues.apache.org/jira/browse/MAHOUT-598)).
+Once the job completes, you can copy the results to S3 from the EMR
+cluster's HDFS using distcp.
+
+The job shown above created 6,076,937 vectors with 20,444 dimensions in
+around 28 minutes on a 4+1 node cluster of EC2 xlarge instances. Depending
+on the number of unique terms, setting maxNGramSize greater than 1 has a
+major impact on the execution time of the seq2sparse job. For example, the
+same job with maxNGramSize=2 can take up to 2 hours with the bulk of the
+time spent creating collocations; see [Collocations](https://cwiki.apache.org/MAHOUT/collocations.html).
+
+To monitor the status of the job, use:
+
+
+    elastic-mapreduce --logs -j JOB_ID
+
+
+<a name="MahoutonElasticMapReduce-6.CopyoutputfromHDFStoS3(optional)"></a>
+#### 6. Copy output from HDFS to S3 (optional)
+
+It's a good idea to save the vectors for running future jobs. Of course, if
+you don't save the vectors to S3, then they will be lost when you terminate
+the EMR cluster. There are two approaches to moving data out of HDFS to S3:
+
+1. SSH into the master node to run distcp, or
+1. Add a jobflow step to run distcp
+
+To login to the master node, use:
+
+
+    elastic-mapreduce --ssh -j JOB_ID
+
+
+Once logged in, do:
+
+
+    hadoop distcp /asf-mail-archives/mahout-0.4/vectors/ \
+      s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/ &
+
+
+Or, you can just add another job flow step to do it:
+
+
+    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
+      --arg hdfs:///asf-mail-archives/mahout-0.4/vectors/ \
+      --arg s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/ \
+      -j JOB_ID
+
+
+_Note: You will need all the output from the vectorize step in order to run
+Mahout's clusterdump._
+
+Once copied, if you would like to share your results with the Mahout
+community, make the vectors public in S3 using the Amazon console or s3cmd:
+
+
+    s3cmd setacl --acl-public --recursive s3://BUCKET/asf-mail-archives/mahout-0.4/vectors/
+
+
+Dump out the size of the vectors:
+
+
+    bin/mahout vectordump --seqFile \
+      s3n://ACCESS_KEY:SECRET_KEY@BUCKET/asf-mail-archives/mahout-0.4/vectors/tfidf-vectors/part-r-00000 \
+      --sizeOnly | more
+
+
+<a name="MahoutonElasticMapReduce-7.k-MeansClustering"></a>
+#### 7. k-Means Clustering
+
+Now that you have vectors, you can do some clustering! The following
+command will create a new jobflow step to run the k-Means job using the
+TFIDF vectors produced by seq2sparse:
+
+
+    elastic-mapreduce --jar s3://asf-mail-archives/mahout-0.4/mahout-examples-0.4-job-ext.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg kmeans \
+      --arg -i --arg /asf-mail-archives/mahout-0.4/vectors/tfidf-vectors/ \
+      --arg -c --arg /asf-mail-archives/mahout-0.4/initial-clusters/ \
+      --arg -o --arg /asf-mail-archives/mahout-0.4/kmeans-clusters \
+      --arg -x --arg 10 \
+      --arg -cd --arg 0.01 \
+      --arg -k --arg 60 \
+      --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
+      -j JOB_ID
+
+
+Depending on the EC2 instance type and size of your cluster, the k-Means
+job can take a couple of hours to complete. The input is the HDFS location
+of the vectors created by the seq2sparse job. If you copied the vectors to
+S3, then you could also use the s3n protocol. However, since I'm using the
+same EMR job flow, the vectors are already in HDFS, so there is no need to
+pull them from S3.
+
+_Tip: use a convergenceDelta of 0.01 to ensure the clustering job performs
+more than one iteration._
+
+<a name="MahoutonElasticMapReduce-UselynxtoViewtheJobTrackerWebUI"></a>
+##### Use lynx to View the JobTracker Web UI
+
+A somewhat subtle feature of EMR is that you can use lynx to access the
+JobTracker UI from the master node. Login to the master node using:
+
+
+    elastic-mapreduce --ssh -j JOB_ID
+
+
+Once logged in, launch the JobTracker using:
+
+
+    lynx http://localhost:9100/
+
+
+Now you can easily monitor the state of running jobs. Or, better yet, you
+can set up an SSH tunnel to port 9100 on the master server using:
+
+
+    ssh -i PATH_TO_KEYPAIR/gsg-keypair.pem \
+      -L 9100:ec2-???-???-???-???.compute-1.amazonaws.com:9100 \
+      hadoop@ec2-???-???-???-???.compute-1.amazonaws.com
+
+
+With this command, you can point your browser to http://localhost:9100 to
+access the JobTracker UI.
+
+<a name="MahoutonElasticMapReduce-8.Shutdownyourcluster"></a>
+#### 8. Shut down your cluster
+
+
+    elastic-mapreduce --terminate -j JOB_ID
+
+
+Verify the cluster is terminated in your Amazon console.

Added: mahout/site/mahout_cms/content/users/emr/use-an-existing-hadoop-ami.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/emr/use-an-existing-hadoop-ami.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/emr/use-an-existing-hadoop-ami.mdtext (added)
+++ mahout/site/mahout_cms/content/users/emr/use-an-existing-hadoop-ami.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,292 @@
+Title: Use an Existing Hadoop AMI
+The following process was developed for launching Hadoop clusters in EC2 in
+order to benchmark Mahout's clustering algorithms using a large document
+set (see Mahout-588). Specifically, we used the ASF mail archives that have
+been parsed and converted to the Hadoop SequenceFile format
+(block-compressed) and saved to a public S3 folder:
+s3://asf-mail-archives/mahout-0.4/sequence-files. Overall, there are
+6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.
+
+You can also use Amazon's Elastic MapReduce; see [Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce).
+However, using EC2 directly is slightly less expensive and provides
+greater visibility into the state of running jobs via the JobTracker Web
+UI. You can launch the EC2 cluster from your development machine; the
+following instructions were generated on an Ubuntu workstation. We assume that
+you have successfully completed the Amazon EC2 Getting Started Guide; see the [EC2 Getting Started Guide](http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/).
+
+Note, this work was supported in part by the Amazon Web Services Apache
+Projects Testing Program.
+
+<a name="UseanExistingHadoopAMI-LaunchHadoopCluster"></a>
+## Launch Hadoop Cluster
+
+<a name="UseanExistingHadoopAMI-GatherAmazonEC2keys/securitycredentials"></a>
+#### Gather Amazon EC2 keys / security credentials
+
+You will need the following:
+* AWS Account ID
+* Access Key ID
+* Secret Access Key
+* X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
+* EC2 Key-Pair (ssh public and private keys) for the US-EAST region.
+
+Please make sure the file permissions are "-rw-------" (e.g. chmod 600
+gsg-keypair.pem). You can create a key-pair for the US-East region using
+the Amazon console. If you are confused about any of these terms, please
+see [Understanding Access Credentials for AWS/EC2](http://alestic.com/2009/11/ec2-credentials).
+
+You should also export the EC2_PRIVATE_KEY and EC2_CERT environment
+variables to point to your AWS Certificate and Private Key files, for
+example:
+
+
+    export EC2_PRIVATE_KEY=$DEV/aws/pk-aws.pem
+    export EC2_CERT=$DEV/aws/cert-aws.pem
+
+
+These are used by the ec2-api-tools command to interact with Amazon Web
+Services.
+
+<a name="UseanExistingHadoopAMI-InstallandConfiguretheAmazonEC2APITools:"></a>
+#### Install and Configure the Amazon EC2 API Tools:
+
+On Ubuntu, you'll need to enable the multiverse repository in /etc/apt/sources.list
+to find the ec2-api-tools:
+
+
+    apt-get update
+    apt-get install ec2-api-tools
+
+
+Once installed, verify you have access to EC2 by executing:
+
+
+    ec2-describe-images -x all | grep hadoop
+
+
+<a name="UseanExistingHadoopAMI-InstallHadoop0.20.2Locally"></a>
+#### Install Hadoop 0.20.2 Locally
+
+You need to install Hadoop locally in order to get access to the EC2
+cluster deployment scripts. We use */mnt/dev* as the base working
+directory because this process was originally conducted on an EC2 instance;
+be sure to replace this path with the correct path for your environment as
+you work through these steps.
+
+
+    sudo mkdir -p /mnt/dev/downloads
+    sudo chown -R ubuntu:ubuntu /mnt/dev
+    cd /mnt/dev/downloads
+    wget http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz \
+      && cd /mnt/dev && tar zxvf downloads/hadoop-0.20.2.tar.gz
+    ln -s hadoop-0.20.2 hadoop 
+
+
+The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other
+approaches to deploying a Hadoop cluster on EC2, such as Cloudera's [CDH3](https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page).
+We chose to use the contrib/ec2 scripts because they are very easy to use
+provided there is an existing Hadoop AMI available.
+
+<a name="UseanExistingHadoopAMI-Edithadoop-ec2-env.sh"></a>
+#### Edit hadoop-ec2-env.sh 
+
+Open hadoop/src/contrib/ec2/bin/hadoop-ec2-env.sh in your editor and set
+the Amazon security variables to match your environment, for example:
+
+
+    AWS_ACCOUNT_ID=####-####-####
+    AWS_ACCESS_KEY_ID=???
+    AWS_SECRET_ACCESS_KEY=???
+    EC2_KEYDIR=/mnt/dev/aws
+    KEY_NAME=gsg-keypair
+    PRIVATE_KEY_PATH=/mnt/dev/aws/gsg-keypair.pem
+
+
+The value of PRIVATE_KEY_PATH should be your EC2 key-pair pem file, such as
+/mnt/dev/aws/gsg-keypair.pem. This key-pair must be created in the US-East
+region.
+
+For Mahout, we recommend the following settings:
+
+
+    HADOOP_VERSION=0.20.2
+    S3_BUCKET=453820947548/bixolabs-public-amis
+    ENABLE_WEB_PORTS=true
+    INSTANCE_TYPE="m1.xlarge"
+
+
+You do not need to worry about changing any variables below the comment
+that reads "The following variables are only used when creating an AMI.".
+
+These settings will create a cluster of EC2 xlarge instances using the
+Hadoop 0.20.2 AMI provided by Bixo Labs.
+
+<a name="UseanExistingHadoopAMI-LaunchHadoopCluster"></a>
+#### Launch Hadoop Cluster
+
+
+    cd $HADOOP_HOME/src/contrib/ec2
+    bin/hadoop-ec2 launch-cluster mahout-clustering 2
+
+
+This will launch 3 xlarge instances (two workers + one for the NameNode aka
+"master"). It may take up to 5 minutes to launch a cluster named
+"mahout-clustering"; watch the console for errors. The cluster will launch
+in the US-East region so you won't incur any data transfer fees to/from
+US-Standard S3 buckets. You can re-use the cluster name for launching other
+clusters of different sizes. Behind the scenes, the Hadoop scripts will
+create two EC2 security groups that configure the firewall for accessing
+your Hadoop cluster.
+
+<a name="UseanExistingHadoopAMI-Launchproxy"></a>
+#### Launch proxy
+
+Assuming your cluster launched successfully, establish a SOCKS tunnel to
+your master node to access the JobTracker Web UI from your local browser.
+
+
+    bin/hadoop-ec2 proxy mahout-clustering &
+
+
+This command will output the URLs for the JobTracker and NameNode Web UI,
+such as:
+
+
+    JobTracker http://ec2-???-???-???-???.compute-1.amazonaws.com:50030
+
+
+<a name="UseanExistingHadoopAMI-SetupFoxyProxy(FireFoxplug-in)"></a>
+#### Setup FoxyProxy (FireFox plug-in)
+
+Once the FoxyProxy plug-in is installed in FireFox, go to Options >
+FoxyProxy Standard > Options to set up a proxy on localhost:6666 for the
+JobTracker and NameNode Web UI URLs from the previous step. For more
+information about FoxyProxy, please see: [FoxyProxy](http://getfoxyproxy.org/downloads.html)
+
+Now you are ready to run Mahout jobs in your cluster.
+
+<a name="UseanExistingHadoopAMI-LaunchClusteringJobfromMasterserver"></a>
+## Launch Clustering Job from Master server
+
+<a name="UseanExistingHadoopAMI-Logintothemasterserver:"></a>
+#### Login to the master server:
+
+
+    bin/hadoop-ec2 login mahout-clustering
+
+
+Hadoop does not start until all EC2 instances are running; look for java
+processes on the master server using: ps waux | grep java
+
+<a name="UseanExistingHadoopAMI-InstallMahout"></a>
+#### Install Mahout
+
+Since this is EC2, you have the most disk space on the master node in /mnt.
+
+<a name="UseanExistingHadoopAMI-Fromadistribution"></a>
+##### From a distribution
+
+NOTE: Substitute in the appropriate version number/URLs as necessary.  0.4
+is not the latest version of Mahout.
+
+    mkdir -p /mnt/dev/downloads
+    cd /mnt/dev/downloads
+    wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz \
+      && cd /mnt/dev && tar zxvf downloads/mahout-distribution-0.4.tar.gz
+    ln -s mahout-distribution-0.4 mahout
+
+
+<a name="UseanExistingHadoopAMI-FromSource"></a>
+##### From Source
+
+
+    # Install Subversion (you can also use Git; substitute the appropriate URL)
+    > yum install subversion
+    > svn co http://svn.apache.org/repos/asf/mahout/trunk mahout/trunk
+    # Install Maven 3.x and put it on the path
+    > cd mahout/trunk
+    > mvn install    // Optionally add -DskipTests
+
+
+<a name="UseanExistingHadoopAMI-ConfigureHadoop"></a>
+#### Configure Hadoop
+
+You'll want to increase the Max Heap Size for the data nodes
+(mapred.child.java.opts) and set the correct number of reduce tasks based
+on the size of your cluster. 
+
+
+    vi $HADOOP_HOME/conf/hadoop-site.xml
+
+
+(NOTE: if this file doesn't exist yet, then the cluster nodes are still
+starting up. Wait a few minutes and then try again.)
+
+Add the following properties:
+
+
+    <!-- Change 6 to the correct number for your cluster -->
+    <property>
+      <name>mapred.reduce.tasks</name>
+      <value>6</value>
+    </property>
+    
+    <property>
+      <name>mapred.child.java.opts</name>
+      <value>-Xmx4096m</value>
+    </property>
+
+
+You can safely run 3 reducers per node on EC2 xlarge instances with 4GB of
+max heap each. If you are using large instances, then you may be able to
+have 2 per node or only 1 if your jobs are CPU intensive.
+
+<a name="UseanExistingHadoopAMI-CopythevectorsfromS3toHDFS"></a>
+#### Copy the vectors from S3 to HDFS
+
+Use Hadoop's distcp command to copy the vectors from S3 to HDFS.
+
+
+    hadoop distcp -Dmapred.task.timeout=1800000 \
+      s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
+      /asf-mail-archives/mahout-0.4/tfidf-vectors
+
+
+The files are stored in the US-Standard S3 bucket so there is no charge for
+data transfer to your EC2 cluster, as it is running in the US-EAST region.
+
+<a name="UseanExistingHadoopAMI-Launchtheclusteringjob(fromthemasterserver)"></a>
+#### Launch the clustering job (from the master server)
+
+
+    cd /mnt/dev/mahout
+    bin/mahout kmeans -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
+      -c /asf-mail-archives/mahout-0.4/initial-clusters/ \
+      -o /asf-mail-archives/mahout-0.4/kmeans-clusters/ \
+      --numClusters 100 \
+      --maxIter 10 \
+      --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
+      --convergenceDelta 0.01 &
+
+  
+You can monitor the job using the JobTracker Web UI through FoxyProxy.
+
+<a name="UseanExistingHadoopAMI-DumpClusters"></a>
+#### Dump Clusters
+
+Once completed, you can view the results using Mahout's cluster dumper
+
+
+    bin/mahout clusterdump --seqFileDir /asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
+      --numWords 20 \
+      --dictionary s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0 \
+      --dictionaryType sequencefile --output clusters.txt --substring 100
+

Added: mahout/site/mahout_cms/content/users/recommender/itembased-collaborative-filtering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/itembased-collaborative-filtering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/itembased-collaborative-filtering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/itembased-collaborative-filtering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,122 @@
+Title: Itembased Collaborative Filtering
+Itembased Collaborative Filtering is a popular way of doing Recommendation
+Mining.
+
+<a name="ItembasedCollaborativeFiltering-Terminology"></a>
+### Terminology
+
+We have *users* that interact with *items* (which can be pretty much
+anything, like books, videos, news, or even other users). Those users express
+*preferences* towards the items, which can either be boolean (just modelling
+that a user likes an item) or numeric (a rating value assigned to
+the preference). Typically, only a small number of preferences is known for
+each single user.
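+
+For illustration, a small numeric preference data set might look like the
+following, one preference per line in the userID,itemID,value form expected
+by the jobs described below (the IDs and values here are made up):
+
+    1,101,5.0
+    1,102,3.0
+    2,101,2.0
+    2,103,4.5
+    3,102,4.0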
+
+<a name="ItembasedCollaborativeFiltering-Algorithmicproblems"></a>
+### Algorithmic problems
+
+Collaborative Filtering algorithms aim to solve the *prediction* problem
+where the task is to estimate the preference of a user towards an item
+which he/she has not yet seen.
+
+Once an algorithm can predict preferences it can also be used to do
+*Top-N-Recommendation* where the task is to find the N items a given user
+might like best. This is usually done by isolating a set of candidate
+items, computing the predicted preferences of the given user towards them
+and returning the highest scoring ones.
+
+If we look at the problem from a mathematical perspective, a
+*user-item-matrix* is created from the preference data and the task is to
+predict the missing entries by finding patterns in the known entries.
+
+<a name="ItembasedCollaborativeFiltering-ItembasedCollaborativeFiltering"></a>
+### Itembased Collaborative Filtering
+
+A popular approach called "Itembased Collaborative Filtering" estimates a
+user's preference towards an item by looking at his/her preferences towards
+similar items. Be aware that in this context similarity must be thought of
+as similarity of rating behaviour, not similarity of content.
+
+The standard procedure is to compare the columns of the
+user-item-matrix (the item vectors) pairwise, using a similarity measure like
+Pearson correlation, cosine or loglikelihood, to obtain similar items, and to
+use those together with the user's ratings to predict his/her preference
+towards unknown items.
+
+
+<a name="ItembasedCollaborativeFiltering-Map/Reduceimplementations"></a>
+### Map/Reduce implementations
+
+Mahout offers two Map/Reduce jobs aimed to support Itembased Collaborative
+Filtering.
+
+*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob*
+computes all similar items. It expects a .csv file with the preference data
+as input, where each line represents a single preference in the form
+_userID,itemID,value_ and outputs pairs of itemIDs with their associated
+similarity value.
+
+_job specific options_
+
+<table>
+<tr><td>input</td><td>path to input directory</td></tr>
+<tr><td>output</td><td>path to output directory</td></tr>
+<tr><td>similarityClassname</td><td>Name of distributed similarity class to
+instantiate, alternatively use one of the predefined similarities
+(SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
+SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
+SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
+SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)</td></tr>
+<tr><td>maxSimilaritiesPerItem</td><td>try to cap the number of similar items per item to
+this number</td></tr>
+<tr><td>maxPrefsPerUser</td><td>max number of preferences to consider per user, users with
+more preferences will be sampled down</td></tr>
+<tr><td>minPrefsPerUser</td><td>ignore users with fewer preferences than this</td></tr>
+<tr><td>booleanData</td><td>treat input as having no preference values</td></tr>
+<tr><td>threshold</td><td>discard item pairs with a similarity value below this</td></tr>
+</table>
+
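+For example, given preference lines such as 1,101,5.0, a hypothetical
+invocation could look like the following (paths and the similarity choice
+are placeholders, and this assumes the itemsimilarity shortcut in the
+bin/mahout driver script; alternatively run the class via hadoop jar):
+
+    $MAHOUT_HOME/bin/mahout itemsimilarity \
+      --input /path/to/preferences.csv \
+      --output /path/to/item-similarities \
+      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
+      --maxSimilaritiesPerItem 50
+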
+*org.apache.mahout.cf.taste.hadoop.item.RecommenderJob* is a completely
+distributed itembased recommender. It expects a .csv file with the
+preference data as input, where each line represents a single preference in
+the form _userID,itemID,value_ and outputs userIDs with associated
+recommended itemIDs and their scores.
+
+_job specific options_
+
+<table>
+<tr><td>input</td><td>path to input directory</td></tr>
+<tr><td>output</td><td>path to output directory</td></tr>
+<tr><td>numRecommendations</td><td>number of recommendations per user</td></tr>
+<tr><td>usersFile</td><td>file of users to recommend for</td></tr>
+<tr><td>itemsFile</td><td>file of items to recommend for</td></tr>
+<tr><td>filterFile</td><td>file containing comma-separated userID,itemID pairs. Used to
+exclude the item from the recommendations for that user (optional)</td></tr>
+<tr><td>maxPrefsPerUser</td><td>maximum number of preferences considered per user in final
+recommendation phase</td></tr>
+<tr><td>similarityClassname</td><td>Name of distributed similarity class to
+instantiate, alternatively use one of the predefined similarities
+(SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE,
+SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION,
+SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE,
+SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE)</td></tr>
+<tr><td>maxSimilaritiesPerItem</td><td>try to cap the number of similar items per item to
+this number</td></tr>
+<tr><td>maxPrefsPerUserInItemSimilarity</td><td>max number of preferences to consider per
+user, users with more preferences will be sampled down</td></tr>
+<tr><td>minPrefsPerUser</td><td>ignore users with fewer preferences than this</td></tr>
+<tr><td>booleanData</td><td>treat input as having no preference values</td></tr>
+<tr><td>threshold</td><td>discard item pairs with a similarity value below this</td></tr>
+</table>
+
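+A hypothetical invocation (again with placeholder paths, assuming the
+recommenditembased shortcut in the bin/mahout driver script) might be:
+
+    $MAHOUT_HOME/bin/mahout recommenditembased \
+      --input /path/to/preferences.csv \
+      --output /path/to/recommendations \
+      --similarityClassname SIMILARITY_COOCCURRENCE \
+      --numRecommendations 10
+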
+<a name="ItembasedCollaborativeFiltering-Resources"></a>
+### Resources
+
+* [Sarwar et al.:Item-Based Collaborative Filtering Recommendation Algorithms ](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf)
+* [Slides: Distributed Itembased Collaborative Filtering with Apache Mahout](http://www.slideshare.net/sscdotopen/mahoutcf)

Added: mahout/site/mahout_cms/content/users/recommender/pearsoncorrelation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/pearsoncorrelation.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/pearsoncorrelation.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/pearsoncorrelation.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,24 @@
+Title: PearsonCorrelation
+The Pearson correlation measures the degree to which two series of
+numbers tend to move together -- values in corresponding positions tend to
+be high together, or low together. In particular it measures the strength
+of the linear relationship between the two series, the degree to which one
+can be estimated as a linear function of the other. It is often used in
+collaborative filtering as a similarity metric on users or items; users
+that tend to rate the same items high, or low, have a high Pearson
+correlation and therefore are "similar".
+
+The Pearson correlation can behave very badly when small counts are
+involved.  For example, if you compare any two sequences with two distinct
+values, you get a correlation of 1 (or -1).  To some degree, this problem
+can be avoided by not computing correlations for short sequences (with
+fewer than, say, 10 values).
+
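+To see why, here is a plain-Java sketch of the computation (not Mahout
+code); with only two points per series, the centered vectors always line up
+perfectly, so the correlation is +1 or -1:
+
+    // Pearson correlation of two equal-length series x and y.
+    static double pearson(double[] x, double[] y) {
+      double meanX = 0, meanY = 0;
+      for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
+      meanX /= x.length;
+      meanY /= y.length;
+      double cov = 0, varX = 0, varY = 0;
+      for (int i = 0; i < x.length; i++) {
+        double dx = x[i] - meanX, dy = y[i] - meanY;
+        cov += dx * dy;
+        varX += dx * dx;
+        varY += dy * dy;
+      }
+      return cov / Math.sqrt(varX * varY);
+    }
+
+    // pearson(new double[] {1, 5}, new double[] {2, 3}) == 1.0
+    // pearson(new double[] {1, 5}, new double[] {3, 2}) == -1.0
+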
+Pearson correlation is sometimes used in collaborative filtering to define
+similarity between the ratings of two users on a common set of items.  In
+this application, it is a reasonable measure if there is sufficient
+overlap.  It, unfortunately, is not able to take advantage of the degree of
+overlapping ratings relative to the sets of all ratings.
+
+See Also
+* [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)

Added: mahout/site/mahout_cms/content/users/recommender/recommendationexamples.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/recommendationexamples.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/recommendationexamples.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/recommendationexamples.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,47 @@
+Title: RecommendationExamples
+<a name="RecommendationExamples-Introduction"></a>
+# Introduction 
+
+This quick start page describes how to run the recommendation examples
+provided by Mahout. Mahout comes with four recommendation mining examples.
+They are based on the Netflix, Jester, GroupLens and BookCrossing data
+sets, respectively.
+
+<a name="RecommendationExamples-Steps"></a>
+# Steps 
+
+<a name="RecommendationExamples-Testingitononesinglemachine"></a>
+## Testing it on one single machine 
+
+In the examples directory type: 
+
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.example.bookcrossing.BookCrossingRecommenderEvaluatorRunner" \
+      -Dexec.args="<OPTIONS>"
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.NetflixRecommenderEvaluatorRunner" \
+      -Dexec.args="<OPTIONS>"
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.example.netflix.TransposeToByUser" \
+      -Dexec.args="<OPTIONS>"
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner" \
+      -Dexec.args="<OPTIONS>"
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" \
+      -Dexec.args="<OPTIONS>"
+
+
+Here, the command line options need only be:
+
+
+    -i [input file]
+
+
+
+Note that the GroupLens example is designed for the "1 million" data set,
+available at http://www.grouplens.org/node/73. The "input file" above is
+the ratings.dat file contained in the zip file from that data set. This
+file has an unusual format and so has a special parser. The example code
+here can be easily modified to use a regular FileDataModel and thus work on
+more standard input, including the other data sets available at this site.

Added: mahout/site/mahout_cms/content/users/recommender/recommender-documentation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/recommender-documentation.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/recommender-documentation.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/recommender-documentation.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,394 @@
+Title: Recommender Documentation
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, older, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Mahout recommenders are not just for Java; they can be run as an external
+server which exposes recommendation logic to your application via web
+services and HTTP.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+* DataModel
+* UserSimilarity
+* ItemSimilarity
+* UserNeighborhood
+* Recommender
+
+Subpackages of org.apache.mahout.cf.taste.impl hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! For the academically inclined, Mahout
+supports *memory-based* and *item-based* recommender systems, *slope one*
+recommenders, and a couple of other experimental implementations. It does
+not currently support *model-based* recommenders.
+
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+
+![Taste architecture diagram](https://cwiki.apache.org/confluence/download/attachments/22872433/taste-architecture.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no PreferenceInferrers or Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+GenericUserBasedRecommender or GenericItemBasedRecommender implementation,
+possibly decorated by CachingRecommender.
+
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A DataModel is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Mahout provides MySQLJDBCDataModel, for example, to
+access preference data from a database via JDBC and MySQL. Another exists
+for PostgreSQL. Mahout also provides a FileDataModel.
+
+There are no abstractions for a user or item in the object model (not
+anymore). Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A Preference object or PreferenceArray object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A UserSimilarity defines a notion of similarity between two Users. This is
+a crucial part of a recommendation engine. These are attached to a
+Neighborhood implementation. ItemSimilarities are analogous, but find
+similarity between Items.
+
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A UserNeighborhood
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a UserSimilarity to
+operate.
+
+<a name="RecommenderDocumentation-Requirements"></a>
+## Requirements
+<a name="RecommenderDocumentation-Required"></a>
+### Required
+
+* [Java/ J2SE 6.0](http://www.java.com/getjava/index.jsp)
+
+<a name="RecommenderDocumentation-Optional"></a>
+### Optional
+* [Apache Maven](http://maven.apache.org)
+  2.2.1 or later, if you want to build from source or build examples. (Mac
+users note that even OS X 10.5 ships with Maven 2.0.6, which will not
+work.)
+* Mahout web applications require a [Servlet 2.3+](http://java.sun.com/products/servlet/index.jsp)
+ container, such as [Apache Tomcat](http://jakarta.apache.org/tomcat/). It
+may in fact work with older containers with slight modification.
+
+<a name="RecommenderDocumentation-Demo"></a>
+## Demo
+
+To build and run the demo, follow the instructions below, which are written
+for Unix-like operating systems:
+
+* Obtain a copy of the Mahout distribution, either from SVN or as a
+downloaded archive.
+* Download the "1 Million MovieLens Dataset" from [Grouplens.org](http://www.grouplens.org/)
+* Unpack the archive and copy movies.dat and ratings.dat to
+trunk/integration/src/main/resources/org/apache/mahout/cf/taste/example/grouplens
+under the Mahout distribution directory.
+* Navigate to the directory where you unpacked the Mahout distribution, and
+navigate to trunk.
+* Run mvn -DskipTests install, which builds and installs Mahout core to
+your local repository
+* cd integration
+* You may need to give Maven more memory: in a bash shell, export
+MAVEN_OPTS=-Xmx1024M
+* mvn jetty:run.
+* Get recommendations by accessing the web application in your browser:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1 This
+will produce a simple preference-item ID list which could be consumed by a
+client application. Get more useful human-readable output with the debug
+parameter:
+http://localhost:8080/mahout-integration/RecommenderServlet?userID=1&debug=true
+
+
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender system. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus best suited to small data sets (roughly, fewer than ten million
+ratings). We'll start with an example of this.
+
+First, create a DataModel of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the PearsonCorrelationSimilarity implementation of UserSimilarity
+as our user correlation algorithm, and add an optional preference inference
+algorithm:
+
+
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+    // Optional:
+    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer());
+
+
+Now we create a UserNeighborhood algorithm. Here we use nearest-3:
+
+
+    UserNeighborhood neighborhood =
+          new NearestNUserNeighborhood(3, userSimilarity, model);
+
+
+Now we can create our Recommender, and add a caching decorator:
+
+
+    Recommender recommender =
+          new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+
+
+    List<RecommendedItem> recommendations =
+          cachingRecommender.recommend(1234, 10);
+
+
+<a name="RecommenderDocumentation-Item-basedRecommender"></a>
+### Item-based Recommender
+
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendations not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However, the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.
+
+Let's start over, again with a FileDataModel:
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll also need an ItemSimilarity. We could use
+PearsonCorrelationSimilarity, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+GenericItemSimilarity:
+
+
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...;
+    ItemSimilarity itemSimilarity =
+          new GenericItemSimilarity(correlations);
+
+
+Then we can finish as before to produce recommendations:
+
+
+    Recommender recommender =
+          new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    ...
+    List<RecommendedItem> recommendations =
+          cachingRecommender.recommend(1234, 10);
+
+
+<a name="RecommenderDocumentation-Slope-OneRecommender"></a>
+### Slope-One Recommender
+
+This is a simple yet effective Recommender and we present another example
+to round out the list:
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+    // Make a weighted slope one recommender
+    Recommender recommender = new SlopeOneRecommender(model);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+<a name="RecommenderDocumentation-Direct"></a>
+### Direct
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentation-Standaloneserver"></a>
+### Standalone server
+A Mahout recommender can also be run as an external server, which may be
+the only option for non-Java applications. It can be exposed as a web
+application via org.apache.mahout.cf.taste.web.RecommenderServlet, and your
+application can then access recommendations via simple HTTP requests and
+responses. See above, and see the javadoc for details.
+
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
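+Put together, a hypothetical launch command on a multi-processor machine
+(the jar and main class are placeholders) might look like:
+
+    java -server -Xms1024m -Xmx1024m -da -dsa -XX:NewRatio=9 \
+      -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+DisableExplicitGC \
+      -cp my-recommender-app.jar com.example.RecommenderApp
+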
+Also consider the following tips:
+
+* Use CachingRecommender on top of your custom Recommender implementation.
+* When using JDBCDataModel, make sure you've taken basic steps to optimize
+the table storing preference data. Create a primary key on the user ID and
+item ID columns, and an index on them. Set them to be non-null. And so on.
+Tune your database for lots of concurrent reads! When using JDBC, the
+database is almost always the bottleneck. Plenty of memory and caching are
+even more important.
+* Also, pooling database connections is essential to performance. If using
+a J2EE container, it probably provides a way to configure connection pools.
+If you are creating your own DataSource directly, try wrapping it in
+org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource
+* See MySQL-specific notes on performance in the javadoc for
+MySQLJDBCDataModel.
+
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in org.apache.mahout.cf.taste.eval:
+
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+    	  new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try this kind of evaluation, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful:
+
+
+    ...
+    RecommenderIRStatsEvaluator evaluator =
+        new GenericRecommenderIRStatsEvaluator();
+    IRStatistics stats =
+        evaluator.evaluate(builder, null, myModel, null, 3,
+            RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
+
+
+
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+You'll want to look at these packages too, which offer more algorithms and
+approaches that you may find useful:
+
+* [Cofi](http://www.nongnu.org/cofi/)
+: A Java-Based Collaborative Filtering Library
+* [CoFE](http://eecs.oregonstate.edu/iis/CoFE/)
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainity in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
+
+Clifford Lyon, "[Movie Recommender](http://materialobjects.com/cf/MovieRecommender.pdf)
+" CSCI E-280 final project, Harvard University, 2004.
+
+Daniel Lemire, Anna Maclachlan, "[Slope One Predictors for Online Rating-Based Collaborative Filtering](http://www.daniel-lemire.com/fr/abstracts/SDM2005.html)
+," Proceedings of SIAM Data Mining (SDM '05), 2005.
+
+Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath, "[RACOFI: A Rule-Applying Collaborative Filtering System](http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf)
+"," Proceedings of COLA '03, 2003.
+
+These links will take you to all the collaborative filtering reading you
+could ever want!
+* [Paul Perry's notes](http://www.paulperry.net/notes/cf.asp)
+* [James Thornton's collaborative filtering resources](http://jamesthornton.com/cf/)
+* [Daniel Lemire's blog](http://www.daniel-lemire.com/blog/)
+ which frequently covers collaborative filtering topics

Added: mahout/site/mahout_cms/content/users/recommender/recommender-first-timer-faq.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/recommender-first-timer-faq.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/recommender-first-timer-faq.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/recommender-first-timer-faq.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,50 @@
+Title: Recommender First-Timer FAQ
+Many people with an interest in recommenders arrive at Mahout since they're
+building a first recommender system. Some starting questions have been
+asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
+newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/)
+.
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible.
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed solution makes sense when nearing this scale.
+However most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
+
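+A minimal sketch of that rating-based starting point (the file name and
+user ID are hypothetical; for the boolean case, swap in the classes named
+above):
+
+    DataModel model = new FileDataModel(new File("ratings.csv"));
+    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
+    Recommender recommender = new GenericItemBasedRecommender(model, similarity);
+    // Top 10 recommendations for a given user
+    List<RecommendedItem> topTen = recommender.recommend(1234L, 10);
+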
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put on
+top a ReloadFromJDBCDataModel.
+
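+For the database route, a minimal sketch, assuming a pooled
+javax.sql.DataSource is already configured, could look like:
+
+    DataSource dataSource = ...; // obtained from your container or connection pool
+    JDBCDataModel jdbcModel = new MySQLJDBCDataModel(dataSource);
+    // Cache the JDBC-backed data in memory and refresh it periodically
+    DataModel model = new ReloadFromJDBCDataModel(jdbcModel);
+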
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically -- perhaps on the order of minutes. If that's not OK,
+you'll have to look into some more specialized work -- SlopeOneRecommender
+deals with updates quickly, or, it is possible to do some work to update
+the GenericDataModel in real time. 

Added: mahout/site/mahout_cms/content/users/recommender/tastecommandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/recommender/tastecommandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/recommender/tastecommandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/recommender/tastecommandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,41 @@
+Title: TasteCommandLine
+<a name="TasteCommandLine-Introduction"></a>
+# Introduction 
+
+This quick start page describes how to run the Hadoop-based recommendation
+jobs of Mahout Taste on a Hadoop cluster. 
+
+<a name="TasteCommandLine-Steps"></a>
+# Steps 
+
+<a name="TasteCommandLine-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster 
+
+In the examples directory type, for example: 
+
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob" \
+      -Dexec.args="<OPTIONS>"
+
+
+<a name="TasteCommandLine-Runningitonthecluster"></a>
+## Running it on the cluster 
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.jar
+* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata 
+* Run the Job: $HADOOP_HOME/bin/hadoop jar
+$MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job
+org.apache.mahout.cf.taste.hadoop.<JOB> <OPTIONS> 
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs. 
+
+<a name="TasteCommandLine-Commandlineoptions"></a>
+# Command line options 
+
+Specify only the command line option "--help" for a complete summary of
+available command line options. Or, refer to the javadoc for the "Job"
+class being run.