You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mahout.apache.org by sr...@apache.org on 2012/07/12 11:26:03 UTC

svn commit: r1360593 [14/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...

Added: mahout/site/trunk/content/mahoutname.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mahoutname.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mahoutname.mdtext (added)
+++ mahout/site/trunk/content/mahoutname.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,39 @@
+Title: MahoutName
+* [What's in a Name?](#MahoutName-What'sinaName?)
+* [Pronunciation](#MahoutName-Pronunciation)
+* [History](#MahoutName-History)
+
+<a name="MahoutName-What'sinaName?"></a>
+# What's in a Name?
+
+A mahout is a keeper/driver of elephants ([http://en.wikipedia.org/wiki/Mahout](http://en.wikipedia.org/wiki/Mahout)).
+Since many of Mahout's algorithms are implemented in MapReduce on
+Hadoop, we thought it appropriate to come up with a name that was:
+
+1. Related to Hadoop
+2. Easily findable on the web since it is a relatively uncommon word in
+US/Europe circles
+
+Prior to coming to the ASF, those of us working on the project plan voted between Howdah ([http://en.wikipedia.org/wiki/Howdah](http://en.wikipedia.org/wiki/Howdah) -- the carriage on top of an elephant) and Mahout.
+
+<a name="MahoutName-Pronunciation"></a>
+# Pronunciation
+
+There is some disagreement about how to pronounce the name. Webster's has
+it as muh-hout (as in "out" --
+http://dictionary.reference.com/browse/mahout), but the Sanskrit/Hindi
+origin points to "muh-hoot".  The second pronunciation suggests a
+nice pun on the Hebrew word מהות, meaning "essence or truth".
+
+<a name="MahoutName-History"></a>
+# History
+
+Mahout was started by [Isabel Drost, Grant Ingersoll and Karl Wettin](http://web.archive.org/web/20071228055210/http://ml-site.grantingersoll.com/index.php?title=Main_Page).
+It [started](http://web.archive.org/web/20080201093120/http://lucene.apache.org/#22+January+2008+-+Lucene+PMC+Approves+Mahout+Machine+Learning+Project)
+as part of the [Lucene](http://lucene.apache.org) project (see the
+[original proposal](http://web.archive.org/web/20080102151102/http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal))
+and went on to become a top level project in April of 2010.
+
+The original goal was to implement all 10 algorithms from Andrew Ng's paper titled "[Map-Reduce for Machine Learning on Multicore](http://www.google.com/url?sa=t&source=web&cd=1&ved=0CB8QFjAA&url=http%3A%2F%2Fwww.cs.stanford.edu%2Fpeople%2Fang%2Fpapers%2Fnips06-mapreducemulticore.pdf&ei=iaR8TvKYK_DTiALCq7GODg&usg=AFQjCNFaW8ZuT6xuAz61ZaoKaQ7mpmIv2w&sig2=KVaGbhPFI3rKgjtxg4yIjg)".

Added: mahout/site/trunk/content/mailing-lists,-irc-and-archives.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mailing-lists%2C-irc-and-archives.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mailing-lists,-irc-and-archives.mdtext (added)
+++ mahout/site/trunk/content/mailing-lists,-irc-and-archives.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,88 @@
+Title: Mailing Lists, IRC and Archives
+   * [Mailing lists](#MailingLists,IRCandArchives-Mailinglists)
+      * [Mahout User List](#MailingLists,IRCandArchives-MahoutUserList)
+      * [Mahout Developer List](#MailingLists,IRCandArchives-MahoutDeveloperList)
+   * [IRC](#MailingLists,IRCandArchives-IRC)
+   * [Archives](#MailingLists,IRCandArchives-Archives)
+      * [Official Apache Archive](#MailingLists,IRCandArchives-OfficialApacheArchive)
+      * [External Archives](#MailingLists,IRCandArchives-ExternalArchives)
+
+Communication at Mahout happens primarily online via mailing lists. We have
+a user as well as a dev list for discussion. In addition there is a commit
+list so we are able to monitor what happens on the wiki and in svn.
+
+<a name="MailingLists,IRCandArchives-Mailinglists"></a>
+## Mailing lists
+
+<a name="MailingLists,IRCandArchives-MahoutUserList"></a>
+### Mahout User List
+
+This list is for users of Mahout to ask questions, share knowledge, and
+discuss issues. Do send mail to this list with usage and configuration
+questions and problems. Also, please send questions to this list to verify
+your problem before filing issues in JIRA. 
+
+* [Subscribe](mailto:mahout-user-subscribe@apache.org)
+* [Unsubscribe](mailto:mahout-user-unsubscribe@apache.org)
+
+<a name="MailingLists,IRCandArchives-MahoutDeveloperList"></a>
+### Mahout Developer List
+
+This is the list where participating developers of the Mahout project meet
+and discuss issues concerning Mahout internals, code changes/additions,
+etc. Do not send mail to this list with usage questions or configuration
+questions and problems. 
+
+Discussion list: 
+
+* [Subscribe](mailto:mahout-dev-subscribe@apache.org)
+ -- Do not send mail to this list with usage questions or configuration
+questions and problems. 
+* [Unsubscribe](mailto:mahout-dev-unsubscribe@apache.org)
+
+Commit notifications: 
+
+* [Subscribe](mailto:mahout-commits-subscribe@apache.org)
+* [Unsubscribe](mailto:mahout-commits-unsubscribe@apache.org)
+
+<a name="MailingLists,IRCandArchives-IRC"></a>
+## IRC
+
+Mahout's IRC channel is #mahout.  It is a logged channel.  Please keep in
+mind that it is for discussion purposes only and that (pseudo)decisions
+should be brought back to the dev@ mailing list or JIRA and other people
+who are not on IRC should be given time to respond before any work is
+committed.
+
+<a name="MailingLists,IRCandArchives-Archives"></a>
+## Archives
+
+<a name="MailingLists,IRCandArchives-OfficialApacheArchive"></a>
+### Official Apache Archive
+
+* [http://mail-archives.apache.org/mod_mbox/mahout-dev/](http://mail-archives.apache.org/mod_mbox/mahout-dev/)
+* [http://mail-archives.apache.org/mod_mbox/mahout-user/](http://mail-archives.apache.org/mod_mbox/mahout-user/)
+
+* [Mbox Archive](http://mahout.apache.org/mail/)
+
+Archives previous to becoming Apache top level project:
+
+* [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/)
+* [http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/](http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/)
+
+* [Mbox Archive](http://lucene.apache.org/mail/)
+
+<a name="MailingLists,IRCandArchives-ExternalArchives"></a>
+### External Archives
+
+* [http://www.lucidimagination.com/search](http://www.lucidimagination.com/search)
+ - Search the entire Lucene ecosystem, including Mahout (archives, JIRA,
+etc.)  Powered by Solr/Lucene.
+* [MarkMail](http://mahout.markmail.org/)
+* [Nabble](http://www.nabble.com/Apache-Mahout-f32040.html)
+* [Gmane](http://dir.gmane.org/gmane.comp.apache.mahout.user)
+
+Please note that the inclusion of a link to an archive does not imply an
+endorsement of that company by any of the committers of Mahout, the Lucene
+PMC, or the Apache Software Foundation. Each archive owner is solely
+responsible for the contents and availability of their archive.

Added: mahout/site/trunk/content/matrix-and-vector-needs.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/matrix-and-vector-needs.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/matrix-and-vector-needs.mdtext (added)
+++ mahout/site/trunk/content/matrix-and-vector-needs.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,82 @@
+Title: Matrix and Vector Needs
+<a name="MatrixandVectorNeeds-Intro"></a>
+# Intro
+
+Most ML algorithms require the ability to represent multidimensional data
+concisely and to be able to easily perform common operations on that data.
+MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
+along with a set of common operations on their instances. Vectors and
+matrices are provided with sparse and dense implementations that are memory
+resident and are suitable for manipulating intermediate results within
+mapper, combiner and reducer implementations. They are not intended for
+applications requiring vectors or matrices that exceed the size of a single
+JVM, though such applications might be able to utilize them within a larger
+organizing framework.
+
+<a name="MatrixandVectorNeeds-Background"></a>
+## Background
+
+See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)
+
+<a name="MatrixandVectorNeeds-Vectors"></a>
+## Vectors
+
+Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements
+vectors as a double[] that is storage and access efficient. The class SparseVector implements
+vectors as a HashMap<Integer, Double> that is surprisingly fast and
+efficient. For sparse vectors, the size() method returns the current number
+of elements whereas the cardinality() method returns the number of
+dimensions it holds. An additional VectorView class allows views of an
+underlying vector to be specified by the viewPart() method. See the
+JavaDocs for more complete definitions.
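+
+As a rough, library-free illustration of the two storage schemes just
+described (this is only a sketch, not Mahout's actual DenseVector or
+SparseVector classes), a dense vector can be held in a double[] and a sparse
+vector in a HashMap<Integer, Double>, with a dot product that visits only the
+sparse entries:
+
+    import java.util.HashMap;
+    import java.util.Map;
+
+    public class VectorSketch {
+
+      // Dot product of a dense vector with a sparse vector: iterate only over
+      // the non-zero entries of the sparse operand.
+      static double dot(double[] dense, Map<Integer, Double> sparse) {
+        double sum = 0.0;
+        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
+          sum += dense[e.getKey()] * e.getValue();
+        }
+        return sum;
+      }
+
+      public static void main(String[] args) {
+        double[] dense = {1.0, 2.0, 0.0, 4.0};           // cardinality 4
+        Map<Integer, Double> sparse = new HashMap<Integer, Double>();
+        sparse.put(1, 3.0);                              // cardinality 4, size 2
+        sparse.put(3, 0.5);
+        System.out.println(dot(dense, sparse));          // 2.0*3.0 + 4.0*0.5 = 8.0
+      }
+    }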
+
+<a name="MatrixandVectorNeeds-Matrices"></a>
+## Matrices
+
+Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix
+implements matrices as a double[][] that is storage and access efficient. The class SparseRowMatrix
+implements matrices as a Vector[] holding the rows of the matrix in a
+SparseVector, and the symmetric class SparseColumnMatrix implements
+matrices as a Vector[] holding the columns in a SparseVector. Each of these
+classes can quickly produce a given row or column, respectively. A fourth
+class, SparseMatrix, uses a HashMap<Integer, Vector> which is also a
+SparseVector. For sparse matrices, the size() method returns an int[2]
+containing the actual row and column sizes whereas the cardinality() method
+returns an int[2] with the number of dimensions of each. An additional
+MatrixView class allows views of an underlying matrix to be specified by
+the viewPart() method. See the JavaDocs for more complete definitions.
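+
+In the same library-free spirit (again only a sketch, not the Mahout classes
+themselves), a row-oriented sparse matrix can be held as a list of sparse
+rows, and a matrix-times-vector product then visits only the non-zero
+entries of each row:
+
+    import java.util.ArrayList;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Map;
+
+    public class SparseRowMatrixSketch {
+
+      // result[r] = sum over non-zero columns c of row r: row[c] * v[c]
+      static double[] times(List<Map<Integer, Double>> rows, double[] v) {
+        double[] result = new double[rows.size()];
+        for (int r = 0; r < rows.size(); r++) {
+          double sum = 0.0;
+          for (Map.Entry<Integer, Double> e : rows.get(r).entrySet()) {
+            sum += e.getValue() * v[e.getKey()];
+          }
+          result[r] = sum;
+        }
+        return result;
+      }
+
+      public static void main(String[] args) {
+        // A 2 x 3 matrix stored by row; only non-zero cells are kept.
+        List<Map<Integer, Double>> rows = new ArrayList<Map<Integer, Double>>();
+        rows.add(new HashMap<Integer, Double>());
+        rows.add(new HashMap<Integer, Double>());
+        rows.get(0).put(0, 1.0);
+        rows.get(0).put(2, 2.0);
+        rows.get(1).put(1, 3.0);
+        double[] product = times(rows, new double[] {1.0, 1.0, 1.0});
+        System.out.println(product[0] + " " + product[1]);   // prints 3.0 3.0
+      }
+    }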
+
+The Matrix interface does not currently provide invert or determinant
+methods, though these are desirable. It is arguable that the
+implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
+HashMap<Integer, Vector> implementations and that SparseMatrix should
+instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
+sparse matrices can also be envisioned that support different storage and
+access characteristics. Because the arguments of assignColumn and assignRow
+operations accept all forms of Vector, it is possible to construct
+instances of sparse matrices containing dense rows or columns. See the
+JavaDocs for more complete definitions.
+
+For applications like PageRank/TextRank, iterative approaches to calculate
+eigenvectors would also be useful. Batching of row/column operations, such
+as assignRow or assignColumn accepting UnaryFunction and BinaryFunction
+arguments, would be useful as well.
+
+
+<a name="MatrixandVectorNeeds-Ideas"></a>
+## Ideas
+
+As Vector and Matrix implementations are currently memory-resident, very
+large instances greater than available memory are not supported. An
+extended set of implementations that use HBase (BigTable) in Hadoop to
+represent their instances would facilitate applications requiring such
+large collections.  
+See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6)
+See [Hama](http://wiki.apache.org/hadoop/Hama)
+
+
+<a name="MatrixandVectorNeeds-References"></a>
+## References
+
+Have a look at the old parallel computing libraries like [ScaLAPACK](http://www.netlib.org/scalapack/)
+and others.

Added: mahout/site/trunk/content/mean-shift-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mean-shift-clustering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mean-shift-clustering.mdtext (added)
+++ mahout/site/trunk/content/mean-shift-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,160 @@
+Title: Mean Shift Clustering
+"Mean Shift: A Robust Approach to Feature Space Analysis"
+(http://www.caip.rutgers.edu/riul/research/papers/pdf/mnshft.pdf)
+introduces the geneology of the mean shift custering procedure which dates
+back to work in pattern recognition in 1975. The paper contains a detailed
+derivation and several examples of the use of mean shift for image smooting
+and segmentation. "Mean Shift Clustering"
+(http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf)
+presents an overview of the algorithm with a summary of the derivation. An
+attractive feature of mean shift clustering is that it does not require
+a-priori knowledge of the number of clusters (as required in k-means) and
+it will produce arbitrarily-shaped clusters that depend upon the topology
+of the data (unlike canopy).
+
+The algorithm begins with a set of datapoints and creates a fixed-radius
+window for each. It then iterates over each window, calculating a mean
+shift vector which points in the direction of the maximum increase in the
+local density function. Then each window is migrated to a new position via
+the vector and the iteration resumes. Iterations complete when each window
+has reached a local maximum in the density function and the vector becomes
+negligible.
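+
+To make the window-update step concrete, here is a minimal, library-free
+sketch of mean shift on one-dimensional points with a flat (uniform) kernel.
+It is purely illustrative and is not the canopy-based Mahout implementation
+described below:
+
+    // Each window center moves to the mean of all points within the window
+    // radius; iterating until the shifts become negligible finds the local
+    // density maxima.
+    public class MeanShiftStep {
+
+      static double[] shiftOnce(double[] centers, double[] points, double radius) {
+        double[] shifted = new double[centers.length];
+        for (int i = 0; i < centers.length; i++) {
+          double sum = 0.0;
+          int count = 0;
+          for (double p : points) {
+            if (Math.abs(p - centers[i]) <= radius) {
+              sum += p;
+              count++;
+            }
+          }
+          // If no points fall inside the window, the center stays put.
+          shifted[i] = count > 0 ? sum / count : centers[i];
+        }
+        return shifted;
+      }
+
+      public static void main(String[] args) {
+        double[] points = {0.0, 0.2, 0.3, 5.0, 5.1};
+        double[] centers = points.clone();   // start one window at every point
+        for (int iter = 0; iter < 20; iter++) {
+          centers = shiftOnce(centers, points, 1.0);
+        }
+        // Centers converge near the two density maxima (about 0.17 and 5.05).
+        for (double c : centers) {
+          System.out.println(c);
+        }
+      }
+    }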
+
+<a name="MeanShiftClustering-ReferenceImplementation"></a>
+## Reference Implementation
+
+The implementation introduced by MAHOUT-15 uses modified Canopy Clustering
+canopies to represent the mean shift windows. 
+* It uses the canopy's T1 distance threshold as the radius of the window,
+and the canopy's T2 threshold to decide when two canopies have converged
+and will thereby follow the same path. 
+* Each canopy contains one or more bound points which are represented using
+the (internally specified) integer ids of the bound points. 
+* The algorithm is initialized with a canopy containing each input point. 
+* During each iteration, every canopy calculates its mean shift vector by
+summing the canopy centers within its T1 threshold. This value is
+normalized and the resulting centroid becomes the new canopy center.
+  * The centers are weighted in proportion to their numbers of bound points
+(weighted pair-group centroid).
+* If any canopies are within their T2 thresholds they are merged and their
+respective bound points are accumulated. 
+* The iterations complete when each canopy's mean shift vector has a
+magnitude less than a given termination delta. 
+* Upon termination, the remaining canopies contain sets of points which are
+the members of their cluster.
+
+<a name="MeanShiftClustering-Map/ReduceImplementation"></a>
+## Map/Reduce Implementation
+
+* Each mapper receives a subset of the canopies for each iteration. It
+compares each canopy with each one it has already seen and performs the T1
+and T2 distance tests using an arbitrary user-supplied DistanceMeasure. The
+mapper merges canopies within T2 distance, moves each canopy to its new
+centroid position and outputs the canopy to the reducer with a constant key.
+* A single reducer coalesces all the canopies from the combiners by
+performing another clustering iteration on them.
+* A driver class manages the iteration and determines when either the
+maximum number of iterations occurs or the termination criterion is reached.
+
+
+<a name="MeanShiftClustering-RunningMeanShiftClustering"></a>
+## Running Mean Shift Clustering
+
+The Mean Shift clustering algorithm may be run using a command-line
+invocation on MeanShiftCanopyDriver.main or by making a Java call to
+MeanShiftCanopyDriver.run(). 
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout meanshift \
+        -i <input vectors directory> \
+        -o <output working directory> \
+        -inputIsCanopies <input directory contains mean shift canopies not vectors> \
+        -dm <DistanceMeasure> \
+        -t1 <the T1 threshold> \
+        -t2 <the T2 threshold> \
+        -x <maximum number of iterations> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present> \
+        -cl <run input vector clustering after computing Clusters> \
+        -xm <execution method: sequential or mapreduce>
+
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.
+1. t1: the T1 threshold is used to determine if clusters are close enough to
+influence each other's next mean calculation.
+1. t2: the T2 threshold is used to determine when two clusters are close
+enough to merge.
+1. convergence: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. max-iterations: the maximum number of iterations to run, independent of
+the convergence specified
+1. inputIsClusters: a boolean indicating, if true, that the input directory
+already contains MeanShiftCanopies and no further initialization is needed.
+If false (the default) input VectorWritables are used to form the initial
+canopies and these will be written to the clusters-0 directory.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the clustering is to
+be done using the sequential reference implementation in memory.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy)
+produced by the algorithm for each iteration. The Text _key_ is a cluster
+identifier string.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster. As Mean Shift
+only produces a single clustering for each point, the weights are all == 1.
+
+<a name="MeanShiftClustering-Examples"></a>
+# Examples
+
+The following images illustrate Mean Shift clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file at [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0], sd=3.0
+* 300 samples m=[1.0, 0.0], sd=0.5
+* 300 samples m=[0.0, 2.0], sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. 
+
+![SampleData](attachments/81503/23527482.png)
+
+In the second image, the resulting clusters (k=3) are shown superimposed
+upon the sample data. In this image, each cluster renders in a different
+color and the T1 and T2 radii are superimposed upon the final cluster
+centers determined by the algorithm. Mean Shift does an excellent job of
+clustering this data, though by its design the cluster membership is unique
+and the clusters do not overlap. 
+
+![MeanShift](attachments/81503/23527484.png)
+
+The third image shows the results of running Mean Shift on a different data
+set (see [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ for details) which is generated using asymmetrical standard deviations.
+Mean Shift does an excellent job of clustering this data set too.
+
+![2dMeanShift](attachments/81503/23527483.png)

Added: mahout/site/trunk/content/mean-shift-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mean-shift-commandline.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mean-shift-commandline.mdtext (added)
+++ mahout/site/trunk/content/mean-shift-commandline.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,74 @@
+Title: mean-shift-commandline
+<a name="mean-shift-commandline-RunningMeanShiftCanopyClusteringfromtheCommandLine"></a>
+# Running Mean Shift Canopy Clustering from the Command Line
+Mahout's Mean Shift clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Mean Shift on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout meanshift <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="mean-shift-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout meanshift -i testdata <OTHER OPTIONS>
+
+
+<a name="mean-shift-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout meanshift -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="mean-shift-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                           Path to job input directory.
+                                                   Must be a SequenceFile of
+                                                   VectorWritable
+      --output (-o) output                         The directory pathname for
+                                                   output.
+      --overwrite (-ow)                            If present, overwrite the
+                                                   output directory before
+                                                   running the job
+      --distanceMeasure (-dm) distanceMeasure      The classname of the
+                                                   DistanceMeasure. Default is
+                                                   SquaredEuclidean
+      --help (-h)                                  Print out help
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta value.
+                                                   Default is 0.5
+      --t1 (-t1) t1                                T1 threshold value
+      --t2 (-t2) t2                                T2 threshold value
+      --clustering (-cl)                           If present, run clustering
+                                                   after the iterations have
+                                                   taken place
+      --maxIter (-x) maxIter                       The maximum number of
+                                                   iterations
+      --inputIsCanopies (-ic) inputIsCanopies      If present, the input
+                                                   directory already contains
+                                                   MeanShiftCanopies
+

Added: mahout/site/trunk/content/minhash-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/minhash-clustering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/minhash-clustering.mdtext (added)
+++ mahout/site/trunk/content/minhash-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,128 @@
+Title: Minhash Clustering
+Minhash clustering performs probabilistic dimension reduction of high
+dimensional data. The essence of the technique is to hash each item using
+multiple independent hash functions such that the probability of collision
+of similar items is higher. Multiple such hash tables can then be
+constructed to answer near neighbor types of queries efficiently.
+
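+As a rough, library-free sketch of this idea (not Mahout's MinHashDriver),
+each item (here a set of integer feature ids) gets a signature of k minimum
+hash values; sets that overlap heavily tend to agree on many signature
+entries:
+
+    import java.util.Arrays;
+    import java.util.HashSet;
+    import java.util.Random;
+    import java.util.Set;
+
+    public class MinHashSketch {
+
+      // Hash every element with k independent hash functions and keep the
+      // minimum value per function. The fraction of matching signature entries
+      // between two sets approximates their Jaccard similarity.
+      static long[] signature(Set<Integer> items, long[] seedA, long[] seedB) {
+        long[] sig = new long[seedA.length];
+        Arrays.fill(sig, Long.MAX_VALUE);
+        for (int item : items) {
+          for (int h = 0; h < seedA.length; h++) {
+            // Simple linear hash for illustration; a real implementation would
+            // use a stronger hash family.
+            long hash = (seedA[h] * item + seedB[h]) & Long.MAX_VALUE;
+            sig[h] = Math.min(sig[h], hash);
+          }
+        }
+        return sig;
+      }
+
+      public static void main(String[] args) {
+        int k = 64;
+        Random rnd = new Random(42);
+        long[] seedA = new long[k];
+        long[] seedB = new long[k];
+        for (int h = 0; h < k; h++) {
+          seedA[h] = rnd.nextLong() | 1L;     // odd multiplier
+          seedB[h] = rnd.nextLong();
+        }
+        Set<Integer> doc1 = new HashSet<Integer>(Arrays.asList(1, 2, 3, 4, 5));
+        Set<Integer> doc2 = new HashSet<Integer>(Arrays.asList(3, 4, 5, 6, 7));
+        long[] s1 = signature(doc1, seedA, seedB);
+        long[] s2 = signature(doc2, seedA, seedB);
+        int matches = 0;
+        for (int h = 0; h < k; h++) {
+          if (s1[h] == s2[h]) {
+            matches++;
+          }
+        }
+        // Roughly the Jaccard similarity of the two sets (3/7 here).
+        System.out.println((double) matches / k);
+      }
+    }
+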
+There is a MinHashDriver class which is exercised by the TestMinHashClustering
+unit test. It is not included in the standard driver.props file, but it
+can be run by specifying the full package name.
+
+<a name="MinhashClustering-RunningMinHashDriverontheReuters-21578Collection"></a>
+#### Running MinHashDriver on the Reuters-21578 Collection
+
+There are two ways of doing this:
+
+<a name="MinhashClustering-Runcluster-reuters.sh"></a>
+##### Run cluster-reuters.sh
+
+1. Run $MAHOUT_HOME/examples/bin/cluster-reuters.sh  (trunk only)
+1. Select the Minhash algorithm when prompted.
+
+<a name="MinhashClustering-StepByStep"></a>
+##### Step By Step
+
+###### 1. Download the Reuters-21578 Dataset from [http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz](http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz)
+and extract it under the /examples/reuters folder.
+
+
+The Reuters-21578 collection has about 21,578 documents in SGML
+format. These need to be converted to text files to subsequently
+generate the SequenceFiles and SparseVectors.
+
+To convert the SGML files to Text, we invoke the ExtractReuters utility
+that comes with Lucene. This creates text files from SGML containing -
+Title, Date, Body.
+
+###### 2. Run the Reuters extraction code from the examples directory as follows:
+
+    mvn -e -q exec:java \
+        -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
+        -Dexec.args="reuters/ reuters-extracted/"
+
+<a name="MinhashClustering-3.CreateSequenceFilesfromtheconvertedReutersTextfiles"></a>
+###### 3. Create SequenceFiles from the converted Reuters Text files
+
+    bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
+
+This will write the Reuters documents into Sequence files.
+
+
+<a name="MinhashClustering-4.CreateSparseVectorsfromtheSequenceFiles"></a>
+###### 4. Create SparseVectors from the SequenceFiles
+
+
+    bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow -ng 1
+
+The -ow flag denotes whether or not to overwrite the output folder.
+
+The -ng flag is the maximum size of NGrams to be selected from the collection
+of documents.
+
+<a name="MinhashClustering-5.RuntheMinHashDriveronthegeneratedSparseVectors"></a>
+###### 5. Run the MinHashDriver on the generated SparseVectors
+
+    bin/mahout org.apache.mahout.clustering.minhash.MinHashDriver \
+        --input reuters-vectors/tfidf-vectors/ -o /minhash
+
+The resulting output in /minhash/part-r-00000 would look something like this:
+
+    97618498-357680743    /reut2-006.sgm-25.txt
+    97618498-357680743    /reut2-007.sgm-660.txt
+    97618498-61898030     /reut2-015.sgm-697.txt
+    97618498-61898030     /reut2-014.sgm-99.txt
+    97618498-61898030     /reut2-009.sgm-705.txt
+    97618498-61898030     /reut2-000.sgm-495.txt
+    97618498-61898030     /reut2-009.sgm-732.txt
+    97618498-61898030     /reut2-010.sgm-473.txt
+    97618498-61898030     /reut2-000.sgm-15.txt
+    97618498-61898030     /reut2-009.sgm-872.txt
+    97618498-61898030     /reut2-010.sgm-547.txt
+    97618498-61898030     /reut2-006.sgm-366.txt
+    97618498-61898030     /reut2-002.sgm-53.txt
+    97618498-61898030     /reut2-000.sgm-569.txt
+    97618498-61898030     /reut2-019.sgm-366.txt
+    97618498-61898030     /reut2-003.sgm-540.txt
+    97618498-61898030     /reut2-019.sgm-154.txt
+    97618498-61898030     /reut2-004.sgm-372.txt
+    97618498-61898030     /reut2-000.sgm-3.txt
+    97618498-61898030     /reut2-002.sgm-935.txt
+    97618498-61898030     /reut2-013.sgm-567.txt
+    97618498-61898030     /reut2-004.sgm-938.txt
+    97618498-61898030     /reut2-004.sgm-620.txt
+    97618498-92898924     /reut2-018.sgm-316.txt
+    97618498-92898924     /reut2-007.sgm-976.txt
+    97618498-92898924     /reut2-003.sgm-796.txt
+    97618498-92898924     /reut2-006.sgm-176.txt
+    97618498-92898924     /reut2-004.sgm-290.txt
+    97618498-92898924     /reut2-004.sgm-248.txt
+
+The first column is the <Cluster-Id> and the second column is
+<reuters-text-filename>.

Added: mahout/site/trunk/content/mr---map-reduce.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/mr---map-reduce.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/mr---map-reduce.mdtext (added)
+++ mahout/site/trunk/content/mr---map-reduce.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,13 @@
+Title: MR - Map Reduce
+MapReduce is a framework for processing huge datasets on certain
+kinds of distributable problems using a large number of computers (nodes),
+collectively referred to as a cluster. Computational processing
+can occur on data stored either in a filesystem (unstructured) or within a
+database (structured).
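+
+As a rough, library-free sketch of the two phases (not Hadoop's actual API),
+a word count breaks down into a map phase that emits (key, value) pairs per
+input record and a reduce phase that combines all values sharing a key; a
+real framework distributes both phases across many nodes:
+
+    import java.util.AbstractMap;
+    import java.util.ArrayList;
+    import java.util.HashMap;
+    import java.util.List;
+    import java.util.Map;
+
+    public class MapReduceSketch {
+
+      public static void main(String[] args) {
+        String[] records = {"the quick brown fox", "the lazy dog"};
+
+        // Map phase: emit one (word, 1) pair per word.
+        List<Map.Entry<String, Integer>> mapped = new ArrayList<Map.Entry<String, Integer>>();
+        for (String record : records) {
+          for (String word : record.split(" ")) {
+            mapped.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
+          }
+        }
+
+        // Shuffle + reduce phase: sum the values for each key.
+        Map<String, Integer> counts = new HashMap<String, Integer>();
+        for (Map.Entry<String, Integer> pair : mapped) {
+          Integer current = counts.get(pair.getKey());
+          counts.put(pair.getKey(), (current == null ? 0 : current) + pair.getValue());
+        }
+        System.out.println(counts);   // e.g. {the=2, quick=1, ...}
+      }
+    }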
+
+Also written M/R.
+
+
+See also:
+* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
+* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)

Added: mahout/site/trunk/content/naivebayes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/naivebayes.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/naivebayes.mdtext (added)
+++ mahout/site/trunk/content/naivebayes.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,39 @@
+Title: NaiveBayes
+<a name="NaiveBayes-NaiveBayes"></a>
+# Naive Bayes
+
+Naive Bayes is an algorithm that can be used to classify objects into
+(usually binary) categories. It is one of the most common learning algorithms
+in spam filters. Despite its simplicity and rather naive assumptions it has
+proven to work surprisingly well in practice.
+
+Before applying the algorithm, the objects to be classified need to be
+represented by numerical features. In the case of e-mail spam each feature
+might indicate whether some specific word is present or absent in the mail
+to classify. The algorithm comes in two phases: Learning and application.
+During learning, a set of feature vectors is given to the algorithm, each
+vector labeled with the class of the object it represents. From
+that it is deduced which combinations of features appear with high
+probability in spam messages. Given this information, during application
+one can easily compute the probability of a new message being either spam
+or not.
+
+The algorithm makes several assumptions that are not true for most
+datasets but that make computations easier. Probably the worst is that all
+features of an object are considered independent. In practice, that means
+that having already found the phrase "Statue of Liberty" in a text does not
+influence the probability of seeing the phrase "New York" as well.
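+
+For illustration only, here is a minimal, library-free sketch of the scoring
+step (not Mahout's implementation): given per-class prior log-probabilities
+and per-feature log-likelihoods estimated during learning, a new object is
+assigned the class with the highest summed log-probability. The model values
+below are made up.
+
+    public class NaiveBayesScore {
+
+      // Because features are assumed independent, the log-probability of a
+      // class given the present features is the class prior plus the sum of
+      // the per-feature log-likelihoods for that class.
+      static int classify(double[] logPrior, double[][] logLikelihood, int[] present) {
+        int best = -1;
+        double bestScore = Double.NEGATIVE_INFINITY;
+        for (int c = 0; c < logPrior.length; c++) {
+          double score = logPrior[c];
+          for (int f : present) {
+            score += logLikelihood[c][f];
+          }
+          if (score > bestScore) {
+            bestScore = score;
+            best = c;
+          }
+        }
+        return best;
+      }
+
+      public static void main(String[] args) {
+        // Hypothetical model: 2 classes (0 = ham, 1 = spam), 3 word features.
+        double[] logPrior = {Math.log(0.7), Math.log(0.3)};
+        double[][] logLikelihood = {
+            {Math.log(0.5), Math.log(0.4), Math.log(0.1)},
+            {Math.log(0.1), Math.log(0.3), Math.log(0.6)},
+        };
+        int[] message = {1, 2};   // words 1 and 2 occur in the new message
+        System.out.println(classify(logPrior, logLikelihood, message));  // 1 (spam)
+      }
+    }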
+
+<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a>
+## Strategy for a parallel Naive Bayes
+
+See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)
+.
+
+
+<a name="NaiveBayes-Examples"></a>
+## Examples
+
+[20Newsgroups](20newsgroups.html)
+ - Example code showing how to train and use the Naive Bayes classifier
+using the 20 Newsgroups data available at [http://people.csail.mit.edu/jrennie/20Newsgroups/](http://people.csail.mit.edu/jrennie/20Newsgroups/).

Added: mahout/site/trunk/content/neural-network.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/neural-network.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/neural-network.mdtext (added)
+++ mahout/site/trunk/content/neural-network.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,16 @@
+Title: Neural Network
+<a name="NeuralNetwork-NeuralNetworks"></a>
+# Neural Networks
+
+Neural networks are a means for classifying multi-dimensional objects. We
+concentrate on implementing backpropagation networks with one hidden layer,
+as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf).
+Those networks are capable of learning not only linear separating
+hyperplanes but arbitrary decision boundaries.
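+
+A minimal, library-free sketch of the forward pass of such a one-hidden-layer
+network follows (the weights are arbitrary illustrative values, not a trained
+model; backpropagation would adjust the two weight matrices from the output
+error):
+
+    public class OneHiddenLayerNet {
+
+      static double sigmoid(double x) {
+        return 1.0 / (1.0 + Math.exp(-x));
+      }
+
+      // input -> sigmoid hidden layer -> sigmoid output
+      static double[] forward(double[] input, double[][] wHidden, double[][] wOutput) {
+        double[] hidden = new double[wHidden.length];
+        for (int h = 0; h < wHidden.length; h++) {
+          double sum = 0.0;
+          for (int i = 0; i < input.length; i++) {
+            sum += wHidden[h][i] * input[i];
+          }
+          hidden[h] = sigmoid(sum);
+        }
+        double[] output = new double[wOutput.length];
+        for (int o = 0; o < wOutput.length; o++) {
+          double sum = 0.0;
+          for (int h = 0; h < hidden.length; h++) {
+            sum += wOutput[o][h] * hidden[h];
+          }
+          output[o] = sigmoid(sum);
+        }
+        return output;
+      }
+
+      public static void main(String[] args) {
+        double[][] wHidden = {{0.5, -0.4}, {0.3, 0.8}};   // 2 hidden units, 2 inputs
+        double[][] wOutput = {{1.2, -0.7}};               // 1 output unit
+        System.out.println(forward(new double[] {1.0, 0.0}, wHidden, wOutput)[0]);
+      }
+    }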
+
+<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a>
+## Strategy for parallel backpropagation network
+
+
+<a name="NeuralNetwork-Designofimplementation"></a>
+## Design of implementation

Added: mahout/site/trunk/content/online-passive-aggressive.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/online-passive-aggressive.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/online-passive-aggressive.mdtext (added)
+++ mahout/site/trunk/content/online-passive-aggressive.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,29 @@
+Title: Online Passive Aggressive
+# Online Passive Aggressive
+
+Implements [the online passive-aggressive algorithms paper](http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseer.ist.psu.edu%2Fviewdoc%2Fdownload%3Bjsessionid%3DF4743238B0EF35EB396A5ABFF1332021%3Fdoi%3D10.1.1.61.5120%26rep%3Drep1%26type%3Dpdf&rct=j&q=online%20passive%20aggressive&ei=elvWTa6jBcfHrQf8o52KBg&usg=AFQjCNGqNjaHyWgT4Z3QrK7hEqSTGM10YQ&sig2=-szWIrzBLoQ52jBER9-I0Q&cad=rja).
+
+Use cases:
+
+  When you have many classes that are linearly separable and want a fast
+online learner to get results quickly.
+
+Pre-requisites:
+
+  Data must be shuffled and normalized either between 0..1 or by mean and
+standard deviation.
+
+Technical details:
+
+  The training approach taken is to minimize the ranking loss of the
+correct label vs. the incorrect ones. We define this loss as hinge(1 -
+correct label score + wrong label score), where the wrong label score is the
+score of the highest-scoring label that is not the correct label. The hinge
+function is hinge(x) = x if x > 0, 0 otherwise.
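+
+A minimal, library-free sketch of that loss and a passive-aggressive style
+update is shown below. One weight vector per class is assumed, and the step
+scaling (learning rate times loss over the squared norm of the example) is
+only one common choice from the passive-aggressive literature, not
+necessarily the exact scaling used by Mahout:
+
+    public class PassiveAggressiveSketch {
+
+      static double dot(double[] a, double[] b) {
+        double sum = 0.0;
+        for (int i = 0; i < a.length; i++) {
+          sum += a[i] * b[i];
+        }
+        return sum;
+      }
+
+      // One online step: compute the ranking hinge loss and, if it is positive,
+      // push the correct class weights toward x and the offending class away.
+      static void train(double[][] weights, double[] x, int correct, double learningRate) {
+        double correctScore = dot(weights[correct], x);
+        int worst = -1;
+        double worstScore = Double.NEGATIVE_INFINITY;
+        for (int c = 0; c < weights.length; c++) {
+          double score = dot(weights[c], x);
+          if (c != correct && score > worstScore) {
+            worstScore = score;
+            worst = c;
+          }
+        }
+        // hinge(1 - correct label score + wrong label score)
+        double loss = Math.max(0.0, 1.0 - correctScore + worstScore);
+        if (loss > 0.0) {
+          double step = learningRate * loss / (2.0 * dot(x, x));
+          for (int i = 0; i < x.length; i++) {
+            weights[correct][i] += step * x[i];
+            weights[worst][i] -= step * x[i];
+          }
+        }
+      }
+
+      public static void main(String[] args) {
+        double[][] weights = new double[3][2];   // 3 classes, 2 features
+        double[] example = {0.2, 0.8};           // assumed already normalized
+        train(weights, example, 1, 1.0);         // one online step toward class 1
+        System.out.println(weights[1][0] + " " + weights[1][1]);
+      }
+    }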
+
+Parameters:
+
+  There is only one, learningRate. You set it to a larger number to
+converge faster, or a smaller number to be more cautious. The normal way to
+choose it is via cross-validation. Good values to try are 0.1, 1.0 and 10.0.

Added: mahout/site/trunk/content/online-viterbi.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/online-viterbi.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/online-viterbi.mdtext (added)
+++ mahout/site/trunk/content/online-viterbi.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,22 @@
+Title: Online Viterbi
+Online Viterbi is a Viterbi algorithm implementation which decodes the hidden
+variable sequence from a given sequence of observed variables, emitting each
+part of the result as soon as it can be decoded. In some cases this algorithm
+runs in constant space and asymptotically the same time as the normal Viterbi
+algorithm [1].
+
+Online Viterbi stores a special compressed tree representation of
+backpointers in which only potentially optimal paths of hidden states are
+present. When they all intersect in one point, the backward pass can be
+performed from that state.
+
+<a name="OnlineViterbi-Usage"></a>
+### Usage
+
+You can use Online Viterbi just as the normal Viterbi by running
+"bin/mahout viterbi -online" for sequential computation and "bin/mahout
+poviterbi" for parallel.
+
+<a name="OnlineViterbi-References"></a>
+### References
+
+[1] Rastislav Sramek. The Online Viterbi Algorithm (Master's Thesis), 2007.

Added: mahout/site/trunk/content/overview.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/overview.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/overview.mdtext (added)
+++ mahout/site/trunk/content/overview.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,34 @@
+Title: Overview
+<a name="Overview-OverviewofMahout"></a>
+# Overview of Mahout
+
+Mahout's goal is to build scalable machine learning libraries. By
+scalable we mean: 
+* Scalable to reasonably large data sets. Our core algorithms for
+clustering, classification and batch-based collaborative filtering are
+implemented on top of Apache Hadoop using the map/reduce paradigm. However,
+we do not restrict contributions to Hadoop-based implementations:
+contributions that run on a single node or on a non-Hadoop cluster are
+welcome as well. The core libraries are highly optimized to allow for good
+performance also for non-distributed algorithms.
+* Scalable to support your business case. Mahout is distributed under a
+commercially friendly Apache Software license.
+* Scalable community. The goal of Mahout is to build a vibrant, responsive,
+diverse community to facilitate discussions not only on the project itself
+but also on potential use cases. Come to the mailing lists to find out
+more.
+
+
+Currently Mahout mainly supports four use cases: Recommendation mining
+takes users' behavior and from that tries to find items users might like.
+Clustering takes e.g. text documents and groups them into groups of
+topically related documents. Classification learns from existing
+categorized documents what documents of a specific category look like and
+is able to assign unlabelled documents to the (hopefully) correct category.
+Frequent itemset mining takes a set of item groups (terms in a query
+session, shopping cart content) and identifies which individual items
+usually appear together. 
+
+Interested in helping? See the [Wiki](mahout-wiki.html)
+ or send us an email. Also note, we are just getting off the ground, so
+please be patient as we get the various infrastructure pieces in place.

Added: mahout/site/trunk/content/parallel-frequent-pattern-mining.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/parallel-frequent-pattern-mining.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/parallel-frequent-pattern-mining.mdtext (added)
+++ mahout/site/trunk/content/parallel-frequent-pattern-mining.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,180 @@
+Title: Parallel Frequent Pattern Mining
+Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper [http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf)
+with some optimisations in mining the data.
+
+Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the remaining N features, we find
+the top K closed patterns for each of them, generating a total of NxK
+patterns. FPGrowth is a generic implementation; we can use any
+Object type to denote a feature. The current implementation requires you to use
+a String as the object type. You may implement a version for any object by
+creating Iterators, Convertors and TopKPatternWritable for that particular
+object. For more information please refer to the package
+org.apache.mahout.fpm.pfpgrowth.convertors.string.
+
+    e.g.:
+     FPGrowth<String> fp = new FPGrowth<String>();
+     Set<String> features = new HashSet<String>();
+     fp.generateTopKStringFrequentPatterns(
+         new StringRecordIterator(
+             new FileLineIterable(new File(input), encoding, false), pattern),
+         fp.generateFList(
+             new StringRecordIterator(
+                 new FileLineIterable(new File(input), encoding, false), pattern),
+             minSupport),
+         minSupport,
+         maxHeapSize,
+         features,
+         new StringOutputConvertor(
+             new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
+
+* The first argument is the iterator of transactions, in this case an
+Iterator<List<String>>
+* The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator
+* The third argument is the minimum support of the patterns to be generated
+* The fourth argument is the maximum number of patterns to be mined for
+each feature
+* The fifth argument is the set of features for which the frequent patterns
+have to be mined
+* The last argument is an output collector which takes [key, value] pairs
+of Feature and TopK Patterns of the format [String,
+List<Pair<List<String>, Long>>] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in a Sequence
+File output format
+
+<a name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
+## Running Frequent Pattern Growth via command line
+
+The command line launcher for string transaction data
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver has other features including
+specifying the regex pattern for splitting a string line of a transaction
+into the constituent features.
+
+Input files have to be in the following format.
+
+<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....
+
+Instead of a tab you could use , or | since the default tokenization is done using the java regex pattern <pre><code>[ ,\t]*[,|\t][ ,\t]*</code></pre>
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.
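+
+As a purely hypothetical illustration (the document ids and tokens below are
+made up, with a tab after the id), a two-transaction input file in this
+format might look like:
+
+    doc1	milk bread butter
+    doc2	bread beer diapers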
+
+$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format. 
+Another sample file is accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
+As a quick test, try this:
+
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+The minimumSupport parameter -s is the minimum number of times a pattern
+or a feature needs to occur in the dataset so that it is included in the
+patterns generated. You can speed up the process by using a large value of
+s. There are cases where you will have fewer than k patterns for a
+particular feature because the rest don't qualify for the minimum support
+criteria.
+
+Note that the input to the algorithm could be an uncompressed or gz-compressed
+file, or even a directory containing any number of such files.
+We modified the regex to use space to split the tokens. Note that the input
+regex string is escaped.
+
+<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
+## Running Parallel FPGrowth
+
+Running parallel FPGrowth is as easy as changing the flag to -method
+mapreduce and adding the number-of-groups parameter, e.g. -g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
+sequential mode (with 5 GB of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
+seconds in sequential mode.
+
+Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
+ and place it on your hdfs in a folder named accidents. Then, run the
+hadoop version of the FPGrowth job:
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+
+OR to run a dataset of this size in sequential mode on a single machine
+let's give Mahout a lot more memory and only keep features with more than
+300 members:
+
+    export MAHOUT_HEAPSIZE=-Xmx5000m
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+
+The numGroups parameter -g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.
+
+Note that accidents.dat has 340 unique features. So we chose -g 10 to
+split the transactions across 10 shards, where 34 features are mined in
+each shard. (Note: the number of features doesn't need to be exactly
+divisible by g.) The algorithm takes care of calculating the split. For
+better performance on large datasets and clusters, try not to mine for more
+than 20-25 features per shard. Stick to the defaults on a small machine.
+
+The numTreeCacheEntries parameter -tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.
+
+<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
+## Viewing the results
+The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=>TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:
+
+    bin/mahout seqdumper \
+         -s patterns/frequentpatterns/part-?-00000 \
+         -n 4
+
+or replace -n 4 with -c for the count of patterns.
+ 
+Open question: how does one experiment with and monitor these various
+parameters?

Added: mahout/site/trunk/content/parallel-viterbi.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/parallel-viterbi.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/parallel-viterbi.mdtext (added)
+++ mahout/site/trunk/content/parallel-viterbi.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,74 @@
+Title: Parallel Viterbi
+The Viterbi algorithm is an inference algorithm (synonyms: segmentation,
+decoding, etc.) for Hidden Markov Models [1] which finds the most likely
+sequence of hidden states for a given sequence of observed states.
+
+Apache Mahout has both [sequential](hidden-markov-models.html)
+and parallel (that's what you're reading about) implementations of the
+algorithm.
+
+A detailed presentation about the parallel Viterbi implementation can be found
+[here](http://modis.ispras.ru/seminar/wp-content/uploads/2011/11/Mahout_Viterbi.pdf)
+(in Russian).
+
+<a name="ParallelViterbi-Parallelizationstrategy"></a>
+### Parallelization strategy
+
+The strategy is quite straightforward and based on data parallelism. There are
+some studies on parallelizing Viterbi (and Belief Propagation, which is the
+inference algorithm for loop-less Markov Random Fields and is quite similar to
+Viterbi), but at the time of writing this article none of them
+seem to be applicable to the MapReduce paradigm.
+
+For example, the forward pass of Viterbi could be represented in terms of
+matrix computations (it being a dynamic programming algorithm) and thus
+parallelized, but the overhead of MapReduce would be greater than the
+benefit of parallel matrix multiplication.
+
+Input sequences of observed variables are divided into chunks of some
+length, small enough to store O(N*K) data in main memory. The set of all
+chunks with number N is called "series number N". The algorithm processes the
+data from series number N-1 to series number N (or vice versa), performing
+forward and backward Viterbi passes independently for each chunk (and
+consequently for each sequence) in the reducers. Only data that is necessary
+for the computation of the next series is emitted as direct output of the
+reducers; all other data is collected in the background. For example, when
+performing the forward Viterbi pass, only the probabilities of the last hidden
+state are necessary for the next step; the backpointer tables can be written
+in parallel to local storage since they are needed only for the backward pass.
+
+If all sequences are of approximately the same length and the number of
+sequences to decode is much larger than the number of reducers, O(N*M/K) time
+is required to decode them in parallel (N is the length of each sequence, M is
+the number of sequences, K is the number of reducers).
+
+<a name="ParallelViterbi-Dataformat"></a>
+### Data format
+
+Each sequence of observed states must be stored in sequence files, where
+the key is the name of the sequence and the value is an
+ObservedSequenceWritable in which the chunk number, the data length and the
+data itself are stored. At the moment this is a hardcoded requirement, but it
+should be easy to implement any input file format that provides this
+information.
+
+The easiest way to convert plain text files with space-delimited numbers
+of observed states to this format is to use "bin/mahout hmmchunks".
+
+After parallel Viterbi has finished, the decoded sequences are stored in
+sequence files, one for each chunk (the key is the chunk number, the value is
+a HiddenSequenceWritable). They can be unchunked back to plain text
+space-delimited numbers of hidden states with "bin/mahout hmmchunks
+-unchunk".
+
+<a name="ParallelViterbi-Usage"></a>
+### Usage
+
+Run "bin/mahout pviterbi" and see what it wants from you. That is:&nbsp;
+
+* serialized HmmModel (i.e. by LossyHmmModelSerializer class)
+* input data (observed sequences) in the format described above
+* paths for temporary storage (i.e. backpointers) and for decoded sequences
+
+*References*
+
+1. [Wikipedia article](http://en.wikipedia.org/wiki/Viterbi_algorithm)

Added: mahout/site/trunk/content/partial-implementation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/partial-implementation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/partial-implementation.mdtext (added)
+++ mahout/site/trunk/content/partial-implementation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,135 @@
+Title: Partial Implementation
+<a name="PartialImplementation-Introduction"></a>
+# Introduction
+
+This quick start page shows how to build a decision forest using the
+partial implementation. This tutorial also explains how to use the decision
+forest to classify new data.
+Partial Decision Forests is a mapreduce implementation where each mapper
+builds a subset of the forest using only the data available in its
+partition. This allows building forests using large datasets as long as
+each partition can be loaded in-memory.
+
+<a name="PartialImplementation-Steps"></a>
+# Steps
+<a name="PartialImplementation-Downloadthedata"></a>
+## Download the data
+* The current implementation is compatible with the UCI repository file
+format. In this example we'll use the NSL-KDD dataset because it is large
+enough to show the performance of the partial implementation.
+You can download the dataset here: http://nsl.cs.unb.ca/NSL-KDD/
+You can either download the full training set "KDDTrain+.ARFF", or a 20%
+subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this
+tutorial), and the test set "KDDTest+.ARFF".
+* Open the train and test files and remove all the lines that begin with
+'@'. All those lines are at the top of the files. Actually you can keep
+those lines somewhere, because they'll help us describe the dataset to
+Mahout.
+* Put the data in HDFS: <pre><code>
+$HADOOP_HOME/bin/hadoop fs -mkdir testdata
+$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata</code></pre>
+
+<a name="PartialImplementation-BuildtheJobfiles"></a>
+## Build the Job files
+* In $MAHOUT_HOME/ run: <pre><code>mvn clean install -DskipTests</code></pre>
+
+<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a>
+## Generate a file descriptor for the dataset: 
+run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+        $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar \
+        org.apache.mahout.df.tools.Describe \
+        -p testdata/KDDTrain+.arff \
+        -f testdata/KDDTrain+.info \
+        -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+
+The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes
+of the data. In this case, it means 1 numerical (N) attribute, followed by
+3 categorical (C) attributes, and so on; L indicates the label. You can also
+use 'I' to ignore some attributes.
+
+<a name="PartialImplementation-Runtheexample"></a>
+## Run the example
+
+
+    $HADOOP_HOME/bin/hadoop jar \
+        $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+        org.apache.mahout.df.mapreduce.BuildForest \
+        -Dmapred.max.split.size=1874231 \
+        -oob -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info \
+        -sl 5 -p -t 100 -o nsl-forest
+
+which builds 100 trees (-t argument) using the partial implementation (-p).
+Each tree is built using 5 randomly selected attributes per node (-sl
+argument). The example computes the out-of-bag error (-oob) and outputs the
+decision forest in the "nsl-forest" directory (-o).
+The number of partitions is controlled by the -Dmapred.max.split.size
+argument that indicates to Hadoop the maximum size of each partition, in this
+case 1/10 of the size of the dataset. Thus 10 partitions will be used.
+IMPORTANT: using fewer partitions should give better classification results,
+but needs a lot of memory. So if the jobs are failing, try increasing the
+number of partitions.
+* The example outputs the Build Time and the oob error estimation
+
+
+    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate :
+0.002325895231517865
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in:
+nsl-forest/forest.seq
+
+
+<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a>
+## Using the Decision Forest to Classify new data
+run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+        $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+        org.apache.mahout.df.mapreduce.TestForest \
+        -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info \
+        -m nsl-forest -a -mr -o predictions
+
+This will compute the predictions for the "KDDTest+.arff" dataset (-i argument)
+using the same data descriptor generated for the training dataset (-ds) and
+the decision forest built previously (-m). Optionally (if the test dataset
+contains the labels of the tuples) run the analyzer to compute the
+confusion matrix (-a), and you can also store the predictions in a text
+file or a directory of text files (-o). Passing the (-mr) parameter will use
+Hadoop to distribute the classification.
+
+* The example should output the classification time and the confusion
+matrix
+
+
+    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s
+355
+    10/03/13 18:08:56 INFO mapreduce.TestForest:
+=======================================================
+    Summary
+    -------------------------------------------------------
+    Correctly Classified Instances		:      17657	   78.3224%
+    Incorrectly Classified Instances	:	4887	   21.6776%
+    Total Classified Instances		:      22544
+    
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a	b	<--Classified as
+    9459	252	 |  9711	a     = normal
+    4635	8198	 |  12833	b     = anomaly
+    Default Category: unknown: 2
+
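+As a quick sanity check, the summary figures follow directly from the
+confusion matrix; here is a minimal Python sketch (not part of Mahout) that
+recomputes them:
+
+<pre><code># Sketch: recompute the summary figures from the confusion matrix above.
+# Rows are the true classes, columns the predicted classes.
+confusion = {
+    "normal":  {"normal": 9459, "anomaly": 252},
+    "anomaly": {"normal": 4635, "anomaly": 8198},
+}
+
+correct = sum(confusion[c][c] for c in confusion)
+total = sum(sum(row.values()) for row in confusion.values())
+print(correct, total - correct, "%.4f%%" % (100.0 * correct / total))
+# 17657 4887 78.3224%</code></pre>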
+
+If the input is a single file, the output will be a single text file; in
+the above example 'predictions' would be one single file. If the input is
+a directory containing, for example, two files 'a.data' and 'b.data', then
+the output will be a directory 'predictions' containing two files
+'a.data.out' and 'b.data.out'.
+
+<a name="PartialImplementation-KnownIssuesandlimitations"></a>
+## Known Issues and limitations
+The "Decision Forest" code is still "a work in progress", many features are
+still missing. Here is a list of some known issues:
+* For now, the training does not support multiple input files. The input
+dataset must be one single file. Classifying new data does support multiple
+input files.
+* The tree building is done when each mapper.close() method is called.
+Because the mappers don't refresh their state, the job can fail when the
+dataset is big and you try to build a large number of trees.

Added: mahout/site/trunk/content/patch-check-list.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/patch-check-list.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/patch-check-list.mdtext (added)
+++ mahout/site/trunk/content/patch-check-list.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,33 @@
+Title: Patch Check List
+So, you want to apply a patch?	Here are tips, traps, etc. for dealing with
+patches (in no particular order):
+
+1. Get a fresh copy of trunk.  Or at least make sure you are up to date and
+clean your build area.	For complex patches, it is recommended you deal
+with a fresh checkout.
+1. Look at the patch and see where it is applied.  Ideally it is generated
+from the root, but not everyone does this, especially for contrib areas.
+1. Run patch -p 0 -i <path to patch>.  Add --dry-run if you want to see
+what would happen without touching your checkout.
+1. Did the author write unit tests?  Are the unit tests worthwhile?
+1. How are the benchmark results?  contrib/benchmarker may be used to test
+performance in before/after scenarios.
+1. Are the licenses correct on newly added files? Has an ASF license been
+granted?
+1. Update CHANGES.txt.  Give proper credit to the authors.
+1. Make sure you update JIRA by assigning the issue to you so that others
+know you are working on it.
+1. If it is a complex change and you have added to the original author's
+patch, it is suggested that you create a new patch and attach that to JIRA
+so that it can be discussed.
+1. How's the documentation, esp. the javadocs?
+1. Before committing, make sure you add any new documents to SVN.  Just b/c
+the patch added them doesn't mean you have.
+1. Run all unit tests, verify all tests pass.
+1. Generate javadocs, verify no javadoc errors/warnings were introduced by
+the patch.
+1. Put in a meaningful commit message.  Reference the JIRA issue when
+appropriate.
+1. Remember to update the issue in JIRA when you have completed it.
+1. From the top directory, run "ant rat-sources" to make sure all the files
+have license headers.

Added: mahout/site/trunk/content/pearsoncorrelation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/pearsoncorrelation.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/pearsoncorrelation.mdtext (added)
+++ mahout/site/trunk/content/pearsoncorrelation.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,24 @@
+Title: PearsonCorrelation
+The Pearson correlation measures the degree to which two series of
+numbers tend to move together -- values in corresponding positions tend to
+be high together, or low together. In particular it measures the strength
+of the linear relationship between the two series, the degree to which one
+can be estimated as a linear function of the other. It is often used in
+collaborative filtering as a similarity metric on users or items; users
+that tend to rate the same items high, or low, have a high Pearson
+correlation and therefore are "similar".
+
+The Pearson correlation can behave very badly when small counts are
+involved.  For example, if you compare any two sequences that contain only
+two values each, you always get a correlation of 1 or -1, regardless of the
+actual values.  To some degree, this problem can be avoided by not
+computing correlations for short sequences (with fewer than, say, 10
+values).
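+
+The following minimal Python sketch (standard library only, not Mahout code)
+computes the Pearson correlation and shows the degenerate two-value case
+described above:
+
+<pre><code>import math
+
+def pearson(xs, ys):
+    # Plain Pearson correlation; no guard for constant series.
+    n = len(xs)
+    mean_x = sum(xs) / n
+    mean_y = sum(ys) / n
+    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
+    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
+    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
+    return cov / (sd_x * sd_y)
+
+# Two users rating the same five items: clearly correlated.
+print(pearson([5, 4, 4, 2, 1], [4, 5, 3, 2, 2]))   # ~0.79
+
+# Only two co-rated items: the correlation is always +1 or -1,
+# no matter what the actual ratings are.
+print(pearson([5, 1], [3, 2]))                     # 1.0</code></pre>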
+
+Pearson correlation is sometimes used in collaborative filtering to define
+similarity between the ratings of two users on a common set of items.  In
+this application, it is a reasonable measure if there is sufficient
+overlap.  Unfortunately, it cannot take into account how large that overlap
+is relative to each user's full set of ratings.
+
+See Also
+* [http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient](http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)

Added: mahout/site/trunk/content/perceptron-and-winnow.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/perceptron-and-winnow.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/perceptron-and-winnow.mdtext (added)
+++ mahout/site/trunk/content/perceptron-and-winnow.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,36 @@
+Title: Perceptron and Winnow
+<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
+# Classification with Perceptron or Winnow
+
+Both algorithms are comparably simple linear classifiers. Given training
+data in some n-dimensional vector space that is annotated with binary
+labels, the algorithms are guaranteed to find a linear separating
+hyperplane if one exists. In contrast to the Perceptron, Winnow works
+only for binary feature vectors.
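+
+For illustration, here is a minimal Python sketch of the classic perceptron
+update rule on a toy problem with labels in {-1, +1}; it is not the code
+from the patch:
+
+<pre><code># Minimal perceptron: repeatedly add/subtract misclassified examples
+# to the weight vector until everything on this toy set is separated.
+def train_perceptron(examples, dims, epochs=100, learning_rate=1.0):
+    w = [0.0] * dims
+    b = 0.0
+    for _ in range(epochs):
+        errors = 0
+        for x, label in examples:          # label is -1 or +1
+            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
+            if label * activation <= 0:    # misclassified (or on the boundary)
+                w = [wi + learning_rate * label * xi for wi, xi in zip(w, x)]
+                b += learning_rate * label
+                errors += 1
+        if errors == 0:                    # converged on linearly separable data
+            break
+    return w, b
+
+# Tiny linearly separable toy problem (e.g. binary term-presence features).
+data = [([1, 0], +1), ([1, 1], +1), ([0, 1], -1), ([0, 0], -1)]
+print(train_perceptron(data, dims=2))</code></pre>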
+
+For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron
+
+Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
+
+Although the algorithms are comparably simple, they still work quite well
+for text classification and are fast to train even on huge example sets.
+In contrast to Naive Bayes, they are not based on the assumption that all
+features (in the domain of text classification: all terms in a document)
+are independent.
+
+<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
+## Strategy for parallelisation
+
+Currently the strategy for parallelisation is simple: given there is enough
+training data, split the training data, train a classifier on each split,
+and then average the resulting hyperplanes.
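+
+A minimal sketch of that averaging step, assuming each split produced a
+weight vector and bias as in the perceptron sketch above:
+
+<pre><code># Sketch: average the hyperplanes (w, b) trained on separate splits.
+def average_hyperplanes(models):
+    n = len(models)
+    dims = len(models[0][0])
+    avg_w = [sum(w[i] for w, _ in models) / n for i in range(dims)]
+    avg_b = sum(b for _, b in models) / n
+    return avg_w, avg_b
+
+# e.g. three classifiers trained on three splits of the training data
+models = [([3.0, -1.0], -1.0), ([2.0, -2.0], 0.0), ([4.0, -1.0], -2.0)]
+print(average_hyperplanes(models))   # ([3.0, -1.333...], -1.0)</code></pre>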
+
+<a name="PerceptronandWinnow-Roadmap"></a>
+## Roadmap
+
+Currently the patch only contains the code for the classifier itself. It is
+planned to provide unit tests and at least one example based on the WebKB
+dataset by the end of November for the serial version. After that the
+parallelisation will be added.

Added: mahout/site/trunk/content/please-remove-this-page.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/please-remove-this-page.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/please-remove-this-page.mdtext (added)
+++ mahout/site/trunk/content/please-remove-this-page.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,2 @@
+Title: Please remove this page
+blank

Added: mahout/site/trunk/content/powered-by-mahout.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/powered-by-mahout.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/powered-by-mahout.mdtext (added)
+++ mahout/site/trunk/content/powered-by-mahout.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,122 @@
+Title: Powered By Mahout
+* [Intro](#PoweredByMahout-Intro)
+* [Commercial Use](#PoweredByMahout-CommercialUse)
+* [Academic Use](#PoweredByMahout-AcademicUse)
+* [Powered By Logos](#PoweredByMahout-PoweredByLogos)
+<a name="PoweredByMahout-Intro"></a>
+# Intro
+Are you using Mahout to do Machine Learning?  Care to share?
+
+*NOTE: Please add links in alphabetical order.	Links here do NOT imply
+endorsement by Mahout, its committers or the Apache Software Foundation and
+are for informational purposes only.*
+
+<a name="PoweredByMahout-CommercialUse"></a>
+# Commercial Use
+
+* Adobe AMP uses Mahout's clustering algorithms to increase video
+consumption by better user targeting. See [http://nosql.mypopescu.com/post/2082712431/hbase-and-hadoop-at-adobe](http://nosql.mypopescu.com/post/2082712431/hbase-and-hadoop-at-adobe)
+* Amazon's Personalization Platform -- See [http://www.linkedin.com/groups/Apache-Mahout-2182513](http://www.linkedin.com/groups/Apache-Mahout-2182513)
+* [AOL ](http://www.aol.com)
+ uses Mahout for shopping recommendations. See [http://www.slideshare.net/kryton/the-data-layer]
+* [Booz Allen Hamilton ](http://www.boozallen.com/)
+ uses Mahout's clustering algorithms. See [http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010]
+* [Buzzlogic](http://www.buzzlogic.com)
+ uses Mahout's clustering algorithms to improve ad targeting
+* [Cull.tv](http://cull.tv/)
+ uses modified Mahout algorithms for content recommendations
+* ![DataMine Lab](http://cdn.dataminelab.com/favicon.ico) [DataMine Lab](http://dataminelab.com)
+ uses Mahout's recommendation and clustering algorithms to improve our
+clients' ad targeting.
+* [Drupal](http://drupal.org/project/recommender)
+ uses Mahout to provide open source content recommendation solutions.
+* [Foursquare](http://www.foursquare.com)
+ uses Mahout for its [recommendation engine](http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/).
+* [Idealo](http://www.idealo.de)
+ uses Mahout's recommendation engine.
+* [InfoGlutton](http://www.infoglutton.com)
+ uses Mahout's clustering and classification for various consulting
+projects.
+* [Intela](http://www.intela.com/)
+ has implementations of Mahout's recommendation algorithms to select new
+offers to send to customers, as well as to recommend potential customers
+for current offers. We are also working on enhancing our offer categories
+by using the clustering algorithms. We have a [blog post](http://intela.com/best-practices/intela-gets-smarter)
+ where we talk about it.
+* ![ioffer](http://ioffer.com/favicon.ico) [iOffer](http://www.ioffer.com)
+ uses Mahout's Frequent Pattern Mining and Collaborative Filtering to
+recommend items to users.
+* ![kauli](http://kau.li/favicon.ico) [Kauli](http://kau.li/en)
+, a Japanese ad network, uses Mahout's clustering to handle clickstream
+data for predicting audience interests and intents.
+* ![Mendeley](http://mendeley.com/favicon.ico) [Mendeley](http://mendeley.com)
+ uses Mahout to power Mendeley Suggest, a research article recommendation
+service.
+* ![Mippin](http://mippin.com/favicon.ico) [Mippin](http://mippin.com)
+ uses Mahout's collaborative filtering engine to recommend news feeds
+* [Mobage](http://www.slideshare.net/hamadakoichi/mobage-prmu-2011-mahout-hadoop)
+ uses Mahout in their analysis pipeline
+* ![Myrrix](http://myrrix.com/wp-content/uploads/2012/03/favicon.ico) [Myrrix](http://myrrix.com)
+ is a recommender system product built on Mahout.
+* ![NewsCred](http://www.newscred.com/media/img/favicon.ico) [NewsCred](http://platform.newscred.com)
+ uses Mahout to generate clusters of news articles and to surface the
+important stories of the day
+* ![Radoop](http://blog.radoop.eu/favicon.ico) [Radoop](http://radoop.eu)
+ provides a drag-n-drop interface for big data analytics, including Mahout
+clustering and classification algorithms
+* [Sematext](http://www.sematext.com/)
+ uses Mahout for its [Recommendation Engine](http://www.sematext.com/products/recommendation-engine/index.html)
+* [SpeedDate.com](http://www.speeddate.com)
+ uses Mahout's collaborative filtering engine to recommend member profiles
+* [Twitter](http://twitter.com)
+ uses Mahout's LDA implementation for user interest modeling, and maintains
+a (periodically sync'ed with Apache trunk) [fork](http://github.com/twitter/mahout)
+ of Mahout on GitHub
+* [Yahoo!](http://www.yahoo.com)
+ Mail uses Mahout's Frequent Pattern Set Mining.  See [http://www.slideshare.net/hadoopusergroup/mail-antispam]
+* ![imageshack](http://a.imageshack.us/img823/3443/logoyf.gif) [365Media ](http://365media.com/)
+ uses *Mahout's* Classification and Collaborative Filtering algorithms in
+its Real-time system named [UPTIME](http://uptime.365media.com/)
+ and 365Media/Social
+
+<a name="PoweredByMahout-AcademicUse"></a>
+# Academic Use
+
+* [Dicode](https://www.dicode-project.eu/)
+ project uses Mahout's clustering and classification algorithms on top of
+HBase.
+* The course [Large Scale Data Analysis and Data Mining](http://www.dima.tu-berlin.de/menue/studium_und_lehre/aktuelles_semester_sommersemester_2011/aim_3_advanced_information_management)
+ at [TU Berlin](http://www.tu-berlin.de/)
+ uses Mahout to teach students about the parallelization of data
+mining problems with Hadoop and Map/Reduce
+* Mahout is used at Carnegie Mellon University, as a comparable platform to [GraphLab](http://www.graphlab.ml.cmu.edu/)
+.
+* The [ROBUST project](http://www.robust-project.eu/)
+, co-funded by the European Commission, employs Mahout in the large scale
+analysis of online community data.
+* Mahout is used for research and data processing at [Nagoya Institute of Technology](http://www.nitech.ac.jp/eng/schools/grad/cse.html)
+, in the context of a large-scale citizen participation platform project,
+funded by the Ministry of Interior of Japan.
+* Several researchers within the [Digital Enterprise Research Institute](http://www.deri.ie)
+ at [NUI Galway](http://www.nuigalway.ie)
+ use Mahout for, e.g., topic mining and modelling of large corpora.
+* We used Mahout in the NoTube EU project, and it saved a lot of time (and
+a brain transplant). The only piece we've used heavily in our apps
+(http://vimeo.com/user3487770, http://notube.tv/) so far is the Taste
+recommender, but I've been digging deeper into the other components. I
+can't claim we're a hugely famous or successful application, but I can say
+without doubt I don't regret using Mahout. It did what it said it would do,
+and easily. One nice thing about this community is that Mahout is not
+over-marketed. If the nature or scale of your problem better suits other
+tools, the Mahout folk will tell you so.
+
+<a name="PoweredByMahout-PoweredByLogos"></a>
+# Powered By Logos
+
+Feel free to use our Powered By logos on your site:
+
+![mahout-logo-poweredby-55](attachments/80899/28016703.png)
+
+
+![mahout-logo-poweredby-100](attachments/80899/28016704.png)

Added: mahout/site/trunk/content/principal-components-analysis.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/principal-components-analysis.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/principal-components-analysis.mdtext (added)
+++ mahout/site/trunk/content/principal-components-analysis.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,23 @@
+Title: Principal Components Analysis
+<a name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a>
+# Principal Components Analysis
+
+PCA is used to reduce a high dimensional data set to lower dimensions. It
+can be used to identify patterns in data and to express the data in a
+lower dimensional space so that similarities and differences are
+highlighted. It is mostly used in face recognition and image compression.
+There are several flaws one has to be aware of when working with PCA:
+
+* Linearity assumption - data is assumed to be linear combinations of some
+basis. There exist non-linear methods such as kernel PCA that alleviate
+that problem.
+* Principal components are assumed to be orthogonal. ICA tries to cope with
+this limitation.
+* Mean and covariance are assumed to be statistically important.
+* Large variances are assumed to have important dynamics.
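+
+For intuition, here is a small numpy-based sketch (not Mahout's
+implementation, and assuming numpy is available) that mean-centers the data
+and projects it onto the top principal components via the SVD:
+
+<pre><code>import numpy as np
+
+def pca(data, k):
+    """Project `data` (rows = samples) onto its top-k principal components."""
+    centered = data - data.mean(axis=0)          # PCA assumes mean-centered data
+    # Rows of vt are the principal directions, ordered by decreasing variance.
+    u, s, vt = np.linalg.svd(centered, full_matrices=False)
+    components = vt[:k]                          # k x n_features
+    explained_variance = (s ** 2) / (len(data) - 1)
+    return centered @ components.T, components, explained_variance[:k]
+
+# Toy example: 2-D points that mostly vary along one direction.
+data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
+                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
+projected, components, variance = pca(data, k=1)
+print(projected.ravel())   # 1-D coordinates along the dominant direction</code></pre>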
+
+<a name="PrincipalComponentsAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="PrincipalComponentsAnalysis-Designofpackages"></a>
+## Design of packages

Added: mahout/site/trunk/content/privacy-policy.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/privacy-policy.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/privacy-policy.mdtext (added)
+++ mahout/site/trunk/content/privacy-policy.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,23 @@
+Title: Privacy Policy
+Information about your use of this website is collected using server access
+logs and a tracking cookie. The collected information consists of the
+following:
+
+* The IP address from which you access the website;
+* The type of browser and operating system you use to access our site;
+* The date and time you access our site;
+* The pages you visit; and
+* The addresses of pages from where you followed a link to our site.
+
+Part of this information is gathered using a tracking cookie set by the
+Google Analytics service and handled by Google as described in their
+privacy policy. See your browser documentation for instructions on how to
+disable the cookie if you prefer not to share this data with Google.
+
+We use the gathered information to help us make our site more useful to
+visitors and to better understand how and when our site is used. We do not
+track or collect personally identifiable information or associate gathered
+data with any personally identifying information from other sources.
+
+By using this website, you consent to the collection of this data in the
+manner and for the purpose described above.

Added: mahout/site/trunk/content/professional-support.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/professional-support.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/professional-support.mdtext (added)
+++ mahout/site/trunk/content/professional-support.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,65 @@
+Title: Professional Support
+<a name="ProfessionalSupport-ProfessionalsupportforMahout"></a>
+# Professional support for Mahout
+
+Add yourself or your company if you are offering support for Mahout
+users. Please keep lists in alphabetical order. An entry here is not an
+endorsement by the Apache Software Foundation nor any of its committers.
+
+
+<a name="ProfessionalSupport-Peopleandcompaniesforhire"></a>
+## People and companies for hire
+
+<table>
+<tr><th> Name </th><th> contact details </th><th> notes </th></tr>
+<tr><td> TutorTeddy.com </td><td> [http://tutorteddy.com/site/free_statistics_help.php](http://tutorteddy.com/site/free_statistics_help.php)
+ </td><td> </td></tr>
+<tr><td> Clogeny </td><td> [http://www.clogeny.com](http://www.clogeny.com)
+ </td><td> Mahout, Hadoop, NoSQL Databases </td></tr>
+<tr><td> Ted Dunning </td><td> tdunning@apache.org </td><td> limited availability </td></tr>
+<tr><td> GridLine </td><td> [http://www.gridline.nl/contact](http://www.gridline.nl/contact)
+ </td><td> specialised in search and&nbsp;thesauri </td></tr>
+<tr><td> Sean Owen </td><td> srowen@gmail.com </td><td> available for simple deployment of
+recommender projects </td></tr>
+<tr><td> Sematext International </td><td> [http://sematext.com/about/contact.html](http://sematext.com/about/contact.html)
+ </td><td> [http://www.sematext.com/]
+ </td></tr>
+<tr><td> Frank Scholten </td><td> frank.scholten@orange11.nl </td><td> Mahout [http://blog.orange11.nl/author/frank/](http://blog.orange11.nl/author/frank/)
+ </td></tr>
+<tr><td> Winterwell </td><td> daniel@winterwell.com </td><td> business/maths concept development &
+algorithms [http://winterwell.com](http://winterwell.com)
+ </td></tr>
+<tr><td> Jagdish Nomula </td><td> nomulaj@gmail.com </td><td> ML, Search, Algorithms, Java [http://www.kosmex.com](http://www.kosmex.com)
+ </td></tr>
+</table>
+
+<a name="ProfessionalSupport-Trainingandcourses"></a>
+## Training and courses
+
+<table>
+<tr><th> Name </th><th> contact details </th><th> notes </th></tr>
+</table>
+
+
+
+<a name="ProfessionalSupport-Talksandpresentations"></a>
+## Talks and presentations
+
+<table>
+<tr><th> Name </th><th> contact details </th><th> notes </th></tr>
+<tr><td> Isabel Drost </td><td> Mail: isabel@apache.org </td><td> If travel and accommodation
+costs are covered, scheduling a talk is a lot easier. </td></tr>
+<tr><td> Frank Scholten </td><td> frank@jteam.nl </td><td> Mahout/Taste [http://blog.jteam.nl/author/frank/](http://blog.jteam.nl/author/frank/)
+ </td></tr>
+</table>
+
+If you are looking for local Apache people, please also consider having a
+look at the [ASF Nearby Mentor Search](http://community.zones.apache.org/).

Added: mahout/site/trunk/content/quick-tour-of-text-analysis-using-the-mahout-command-line.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/quick-tour-of-text-analysis-using-the-mahout-command-line.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/quick-tour-of-text-analysis-using-the-mahout-command-line.mdtext (added)
+++ mahout/site/trunk/content/quick-tour-of-text-analysis-using-the-mahout-command-line.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,215 @@
+Title: Quick tour of text analysis using the Mahout command line
+
+[TOC]
+
+## Introduction
+
+This is a concise quick tour using the mahout command line
+to generate text analysis data. It follows examples from the [Mahout in Action](http://manning.com/owen/)
+book and uses the Reuters-21578 data set. This is one simple
+path through vectorizing text, creating clusters and calculating similar
+documents. The examples will work locally or distributed on a Hadoop
+cluster. With the small data set used here, a local installation is
+probably fast enough.
+
+> This is based on Mahout 0.6.
+There are several changes to the CLI for 0.7 not reflected here.
+
+
+## Generate Mahout vectors from text
+
+Get the [Reuters-21578](http://www.daviddlewis.com/resources/testcollections/reuters21578/)
+files and extract them in "./reuters". They are in SGML format. Mahout
+can also create sequence files from raw text and other formats. At the end
+of this section you will have the text files turned into vectors, which are
+basically lists of weighted tokens. The weights are calculated to indicate
+the importance of each token.
+
+1. Convert from SGML to text:
+	
+	<pre><code>mvn -e -q exec:java \
+	-Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
+	-Dexec.args="reuters/ reuters-extracted/" </code></pre>
+	
+	If you plan to run this example on a hadoop cluster you will
+	need to copy the files to HDFS, which is not covered here.
+
+2. Now turn raw text in a directory into mahout sequence
+files:
+
+    <pre><code>mahout seqdirectory \
+    -c UTF-8 \
+    -i examples/reuters-extracted/ \
+    -o reuters-seqfiles</code></pre>
+
+3. Examine the sequence files with seqdumper:
+    <pre><code> mahout seqdumper -s reuters-seqfiles/chunk-0 | more </code></pre>
+    you should see something like this:
+    
+    <pre><code>Input Path: reuters-seqfiles/chunk-0
+    Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
+    Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79
+    
+    BAHIA COCOA REVIEW
+
+    
+    Showers continued throughout the week in the Bahia cocoa zone, alleviating
+the drought since early January and improving prospects for the coming
+temporao, although normal …</code></pre>
+    
+4. Create tfidf weighted vectors.
+    <pre><code>mahout seq2sparse \
+       -i reuters-seqfiles/ \
+       -o reuters-vectors/ \
+       -ow -chunk 100 \
+       -x 90 \
+       -seq \
+       -ml 50 \
+       -n 2 \
+       -nv </code></pre>
+    
+    This uses the default analyzer and default TFIDF weighting.
+    -n 2 applies L2 normalization, which is good for the cosine distance we use
+    in clustering and for similarity; -x 90 means that if a token appears in 90%
+    of the docs it is considered a stop word; -nv produces named vectors, which
+    makes further data files easier to inspect. A small sketch of TF-IDF
+    weighting and cosine distance appears at the end of this section.
+
+5. Examine the vectors if you like, but they are not really human readable...
+
+    <pre><code>   mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000</code></pre>
+
+6. Examine the tokenized docs to make sure the analyzer is
+filtering out enough (note that the rest of this example used a more
+restrictive Lucene analyzer rather than the default, so your results may
+vary):
+
+    <pre><code>    mahout seqdumper \
+      -s reuters-vectors/tokenized-documents/part-m-00000</code></pre>
+
+    This should show each doc with nice clean tokenized text.
+
+7. Examine the dictionary. It maps token id to token text.
+    <pre><code> mahout seqdumper \
+    -s reuters-vectors/dictionary.file-0 \
+    | more </code></pre>
+    
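+To make the TFIDF weighting, the L2 normalization (-n 2) and the cosine
+distance concrete, here is a tiny standalone Python sketch; it is only an
+illustration, not what seq2sparse does internally:
+
+<pre><code>import math
+from collections import Counter
+
+docs = [["bahia", "cocoa", "review", "cocoa"],
+        ["grain", "prices", "review"],
+        ["cocoa", "prices", "fall"]]
+
+df = Counter(term for doc in docs for term in set(doc))      # document frequency
+n_docs = len(docs)
+
+def tfidf_vector(doc):
+    tf = Counter(doc)
+    vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
+    norm = math.sqrt(sum(w * w for w in vec.values()))       # L2 norm (-n 2)
+    return {t: w / norm for t, w in vec.items()} if norm else vec
+
+def cosine_distance(a, b):
+    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
+    return 1.0 - dot                                          # vectors are unit length
+
+vectors = [tfidf_vector(d) for d in docs]
+print(cosine_distance(vectors[0], vectors[2]))</code></pre>
+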
+## Cluster documents using kmeans
+
+Clustering documents can be done with one of several clustering algorithms
+in Mahout. Perhaps the best known is kmeans, which drops documents into k
+categories. You have to supply k as input along with the vectors. The
+output is k centroids (one vector per cluster) and optionally the documents
+assigned to each cluster.
+    
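+For intuition about what the kmeans job does, here is a bare-bones Python
+sketch of k-means with cosine distance (no sparse vectors, no Hadoop, not the
+Mahout driver):
+
+<pre><code>import math, random
+
+def cosine_distance(a, b):
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return 1.0 - dot / (na * nb)
+
+def kmeans(vectors, k, iterations=10):
+    centroids = random.sample(vectors, k)            # random seeds, like -c with -k
+    for _ in range(iterations):
+        clusters = [[] for _ in range(k)]
+        for v in vectors:                            # assignment step
+            nearest = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
+            clusters[nearest].append(v)
+        for i, members in enumerate(clusters):       # update step: mean of members
+            if members:
+                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
+    return centroids, clusters
+
+vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
+centroids, clusters = kmeans(vectors, k=2)
+print(len(clusters[0]), len(clusters[1]))</code></pre>
+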
+1. Create clusters and assign documents to the clusters
+
+    <pre><code>mahout kmeans \
+    -i reuters-vectors/tfidf-vectors/ \
+    -c reuters-kmeans-centroids \
+    -cl \
+    -o reuters-kmeans-clusters \
+    -k 20 \
+    -ow \
+    -x 10 \
+    -dm org.apache.mahout.common.distance.CosineDistanceMeasure</code></pre>
+
+    If \\-c and \\-k are specified, kmeans will put random seed vectors into the
+    \\-c directory, if \\-c is provided without \\-k then the \\-c directory is
+    assumed to be input and kmeans will use each vector in it to seed the
+    clustering. \\-cl tell kmeans to also assign the input doc vectors to
+    clusters at the end of the process and put them in
+    reuters-kmeans-clusters/clusteredPoints. if \\-cl is not specified then the
+    documents will not be assigned to clusters.
+    
+2. Examine the clusters and perhaps even do some analysis of
+how good the clusters are:
+    
+    <pre><code>mahout clusterdump \
+    -d reuters-vectors/dictionary.file-0 \
+    -dt sequencefile \
+    -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
+    -n 20 \
+    -b 100 \
+    -p reuters-kmeans-clusters/clusteredPoints/</code></pre>
+
+    Note: clusterdump can do some analysis of the quality of clusters but is not
+    shown here.
+    
+3. The clusteredPoints dir has the docs mapped into clusters, and if you created
+vectors with names (seq2sparse -nv) you'll see file names. You also get the
+distance from the centroid, using the distance measure supplied to the
+clustering driver. To look at this use seqdumper:
+    
+    <pre><code>mahout seqdumper \
+    -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
+    | more</code></pre>
+    
+    You will see that the file contains key: clusterid, value:
+wt = % likelihood the vector is in cluster, distance from centroid, named
+vector belonging to the cluster, vector data.
+
+    For kmeans the likelihood will be 1.0 or 0. For
+example:
+
+    <pre><code>Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364  
+    vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318,
+    966:0.396, 3027:0.230, 8816:0.452, 8868:0.308,
+    13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413]</code></pre>
+    
+    Clusters, of course, do not have names. A simple solution is
+to construct a name from the top terms in the centroid as they are output
+from clusterdump.
+    
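+A small sketch of that naming idea, using made-up terms and weights of the
+kind clusterdump prints for a centroid:
+
+<pre><code># Sketch: label a cluster with its highest-weighted centroid terms.
+def cluster_name(top_terms, n=3):
+    # top_terms: list of (term, weight) pairs as reported by clusterdump
+    ranked = sorted(top_terms, key=lambda tw: tw[1], reverse=True)
+    return "/".join(term for term, _ in ranked[:n])
+
+top_terms = [("cocoa", 0.41), ("bahia", 0.33), ("crop", 0.27), ("review", 0.12)]
+print(cluster_name(top_terms))   # cocoa/bahia/crop</code></pre>
+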
+## Calculate several similar docs to each doc in the data
+    
+This will take all docs in the data set and for each
+calculate the 10 most similar docs. This can be used for a "give me more
+like this" feature. The algorithm is fairly fast and requires only three
+mapreduce passes.
+
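+For intuition, here is a brute-force Python sketch of the same idea on a few
+unit-length sparse vectors; the distributed rowsimilarity job avoids the
+all-pairs loop, but its output has the same shape:
+
+<pre><code># Sketch: for every document, find the N most cosine-similar documents.
+def top_similar(vectors, n=10):
+    def dot(a, b):
+        return sum(a.get(t, 0.0) * w for t, w in b.items())
+    result = {}
+    for i, vi in enumerate(vectors):
+        scores = [(j, dot(vi, vj)) for j, vj in enumerate(vectors) if j != i]
+        scores.sort(key=lambda jw: jw[1], reverse=True)
+        result[i] = scores[:n]
+    return result
+
+# unit-length sparse vectors keyed by term id, like the tfidf-vectors
+vectors = [{1: 0.8, 2: 0.6}, {1: 0.6, 2: 0.8}, {3: 1.0}]
+print(top_similar(vectors, n=2))</code></pre>
+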
+1. First create a matrix from the vectors:
+
+    <pre><code>mahout rowid \
+    -i reuters-vectors/tfidf-vectors/part-r-00000 \
+    -o reuters-matrix</code></pre>
+
+
+    Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix
+
+    Note: This does not create a Mahout Matrix object but a
+sequence file, so use seqdumper to examine the results.
+2. Create a collection of similar docs for each row of the
+matrix above:
+
+    <pre><code>mahout rowsimilarity \
+    -i reuters-matrix/matrix \
+    -o reuters-similarity \
+    -r 19515 \
+    --similarityClassname SIMILARITY_COSINE \
+    -m 10 \
+    -ess</code></pre>
+
+    
+3. Examine the similarity list:
+
+    <pre><code>mahout seqdumper -s reuters-similarity/part-r-00000 | more </code></pre>
+
+    <pre><code> Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
+     12793:0.22009858979452146,3275:0.1871791030103281,
+     14613:0.3534278632679437,4411:0.2516380602790199,
+     17520:0.3139731583634198,13611:0.18968888212315968,
+     14354:0.17673965754661425,0:1.0000000000000004} </code></pre>
+
+    Each key is a row id and the value lists the most similar rows with their
+cosine similarities. The docIndex file written by the rowid job maps these
+row ids back to document names:
+
+    <pre><code>Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
+    Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
+    Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
+    Key: 3: Value: /-tmp/reut2-000.sgm-100.txt
+    ... </code></pre>
+
+    
+## Conclusion
+    
+A wide variety of tasks can be performed from the Mahout command
+line. Many parameters available in the Java API are supported, so the
+command line is a good way to get an idea of how Mahout works and gives a
+basis for tuning your own use.