Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/03 22:36:27 UTC

svn commit: r1538467 [16/20] - in /mahout/site/mahout_cms: ./ cgi-bin/ content/ content/css/ content/developers/ content/general/ content/images/ content/js/ content/users/ content/users/basics/ content/users/classification/ content/users/clustering/ c...

Added: mahout/site/mahout_cms/content/overview.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/overview.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/overview.mdtext (added)
+++ mahout/site/mahout_cms/content/overview.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,34 @@
+Title: Overview
+<a name="Overview-OverviewofMahout"></a>
+# Overview of Mahout
+
+Mahout's goal is to build scalable machine learning libraries. By
+scalable we mean:
+* Scalable to reasonably large data sets. Our core algorithms for
+clustering, classification and batch-based collaborative filtering are
+implemented on top of Apache Hadoop using the map/reduce paradigm. However,
+we do not restrict contributions to Hadoop-based implementations:
+contributions that run on a single node or on a non-Hadoop cluster are
+welcome as well. The core libraries are highly optimized to give good
+performance for non-distributed algorithms as well.
+* Scalable to support your business case. Mahout is distributed under a
+commercially friendly Apache Software license.
+* Scalable community. The goal of Mahout is to build a vibrant, responsive,
+diverse community to facilitate discussions not only on the project itself
+but also on potential use cases. Come to the mailing lists to find out
+more.
+
+
+Currently Mahout mainly supports four use cases: Recommendation mining
+takes users' behavior and from that tries to find items users might like.
+Clustering takes e.g. text documents and groups them into clusters of
+topically related documents. Classification learns from existing
+categorized documents what documents of a specific category look like and
+is able to assign unlabelled documents to the (hopefully) correct category.
+Frequent itemset mining takes a set of item groups (terms in a query
+session, shopping cart content) and identifies which individual items
+usually appear together.
+
+Interested in helping? See the [Wiki](http://cwiki.apache.org/confluence/display/MAHOUT)
+ or send us an email. Also note, we are just getting off the ground, so
+please be patient as we get the various infrastructure pieces in place.

Added: mahout/site/mahout_cms/content/users/basics/algorithms.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/algorithms.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/algorithms.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/algorithms.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,215 @@
+Title: Algorithms
+<a name="Algorithms-Algorithms"></a>
+## Algorithms
+
+This section contains links to information, examples, use cases, etc. for
+the various algorithms we intend to implement. Click the individual links
+to learn more. The initial algorithm descriptions have been copied here
+from the original project proposal. The algorithms are grouped by the
+application setting they can be used for. In case of multiple possible
+applications, the version presented in the original paper was chosen;
+versions as implemented in our project will be added as we work on them.
+
+Original Paper: [Map Reduce for Machine Learning on Multicore](http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf)
+
+Papers related to Map Reduce:
+* [Evaluating MapReduce for Multi-core and Multiprocessor Systems](http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf)
+* [Map Reduce: Distributed Computing for Machine Learning](http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf)
+
+For Papers, videos and books related to machine learning in general, see [Machine Learning Resources](machine-learning-resources.html)
+
+Algorithms marked as _integrated_ have an implementation that is integrated
+into the development version of Mahout.
+Algorithms that are currently being developed are annotated with a link to
+the JIRA issue that deals with the specific implementation. Usually these
+issues already contain patches that are more or less complete, depending on
+how much work has been spent on the issue so far. Algorithms that have not
+been touched so far are marked as _open_.
+
+[What, When, Where, Why (but not How or Who)](what,-when,-where,-why-(but-not-how-or-who).html)
+ \- Community tips, tricks, etc. for when to use which algorithm in what
+situations, what to watch out for in terms of errors.  That is, practical
+advice on using Mahout for your problems.
+
+<a name="Algorithms-Classification"></a>
+### Classification
+
+A general introduction to the most common text classification algorithms
+can be found at Google Answers: [http://answers.google.com/answers/main?cmd=threadview&id=225316](http://answers.google.com/answers/main?cmd=threadview&id=225316)
+ For information on the algorithms implemented in Mahout (or scheduled for
+implementation) please visit the following pages.
+
+[Logistic Regression](logistic-regression.html)
+ (SGD)
+
+[Bayesian](bayesian.html)
+
+[Support Vector Machines](support-vector-machines.html)
+ (SVM) (open: [MAHOUT-14](http://issues.apache.org/jira/browse/MAHOUT-14),
+ [MAHOUT-232](http://issues.apache.org/jira/browse/MAHOUT-232) and
+ [MAHOUT-334](https://issues.apache.org/jira/browse/MAHOUT-334))
+
+[Perceptron and Winnow](perceptron-and-winnow.html)
+ (open: [MAHOUT-85](http://issues.apache.org/jira/browse/MAHOUT-85))
+
+[Neural Network](neural-network.html)
+ (open, but [MAHOUT-228](http://issues.apache.org/jira/browse/MAHOUT-228)
+ might help)
+
+[Random Forests](random-forests.html)
+ (integrated - [MAHOUT-122](http://issues.apache.org/jira/browse/MAHOUT-122),
+ [MAHOUT-140](http://issues.apache.org/jira/browse/MAHOUT-140),
+ [MAHOUT-145](http://issues.apache.org/jira/browse/MAHOUT-145))
+
+[Restricted Boltzmann Machines](restricted-boltzmann-machines.html)
+ (open, [MAHOUT-375](http://issues.apache.org/jira/browse/MAHOUT-375),
+ GSOC2010)
+
+[Online Passive Aggressive](online-passive-aggressive.html)
+ (integrated, [MAHOUT-702](http://issues.apache.org/jira/browse/MAHOUT-702))
+
+[Boosting](boosting.html)
+ (awaiting patch commit, [MAHOUT-716](https://issues.apache.org/jira/browse/MAHOUT-716))
+
+[Hidden Markov Models](hidden-markov-models.html)
+ (HMM) (MAHOUT-627, MAHOUT-396, MAHOUT-734) - Training is done in
+Map-Reduce
+
+<a name="Algorithms-Clustering"></a>
+### Clustering
+
+[Reference Reading](reference-reading.html)
+
+[Canopy Clustering](canopy-clustering.html)
+ ([MAHOUT-3](https://issues.apache.org/jira/browse/MAHOUT-3) - integrated)
+
+[K-Means Clustering](k-means-clustering.html)
+ ([MAHOUT-5](https://issues.apache.org/jira/browse/MAHOUT-5) - integrated)
+
+[Fuzzy K-Means](fuzzy-k-means.html)
+ ([MAHOUT-74](https://issues.apache.org/jira/browse/MAHOUT-74) - integrated)
+
+[Expectation Maximization](expectation-maximization.html)
+ (EM) ([MAHOUT-28](http://issues.apache.org/jira/browse/MAHOUT-28))
+
+[Mean Shift Clustering](mean-shift-clustering.html)
+ ([MAHOUT-15](https://issues.apache.org/jira/browse/MAHOUT-15) - integrated)
+
+[Hierarchical Clustering](hierarchical-clustering.html)
+ ([MAHOUT-19](http://issues.apache.org/jira/browse/MAHOUT-19))
+
+[Dirichlet Process Clustering](dirichlet-process-clustering.html)
+ ([MAHOUT-30](http://issues.apache.org/jira/browse/MAHOUT-30) - integrated)
+
+[Latent Dirichlet Allocation](latent-dirichlet-allocation.html)
+ ([MAHOUT-123](http://issues.apache.org/jira/browse/MAHOUT-123) - integrated)
+
+[Spectral Clustering](spectral-clustering.html)
+ ([MAHOUT-363](https://issues.apache.org/jira/browse/MAHOUT-363) - integrated)
+
+[Minhash Clustering](minhash-clustering.html)
+ ([MAHOUT-344](https://issues.apache.org/jira/browse/MAHOUT-344) - integrated)
+
+[Top Down Clustering](top-down-clustering.html)
+ ([MAHOUT-843](https://issues.apache.org/jira/browse/MAHOUT-843) - integrated)
+
+<a name="Algorithms-PatternMining"></a>
+### Pattern Mining
+
+[Parallel FP Growth Algorithm](parallel-frequent-pattern-mining.html)
+ (Also known as Frequent Itemset mining)
+
+<a name="Algorithms-Regression"></a>
+### Regression
+
+[Locally Weighted Linear Regression](locally-weighted-linear-regression.html)
+ (open)
+
+
+<a name="Algorithms-Dimensionreduction"></a>
+### Dimension reduction
+
+[Singular Value Decomposition and other Dimension Reduction Techniques](dimensional-reduction.html)
+ (available since 0.3)
+
+[Stochastic Singular Value Decomposition with PCA workflow](stochastic-singular-value-decomposition.html)
+ (PCA workflow now integrated)
+
+[Principal Components Analysis](principal-components-analysis.html)
+ (PCA) (open)
+
+[Independent Component Analysis](independent-component-analysis.html)
+ (open)
+
+[Gaussian Discriminative Analysis](gaussian-discriminative-analysis.html)
+ (GDA) (open)
+
+<a name="Algorithms-EvolutionaryAlgorithms"></a>
+### Evolutionary Algorithms
+
+*NOTE:* Watchmaker support has been removed as of 0.7
+
+see also: [MAHOUT-56 (integrated)](http://issues.apache.org/jira/browse/MAHOUT-56)
+
+You will find here information, examples, use cases, etc. related to
+Evolutionary Algorithms.
+
+Introductions and Tutorials:
+* [Evolutionary Algorithms Introduction](http://www.geatbx.com/docu/algindex.html)
+* [How to distribute the fitness evaluation using Mahout.GA](mahout.ga.tutorial.html)
+
+Examples:
+* [Traveling Salesman](traveling-salesman.html)
+* [Class Discovery](class-discovery.html)
+
+<a name="Algorithms-Recommenders/CollaborativeFiltering"></a>
+### Recommenders / Collaborative Filtering
+
+Mahout contains both simple non-distributed recommender implementations and
+distributed Hadoop-based recommenders.
+
+ * [Non-distributed recommenders ("Taste")](recommender-documentation.html)
+ (integrated)
+ * [Distributed Item-Based Collaborative Filtering](itembased-collaborative-filtering.html)
+ (integrated)
+ * [Collaborative Filtering using a parallel matrix factorization](collaborative-filtering-with-als-wr.html)
+ (integrated)
+ * [First-timer FAQ](recommender-first-timer-faq.html)
+
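+To give a feel for the non-distributed ("Taste") API, here is a minimal
+user-based recommender sketch; the data file name and the neighborhood size
+below are illustrative assumptions, not taken from any shipped Mahout
+example:
+
+    import java.io.File;
+    
+    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
+    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
+    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
+    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
+    import org.apache.mahout.cf.taste.model.DataModel;
+    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
+    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
+    import org.apache.mahout.cf.taste.recommender.Recommender;
+    import org.apache.mahout.cf.taste.similarity.UserSimilarity;
+    
+    public class TasteExample {
+      public static void main(String[] args) throws Exception {
+        // ratings.csv holds userID,itemID,preference triples (hypothetical file)
+        DataModel model = new FileDataModel(new File("ratings.csv"));
+        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
+        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
+        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
+        // top 5 recommendations for user 1
+        for (RecommendedItem item : recommender.recommend(1L, 5)) {
+          System.out.println(item.getItemID() + " : " + item.getValue());
+        }
+      }
+    }
+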
+<a name="Algorithms-VectorSimilarity"></a>
+### Vector Similarity
+
+Mahout contains implementations that allow one to compare one or more
+vectors with another set of vectors.  This can be useful if one is, for
+instance, trying to calculate the pairwise similarity between all documents
+(or a subset of docs) in a corpus.
+
+* RowSimilarityJob -- Builds an inverted index and then computes distances
+between items that have co-occurrences.  This is a fully distributed
+calculation.
+* VectorDistanceJob -- Does a map side join between a set of "seed" vectors
+and all of the input vectors.
+
+<a name="Algorithms-Other"></a>
+### Other
+
+ * [Collocations](collocations.html)
+
+<a name="Algorithms-Non-MapReducealgorithms"></a>
+### Non-MapReduce algorithms
+
+Some algorithms and applications have appeared on the mailing list that
+have not yet been published in map-reduce form. As we do not restrict
+ourselves to Hadoop-only versions, these proposals are listed here.
+
+
+

Added: mahout/site/mahout_cms/content/users/basics/collections.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/collections.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/collections.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/collections.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,90 @@
+Title: Collections
+TODO: Organize these somehow, add one-line blurbs
+Organize by usage? (classification, recommendation etc.)
+
+<a name="Collections-CollectionsofCollections"></a>
+## Collections of Collections
+
+- [ML Data](http://mldata.org/about/)
+ ... repository supported by Pascal 2.
+- [DBPedia](http://wiki.dbpedia.org/Downloads30)
+- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/)
+- [http://mloss.org/community/blog/2008/sep/19/data-sources/](http://mloss.org/community/blog/2008/sep/19/data-sources/)
+- [Linked Library Data](http://ckan.net/group/lld)
+ via CKAN
+- [InfoChimps](http://infochimps.com/)
+ Free and purchasable datasets
+- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle](http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle)
+ LinkedIn discussion of lots of data sets
+
+<a name="Collections-CategorizationData"></a>
+## Categorization Data
+
+- [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/)
+- [RCV1 data set](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)
+- [10 years of CLEF Data](http://direct.dei.unipd.it/)
+- [http://ece.ut.ac.ir/DBRG/Hamshahri/](http://ece.ut.ac.ir/DBRG/Hamshahri/)
+ (Approximately 160k categorized docs)
+There is a newer beta version here: [http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/](http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/)
+ (Approximately 320k categorized docs)
+- Lending Club loan data [https://www.lendingclub.com/info/download-data.action](https://www.lendingclub.com/info/download-data.action)
+
+<a name="Collections-RecommendationData"></a>
+## Recommendation Data
+
+- [Netflix Prize/Dataset](http://www.netflixprize.com/download)
+- [Book usage and recommendation data from the University of Huddersfield](http://library.hud.ac.uk/data/usagedata/)
+- [Last.fm](http://denoiserthebetter.posterous.com/music-recommendation-datasets)
+ \- Non-commercial use only
+- [Amazon Product Review Data via Jindal and Liu](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
+ -- Scroll down
+- [GroupLens/MovieLens Movie Review Dataset](http://www.grouplens.org/node/73)
+
+<a name="Collections-MultilingualData"></a>
+## Multilingual Data
+
+- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php](http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php)
+ \- 308,000 subtitle files covering about 18,900 movies in 59 languages
+(July 2006 numbers). This is a curated collection of subtitles from the
+aggregation site http://www.openSubTitles.org. The original site,
+OpenSubtitles.org, is up to 1.6m subtitle files.
+- [Statistical Machine Translation](http://www.statmt.org/)
+ \- devoted to all things language translation. Includes multilingual
+corpora of European and Canadian legal tomes.
+
+<a name="Collections-Geospatial"></a>
+## Geospatial
+
+- [Natural Earth Data](http://www.naturalearthdata.com/)
+- [Open Street Maps](http://wiki.openstreetmap.org/wiki/Main_Page)
+And other crowd-sourced mapping data sites.
+
+<a name="Collections-Airline"></a>
+## Airline
+
+- [Open Flights](http://openflights.org/)
+ \- Crowd-sourced database of airlines, flights, airports, times, etc.
+- [Airline on-time information - 1987-2008](http://stat-computing.org/dataexpo/2009/)
+ \- 120m CSV records, 12G uncompressed
+
+<a name="Collections-GeneralResources"></a>
+## General Resources
+
+- [theinfo](http://theinfo.org/)
+- [WordNet](http://wordnet.princeton.edu/obtain)
+- [Common Crawl](http://www.commoncrawl.org/)
+ \- freely available web crawl on EC2
+
+<a name="Collections-Stuff"></a>
+## Stuff
+
+- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html](http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html)
+- [4 Universities Data Set](http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)
+- [Large crawl of Twitter](http://an.kaist.ac.kr/traces/WWW2010.html)
+- [UniProt](http://beta.uniprot.org/)
+- [http://www.icwsm.org/2009/data/](http://www.icwsm.org/2009/data/)
+- [http://data.gov](http://data.gov)
+- [http://www.ckan.net/](http://www.ckan.net/)
+- [http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world](http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world)
+- [http://data.gov.uk/](http://data.gov.uk/)
+- [51,000 US Congressional Bills tagged](http://www.ark.cs.cmu.edu/bills/)

Added: mahout/site/mahout_cms/content/users/basics/collocations.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/collocations.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/collocations.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/collocations.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,390 @@
+Title: Collocations
+<a name="Collocations-CollocationsinMahout"></a>
+# Collocations in Mahout
+
+A collocation is defined as a sequence of words or terms which co-occur
+more often than would be expected by chance. Statistically relevant
+combinations of terms identify additional lexical units which can be
+treated as features in a vector-based representation of a text. A detailed
+discussion of collocations can be found on Wikipedia [1](http://en.wikipedia.org/wiki/Collocation).
+
+
+<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a>
+## Log-Likelihood based Collocation Identification
+
+Mahout provides an implementation of a collocation identification algorithm
+which scores collocations using the log-likelihood ratio (LLR). The
+log-likelihood score indicates the relative usefulness of a collocation
+with regard to other term combinations in the text. Collocations with the
+highest scores in a particular corpus will generally be more useful as
+features.
+
+Calculating the LLR is very straightforward and is described concisely in
+Ted Dunning's blog post [2](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html).
+Ted describes the series of counts required to calculate the LLR for two
+events A and B in order to determine if they co-occur more often than would
+be expected by pure chance. These counts include the number of times the
+events co-occur (k11), the number of times each occurs without the other
+(k12 and k21), and the number of times neither occurs (k22). These counts
+are summarized in the following table:
+
+<table>
+<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr>
+<tr><td> Event B </td><td> A and B together (k11) </td><td>  B but not A (k12) </td></tr>
+<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither B nor A (k22) </td></tr>
+</table>
+
+For the purposes of collocation identification, it is useful to begin by
+thinking in terms of word pairs, i.e. bigrams. In this case the leading or
+head term of the pair corresponds to A from the table above and the
+trailing or tail term corresponds to B, while "neither B nor A" is the
+total number of word pairs in the corpus less those containing A, B or both.
+
+Given the word pair 'oscillation overthruster', the log-likelihood ratio
+is computed by looking at the number of occurrences of that word pair in
+the corpus, the number of word pairs that begin with 'oscillation' but end
+with something other than 'overthruster', the number of word pairs that end
+with 'overthruster' but begin with something other than 'oscillation', and
+the number of word pairs in the corpus that contain neither 'oscillation'
+nor 'overthruster'.
+
+This can be extended from bigrams to trigrams, 4-grams and beyond. In these
+cases, the current algorithm uses the first token of the ngram as the head
+and the remaining n-1 tokens of the ngram (the (n-1)-gram, as it were) as
+the tail. Given the trigram 'hong kong cavaliers', 'hong' is treated as the
+head while 'kong cavaliers' is treated as the tail. Future versions of this
+algorithm will allow for variations in which tokens of the ngram are
+treated as the head and tail.
+
+Beyond ngrams, it is often useful to inspect cases where individual words
+occur around other interesting features of the text such as sentence
+boundaries.
+
+<a name="Collocations-GeneratingNGrams"></a>
+## Generating NGrams
+
+The tools that the collocation identification algorithm is embedded within
+either consume tokenized text as input or allow an implementation of the
+Lucene Analyzer class to be specified to perform the tokenization needed to
+form ngrams. The tokens are passed through a Lucene ShingleFilter to
+produce NGrams of the desired length.
+
+Given the text "Alice was beginning to get very tired" as an example,
+Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get',
+'very' and 'tired', while the ShingleFilter with a max NGram size set to 3
+produces the shingles 'alice beginning', 'alice beginning get', 'beginning
+get', 'beginning get very', 'get very', 'get very tired' and 'very tired'.
+Note that both bigrams and trigrams are produced here. A future enhancement
+to the existing algorithm would involve limiting the output to a particular
+gram size as opposed to solely specifying a max ngram size.
+
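+As a rough illustration of what happens under the hood (this is not the
+Mahout code itself, and the Lucene Version constant below is an assumption
+that should match whatever Lucene release your Mahout build bundles), the
+analyzer/ShingleFilter combination can be exercised directly:
+
+    import java.io.StringReader;
+    
+    import org.apache.lucene.analysis.Analyzer;
+    import org.apache.lucene.analysis.TokenStream;
+    import org.apache.lucene.analysis.shingle.ShingleFilter;
+    import org.apache.lucene.analysis.standard.StandardAnalyzer;
+    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+    import org.apache.lucene.util.Version;
+    
+    public class ShingleDemo {
+      public static void main(String[] args) throws Exception {
+        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);  // assumed Lucene version
+        TokenStream tokens = analyzer.tokenStream("text",
+            new StringReader("Alice was beginning to get very tired"));
+        ShingleFilter shingles = new ShingleFilter(tokens, 3);  // max ngram size = 3
+        shingles.setOutputUnigrams(false);                      // emit only the shingles
+        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
+        shingles.reset();
+        while (shingles.incrementToken()) {
+          // prints the generated shingles; depending on the analyzer's stop-word
+          // handling you may also see '_' filler tokens in place of removed words
+          System.out.println(term.toString());
+        }
+        shingles.end();
+        shingles.close();
+      }
+    }
+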
+<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a>
+## Running the Collocation Identification Algorithm.
+
+There are a couple of ways to run the LLR-based collocation algorithm in
+Mahout:
+
+<a name="Collocations-Whencreatingvectorsfromasequencefile"></a>
+### When creating vectors from a sequence file
+
+The LLR collocation identifier is integrated into the process that is used
+to create vectors from sequence files of text keys and values. Collocations
+are generated when the --maxNGramSize (-ng) option is set to 2 or greater
+(it defaults to 2 when not specified). The --minLLR option can be used to
+control the cutoff that prevents collocations below the specified LLR score
+from being emitted, and the --minSupport argument can be used to filter out
+collocations that appear fewer than a certain number of times.
+
+
+    bin/mahout seq2sparse
+    
+    Usage:
+     [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
+      <chunkSize> --output <output> --input <input> --minDF <minDF>
+      --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
+      --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
+      --overwrite --help --sequentialAccessVector]
+    Options
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
+      --output (-o) output                The output directory
+      --input (-i) input                  Input dir containing the documents in
+                                          sequence file format
+      --minDF (-md) minDF                 The minimum document frequency. Default is 1
+      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
+                                          Can be used to remove really high
+                                          frequency terms. Expressed as an integer
+                                          between 0 and 100. Default is 99.
+      --weight (-wt) weight               The kind of weight to use. Currently TF
+                                          or TFIDF
+      --norm (-n) norm                    The norm to use, expressed as either a
+                                          float or "INF" if you want to use the
+                                          Infinite norm. Must be greater or equal
+                                          to 0. The default is not to normalize
+      --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
+                                          Ratio (Float). Default is 1.0
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc).
+                                          Default Value: 2
+      --overwrite (-w)                    If set, overwrite the output directory
+      --help (-h)                         Print out help
+      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
+                                          be SequentialAccessVectors. If set true,
+                                          else false
+
+
+<a name="Collocations-CollocDriver"></a>
+### CollocDriver
+
+*TODO*
+
+
+    bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
+    
+    Usage:
+     [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite
+      --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers>
+      --analyzerName <analyzerName> --preprocess --unigram --help]
+    Options
+      --input (-i) input                  The Path for input files.
+      --output (-o) output                The Path to write output to
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc).
+                                          Default Value: 2
+      --overwrite (-w)                    If set, overwrite the output directory
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
+      --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
+                                          Ratio (Float). Default is 1.0
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+      --preprocess (-p)                   If set, input is SequenceFile<Text,Text>
+                                          where the value is the document, which
+                                          will be tokenized using the specified
+                                          analyzer.
+      --unigram (-u)                      If set, unigrams will be emitted in the
+                                          final output alongside collocations
+      --help (-h)                         Print out help
+
+
+<a name="Collocations-Algorithmdetails"></a>
+## Algorithm details
+
+This section describes the implementation of the collocation identification
+algorithm in terms of the map-reduce phases that are used to generate
+ngrams and count the frequencies required to perform the log-likelihood
+calculation. Unless otherwise noted, classes that are indicated in
+CamelCase can be found in the mahout-utils module under the package
+org.apache.mahout.utils.nlp.collocations.llr.
+
+The algorithm is implemented in two map-reduce passes:
+
+<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a>
+### Pass 1: CollocDriver.generateCollocations(...)
+
+Generates NGrams and counts frequencies for ngrams, head and tail subgrams.
+
+<a name="Collocations-Map:CollocMapper"></a>
+#### Map: CollocMapper
+
+Input k: Text (documentId), v: StringTuple (tokens) 
+
+Each call to the mapper passes in the full set of tokens for the
+corresponding document using a StringTuple. The ShingleFilter is run across
+these tokens to produce ngrams of the desired length. ngrams and
+frequencies are collected across the entire document.
+
+Once this is done, ngrams are split into head and tail portions. A key of
+type GramKey is generated which is used later to join ngrams with their
+heads and tails in the reducer phase. The GramKey is a composite key made
+up of a string n-gram fragment as the primary key and a secondary key used
+for grouping and sorting in the reduce phase. The secondary key will either
+be EMPTY in the case where we are collecting the head or tail of an ngram
+as the value, or it will contain the byte[] form of the ngram when
+collecting an ngram as the value.
+
+
+    head_key(EMPTY) -> (head subgram, head frequency)
+    head_key(ngram) -> (ngram, ngram frequency) 
+    tail_key(EMPTY) -> (tail subgram, tail frequency)
+    tail_key(ngram) -> (ngram, ngram frequency)
+
+
+subgram and ngram values are packaged in Gram objects.
+
+For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When
+the pass is complete, this counter will hold the total number of ngrams
+encountered in the input which is used as a part of the LLR calculation.
+
+Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with
+frequency)
+
+<a name="Collocations-Combiner:CollocCombiner"></a>
+#### Combiner: CollocCombiner
+
+Input k: GramKey, v:Gram (as above)
+
+This phase merges the counts for unique ngrams or ngram fragments across
+multiple documents. The combiner treats the entire GramKey as the key and
+as such, identical tuples from separate documents are passed into a single
+call to the combiner's reduce method, their frequencies are summed and a
+single tuple is passed out via the collector.
+
+Output k: GramKey, v:Gram
+
+<a name="Collocations-Reduce:CollocReducer"></a>
+#### Reduce: CollocReducer
+
+Input k: GramKey, v: Gram (as above)
+
+The CollocReducer employs the Hadoop secondary sort strategy to avoid
+caching ngram tuples in memory in order to calculate total ngram and
+subgram frequencies. The GramKeyPartitioner ensures that tuples with the
+same primary key are sent to the same reducer while the
+GramKeyGroupComparator ensures that the iterator provided by the reduce
+method first returns the subgram and then returns the ngram values grouped
+by ngram. This eliminates the need to cache the values returned by the
+iterator in order to calculate total frequencies for both subgrams and
+ngrams. The input will consist of multiple frequencies for each
+(subgram_key, subgram) or (subgram_key, ngram) tuple: one from each map
+task in which the particular subgram was found.
+The input will be traversed in the following order:
+
+
+    (head subgram, frequency 1)
+    (head subgram, frequency 2)
+    ... 
+    (head subgram, frequency N)
+    (ngram 1, frequency 1)
+    (ngram 1, frequency 2)
+    ...
+    (ngram 1, frequency N)
+    (ngram 2, frequency 1)
+    (ngram 2, frequency 2)
+    ...
+    (ngram 2, frequency N)
+    ...
+    (ngram N, frequency 1)
+    (ngram N, frequency 2)
+    ...
+    (ngram N, frequency N)
+
+
+Where all of the ngrams above share the same head. Data is presented in the
+same manner for the tail subgrams.
+
+As the values for a subgram or ngram are traversed, frequencies are
+accumulated. Once all values for a subgram or ngram are processed the
+resulting key/value pairs are passed to the collector as long as the ngram
+frequency is equal to or greater than the specified minSupport. When an
+ngram is skipped in this way the Skipped.LESS_THAN_MIN_SUPPORT counter is
+incremented.
+
+Pairs are passed to the collector in the following format:
+
+
+    ngram, ngram frequency -> subgram, subgram frequency
+
+
+In this manner, the output becomes an unsorted version of the following:
+
+
+    ngram 1, frequency -> ngram 1 head, head frequency
+    ngram 1, frequency -> ngram 1 tail, tail frequency
+    ngram 2, frequency -> ngram 2 head, head frequency
+    ngram 2, frequency -> ngram 2 tail, tail frequency
+    ngram N, frequency -> ngram N head, head frequency
+    ngram N, frequency -> ngram N tail, tail frequency
+
+
+Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
+frequency)
+
+<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a>
+### Pass 2: CollocDriver.computeNGramsPruneByLLR(...)
+
+Pass 1 calculated full frequencies for ngrams and subgrams; Pass 2
+performs the LLR calculation.
+
+<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a>
+#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)
+
+This phase is a no-op. The data is passed through unchanged. The rest of
+the work for llr calculation is done in the reduce phase.
+
+<a name="Collocations-ReducePhase:LLRReducer"></a>
+#### Reduce Phase: LLRReducer
+
+Input is k:Gram, v:Gram (as above)
+
+This phase receives the head and tail subgrams and their frequencies for
+each ngram (with frequency) produced for the input:
+
+
+    ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
+    ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency
+    ...
+    ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency
+
+
+It also reads the full ngram count obtained from the first pass, passed in
+as a configuration option. The parameters to the llr calculation are
+calculated as follows:
+
+k11 = f_n
+k12 = f_h - f_n
+k21 = f_t - f_n
+k22 = N - ((f_h + f_t) - f_n)
+
+Where f_n is the ngram frequency, f_h and f_t the frequency of head and
+tail and N is the total number of ngrams.
+
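+The same calculation is available as a library call in mahout-math
+(org.apache.mahout.math.stats.LogLikelihood). A small standalone sketch,
+using made-up counts purely for illustration:
+
+    import org.apache.mahout.math.stats.LogLikelihood;
+    
+    public class LlrDemo {
+      public static void main(String[] args) {
+        // hypothetical counts for a single ngram
+        long ngramFrequency = 110;       // f_n
+        long headFrequency  = 2500;      // f_h
+        long tailFrequency  = 900;       // f_t
+        long totalNGrams    = 2000000;   // N
+    
+        long k11 = ngramFrequency;
+        long k12 = headFrequency - ngramFrequency;
+        long k21 = tailFrequency - ngramFrequency;
+        long k22 = totalNGrams - ((headFrequency + tailFrequency) - ngramFrequency);
+    
+        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
+        System.out.println("LLR = " + llr);  // this score is compared against --minLLR
+      }
+    }
+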
+Ngrams with an LLR below the specified minimum LLR are dropped and
+the Skipped.LESS_THAN_MIN_LLR counter is incremented.
+
+Output is k: Text (ngram), v: DoubleWritable (llr score)
+
+<a name="Collocations-Unigrampass-through."></a>
+### Unigram pass-through.
+
+By default in seq2sparse, or if the -u option is provided to the
+CollocDriver, unigrams (single tokens) will be passed through the job and
+each token's frequency will be calculated. As with ngrams, unigrams are
+subject to filtering with minSupport and minLLR.
+
+<a name="Collocations-References"></a>
+## References
+
+[1] http://en.wikipedia.org/wiki/Collocation
+[2] http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
+
+
+<a name="Collocations-Discussion"></a>
+## Discussion
+
+* http://comments.gmane.org/gmane.comp.apache.mahout.user/5685 - Reuters
+example

Added: mahout/site/mahout_cms/content/users/basics/creating-vectors-from-text.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/creating-vectors-from-text.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/creating-vectors-from-text.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/creating-vectors-from-text.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,162 @@
+Title: Creating Vectors from Text
+*Mahout_0.2*
+
+<a name="CreatingVectorsfromText-Introduction"></a>
+# Introduction
+
+For clustering documents it is usually necessary to convert the raw text
+into vectors that can then be consumed by the clustering [Algorithms](algorithms.html)
+.  These approaches are described below.
+
+<a name="CreatingVectorsfromText-FromLucene"></a>
+# From Lucene
+
+*NOTE: Your Lucene index must be created with the same version of Lucene
+used in Mahout.  Check Mahout's POM file to get the version number,
+otherwise you will likely get "Exception in thread "main"
+org.apache.lucene.index.CorruptIndexException: Unknown format version: -11"
+as an error.*
+
+Mahout has utilities that allow one to easily produce Mahout Vector
+representations from a Lucene (and Solr, since they are the same) index.
+
+For this, we assume you know how to build a Lucene/Solr index.	For those
+who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr)
+ as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
+index. For those wanting to use just Lucene, see the Lucene [website](http://lucene.apache.org/java)
+ or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
+McCandless.
+
+To get started, make sure you get a fresh copy of Mahout from [SVN](http://cwiki.apache.org/MAHOUT/buildingmahout.html)
+ and are comfortable building it. The utility code defines interfaces and
+implementations for efficiently iterating over a data source (it only
+supports Lucene currently, but should be extensible to databases, Solr,
+etc.) and produces a Mahout Vector file and term dictionary which can then
+be used for clustering. The main code for driving this is the Driver
+program located in the org.apache.mahout.utils.vectors package. The Driver
+program offers several input options, which can be displayed by specifying
+the --help option. Examples of running the Driver are included below:
+
+<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
+## Generating an output file from a Lucene Index
+
+
+    $MAHOUT_HOME/bin/mahout lucene.vector <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
+       --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> \
+       --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
+       <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> \
+       <--idField <Name of the idField in the Lucene index>>
+
+
+<a name="CreatingVectorsfromText-Create50VectorsfromanIndex"></a>
+### Create 50 Vectors from an Index 
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index \
+        --field body --dictOut <PATH>/solr/wikipedia/dict.txt \
+        --output <PATH>/solr/wikipedia/out.txt --max 50
+
+This uses the index specified by --dir and the body field in it and writes
+out the info to the output dir and the dictionary to dict.txt.	It only
+outputs 50 vectors.  If you don't specify --max, then all the documents in
+the index are output.
+
+<a name="CreatingVectorsfromText-Normalize50VectorsfromaLuceneIndexusingthe[L_2Norm](http://en.wikipedia.org/wiki/Lp_space)"></a>
+### Normalize 50 Vectors from a Lucene Index using the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
+
+    $MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index \
+          --field body --dictOut <PATH>/solr/wikipedia/dict.txt \
+          --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
+
+
+<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
+# From Directory of Text documents
+Mahout has utilities to generate Vectors from a directory of text
+documents. Before creating the vectors, you need to convert the documents
+to SequenceFile format. SequenceFile is a Hadoop class which allows us to
+write arbitrary key/value pairs into it. The DocumentVectorizer requires
+the key to be a Text with a unique document id, and the value to be the
+Text content in UTF-8 format.
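+
+If you prefer to create that SequenceFile yourself rather than with the
+seqdirectory tool described below, a minimal sketch (the paths and document
+contents here are made up) could look like this:
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.FileSystem;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.SequenceFile;
+    import org.apache.hadoop.io.Text;
+    
+    public class WriteDocs {
+      public static void main(String[] args) throws Exception {
+        Configuration conf = new Configuration();
+        FileSystem fs = FileSystem.get(conf);
+        Path out = new Path("documents/chunk-0");   // hypothetical output path
+        SequenceFile.Writer writer =
+            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
+        try {
+          // key = unique document id, value = UTF-8 document text
+          writer.append(new Text("/docs/doc1.txt"), new Text("Alice was beginning to get very tired"));
+          writer.append(new Text("/docs/doc2.txt"), new Text("of sitting by her sister on the bank"));
+        } finally {
+          writer.close();
+        }
+      }
+    }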
+
+You may find Tika (http://lucene.apache.org/tika) helpful in converting
+binary documents to text.
+
+<a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a>
+## Converting directory of documents to SequenceFile format
+Mahout has a nifty utility which reads a directory path, including its
+sub-directories, and creates the SequenceFile in a chunked manner for us.
+The document id generated is <PREFIX><RELATIVE PATH FROM
+PARENT>/document.txt
+
+From the examples directory run
+
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+    --input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
+    <-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
+    <-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
+    <-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
+
+
+<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
+## Creating Vectors from SequenceFile
+
+*Mahout_0.3*
+
+From the sequence file generated from the above step run the following to
+generate vectors. 
+
+    $MAHOUT_HOME/bin/mahout seq2sparse \
+    -i <PATH TO THE SEQUENCEFILES> \
+    -o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY IS GENERATED> \
+    <-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
+    <-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
+    <-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
+    <--minSupport <MINIMUM SUPPORT> 2> \
+    <--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
+    <--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
+    <--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \
+    <-seq <Create SequentialAccessVectors> {false|true required for running some algorithms (LDA, Lanczos)}>
+
+
+--minSupport is the minimum frequency a word must have to be considered as
+a feature. --minDF is the minimum number of documents the word needs to
+appear in. --maxDFPercent is the maximum value of the expression (document
+frequency of a word / total number of documents) for the word to be
+considered a good feature; this helps remove high-frequency features like
+stop words.
+
+<a name="CreatingVectorsfromText-Background"></a>
+# Background
+
+* http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
+* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
+
+<a name="CreatingVectorsfromText-FromaDatabase"></a>
+# From a Database
+
+*TODO:*
+
+<a name="CreatingVectorsfromText-Other"></a>
+# Other
+
+<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
+## Converting existing vectors to Mahout's format
+
+If you are in the happy position of already owning a document processing
+pipeline (for texts, images or whatever items you wish to treat), the
+question arises of how to convert its vectors into the Mahout vector
+format. Probably the easiest way to go is to implement your own
+Iterable<Vector> (called VectorIterable in the example below) and then
+reuse the existing VectorWriter classes:
+
+
+    // SequenceFile.createWriter returns a Hadoop SequenceFile.Writer; wrap it in
+    // Mahout's SequenceFileVectorWriter (org.apache.mahout.utils.vectors.io) to
+    // obtain a VectorWriter. Class names may differ slightly between Mahout versions.
+    SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem,
+        configuration, outfile, LongWritable.class, VectorWritable.class);
+    VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
+    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+
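+Such a VectorIterable is not part of Mahout itself; a hypothetical sketch,
+assuming your existing pipeline hands out dense double[] feature arrays
+(adapt the constructor and vector type to your own data), might look like:
+
+    import java.util.Iterator;
+    
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+    
+    // Adapts an existing pipeline's double[] feature arrays to Mahout Vectors.
+    public class VectorIterable implements Iterable<Vector> {
+    
+      private final Iterable<double[]> features;
+    
+      public VectorIterable(Iterable<double[]> features) {
+        this.features = features;
+      }
+    
+      @Override
+      public Iterator<Vector> iterator() {
+        final Iterator<double[]> delegate = features.iterator();
+        return new Iterator<Vector>() {
+          @Override public boolean hasNext() { return delegate.hasNext(); }
+          @Override public Vector next()     { return new DenseVector(delegate.next()); }
+          @Override public void remove()     { throw new UnsupportedOperationException(); }
+        };
+      }
+    }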

Added: mahout/site/mahout_cms/content/users/basics/creating-vectors.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/creating-vectors.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/creating-vectors.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/creating-vectors.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,6 @@
+Title: Creating Vectors
+<a name="CreatingVectors-UtilitiesforCreatingVectors"></a>
+# Utilities for Creating Vectors
+
+1. [Text](creating-vectors-from-text.html)
+1. [ARFF](creating-vectors-from-weka's-arff-format.html)

Added: mahout/site/mahout_cms/content/users/basics/dimensional-reduction.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/dimensional-reduction.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/dimensional-reduction.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/dimensional-reduction.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,457 @@
+Title: Dimensional Reduction
+Matrix algebra underpins the way many Big Data algorithms and data
+structures are composed: full-text search can be viewed as doing matrix
+multiplication of the term-document matrix by the query vector (giving a
+vector over documents where the components are the relevance scores);
+computing co-occurrences in a collaborative filtering context (people who
+viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest)
+amounts to squaring the user-item interaction matrix; users who are
+k degrees separated from each other in a social network or web graph can be
+found by looking at the k-fold product of the graph adjacency matrix; and
+the list goes on (and these are all cases where the linear structure of the
+matrix is preserved!)
+
+Each of these examples deals with cases of matrices which tend to be
+tremendously large (often millions to tens of millions to hundreds of
+millions of rows or more, by sometimes a comparable number of columns), but
+also rather sparse. Sparse matrices are nice in some respects: dense
+matrices which are 10^7 on a side would have 100 trillion non-zero entries!
+But the sparsity is often problematic, because any given two rows (or
+columns) of the matrix may have zero overlap. Additionally, any
+machine-learning work done on the data which comprises the rows has to deal
+with what is known as "the curse of dimensionality": for example, there
+are too many columns to train most regression or classification problems on
+them independently.
+
+One of the more useful approaches to dealing with such huge sparse data
+sets is the concept of dimensionality reduction, where a lower dimensional
+space of the original column (feature) space of your data is found /
+constructed, and your rows are mapped into that subspace (or sub-manifold).
+ In this reduced-dimensional space, "important" components of the distance
+between points are exaggerated and unimportant ones washed away;
+additionally, the sparsity of your rows is traded for drastically
+lower-dimensional, but dense, "signatures". While this loss of sparsity can lead
+to its own complications, a proper dimensionality reduction can help reveal
+the most important features of your data, expose correlations among your
+supposedly independent original variables, and smooth over the zeroes in
+your correlation matrix.
+
+One of the most straightforward techniques for dimensionality reduction is
+the matrix decomposition: singular value decomposition, eigen
+decomposition, non-negative matrix factorization, etc. In their truncated
+form these decompositions are an excellent first approach toward linearity
+preserving unsupervised feature selection and dimensional reduction. Of
+course, sparse matrices which don't fit in RAM need special treatment as
+far as decomposition is concerned. Parallelizable and/or stream-oriented
+algorithms are needed.
+
+<a name="DimensionalReduction-SingularValueDecomposition"></a>
+# Singular Value Decomposition
+
+Currently implemented in Mahout (as of 0.3, the first release with MAHOUT-180 applied)
+are two scalable implementations of SVD: a stream-oriented implementation using the
+Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn
+Webb's paper ([Gorrell and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf)),
+and a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm) implementation,
+available both single-threaded in the o.a.m.math.decomposer.lanczos package (math
+module) and as a hadoop map-reduce (series of) job(s) in the
+o.a.m.math.hadoop.decomposer package (core module).
+Coming soon: stochastic decomposition.
+
+See also:
+
+* https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
+
+<a name="DimensionalReduction-Lanczos"></a>
+## Lanczos
+
+The Lanczos algorithm is designed for eigen-decomposition, but like any
+such algorithm, getting singular vectors out of it is immediate (singular
+vectors of matrix A are just the eigenvectors of A^t * A or A * A^t). 
+Lanczos works by taking a starting seed vector *v* (with cardinality equal
+to the number of columns of the matrix A), and repeatedly multiplying A by
+the result: *v'* = A.times(*v*) (and then subtracting off what is
+proportional to previous *v'*'s, and building up an auxiliary matrix of
+projections).  In the case where A is not square (in general: not
+symmetric), then you actually want to repeatedly multiply A*A^t by *v*:
+*v'* = (A * A^t).times(*v*), or equivalently, in Mahout,
+A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
+the order of summation in A*A^t.times(*v*), you can do the same computation
+as one pass over the rows of A instead of two).
+
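+The timesSquared trick is easiest to see in simplified, non-distributed
+code. This is only a sketch of the idea, not the actual Mahout
+implementation:
+
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+    import org.apache.mahout.math.function.Functions;
+    
+    public final class TimesSquaredSketch {
+    
+      // Computes (A^t * A) * v in a single pass over the rows of A:
+      // each row r contributes (r . v) * r to the result.
+      public static Vector timesSquared(Iterable<Vector> rowsOfA, Vector v) {
+        Vector result = new DenseVector(v.size());
+        for (Vector row : rowsOfA) {
+          double d = row.dot(v);
+          result.assign(row.times(d), Functions.PLUS);
+        }
+        return result;
+      }
+    }
+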
+After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k*
+tridiagonal matrix has been created (the auxiliary matrix mentioned above),
+out of which a good (often extremely good) approximation to *k* of the
+singular values (and with the basis spanned by the *v_i*, the *k* singular
+*vectors* may also be extracted) of A may be efficiently extracted.  Which
+*k*?  It's actually a spread across the entire spectrum: the first few will
+most certainly be the largest singular values, and the bottom few will be
+the smallest, but you have no guarantee that just because you have the n'th
+largest singular value of A, that you also have the (n-1)'st as well.  A
+good rule of thumb is to try and extract out the top 3k singular vectors
+via Lanczos, and then discard the bottom two thirds, if you want primarily
+the largest singular values (which is the case for using Lanczos for
+dimensional reduction).
+
+<a name="DimensionalReduction-ParallelizationStragegy"></a>
+### Parallelization Strategy
+
+Lanczos is "embarassingly parallelizable": matrix multiplication of a
+matrix by a vector may be carried out row-at-a-time without communication
+until at the end, the results of the intermediate matrix-by-vector outputs
+are accumulated on one final vector.  When it's truly A.times(*v*), the
+final accumulation doesn't even have collision / synchronization issues
+(the outputs are individual separate entries on a single vector), and
+multicore approaches can be very fast, and there should also be tricks to
+speed things up on Hadoop.  In the asymmetric case, where the operation is
+A.timesSquared(*v*), the accumulation does require synchronization (the
+vectors to be summed have nonzero elements all across their range), but
+delaying writing to disk until Mapper close(), and remembering that having
+a Combiner be the same as the Reducer, the bottleneck in accumulation is
+nowhere near a single point.
+
+<a name="DimensionalReduction-Mahoutusage"></a>
+### Mahout usage
+
+The Mahout DistributedLanczosSolver is invoked by the
+<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
+arguments (which can be reproduced by just entering the command with no
+arguments):
+
+
+    Job-Specific Options:
+      --input (-i) input                        Path to job input directory.
+      --output (-o) output                      The directory pathname for output.
+      --numRows (-nr) numRows                   Number of rows of the input matrix
+      --numCols (-nc) numCols                   Number of columns of the input matrix
+      --rank (-r) rank                          Desired decomposition rank (note:
+                                                only roughly 1/4 to 1/3 of these will
+                                                have the top portion of the spectrum)
+      --symmetric (-sym) symmetric              Is the input matrix square and
+                                                symmetric?
+      --cleansvd (-cl) cleansvd                 Run the EigenVerificationJob to clean
+                                                the eigenvectors after SVD
+      --maxError (-err) maxError                Maximum acceptable error
+      --minEigenvalue (-mev) minEigenvalue      Minimum eigenvalue to keep the vector
+                                                for
+      --inMemory (-mem) inMemory                Buffer eigen matrix into memory (if
+                                                you have enough!)
+      --help (-h)                               Print out help
+      --tempDir tempDir                         Intermediate output directory
+      --startPhase startPhase                   First phase to run
+      --endPhase endPhase                       Last phase to run
+
+
+The short form invocation may be used to perform the SVD on the input data: 
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \   
+      --output (-o) <The directory pathname for output> \	
+      --numRows (-nr) <Number of rows of the input matrix> \   
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric>    
+
+
+The --input argument is the location on HDFS of the
+SequenceFile<Writable,VectorWritable> (preferably containing
+SequentialAccessSparseVector instances) which you wish to decompose.
+Each of its vectors has --numCols entries. --numRows is the number of
+input rows and is used to properly size the matrix data structures.
+
+After execution, the --output directory will have a file named
+"rawEigenvectors" containing the raw eigenvectors. As the
+DistributedLanczosSolver sometimes produces "extra" eigenvectors whose
+eigenvalues aren't valid, and also scales all of the eigenvalues down by
+the max eigenvalue (to avoid floating point overflow), there is an
+additional step which emits the correctly scaled (and non-spurious)
+eigenvector/value pairs. This is done by the "cleansvd" shell script step
+(cf. EigenVerificationJob).
+
+If you have run the short form svd invocation above and require this
+"cleaning" of the eigen/singular output, you can run "cleansvd" as a
+separate command:
+
+      <MAHOUT_HOME>/bin/mahout cleansvd \
+      --eigenInput <path to raw eigenvectors> \
+      --corpusInput <path to corpus> \
+      --output <path to output directory> \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default false>
+
+
+The --corpusInput is the input path from the previous step, --eigenInput is
+the output from the previous step (<output>/rawEigenvectors), and --output
+is the desired output path (same as svd argument). The two "cleaning"
+params are --maxError - the maximum allowed 1-cosAngle(v,
+A.timesSquared(v)), and --minEigenvalue.  Eigenvectors which have too large
+error, or too small eigenvalue are discarded.  Optional argument:
+--inMemory, if you have enough memory on your local machine (not on the
+hadoop cluster nodes!) to load all eigenvectors into memory at once (at
+least 8 bytes/double * rank * numCols), then you will see some speedups on
+this cleaning process.
+
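+As a sketch of what that error means (this is not the EigenVerificationJob's
+exact code): for a candidate eigenvector v, compare v against
+A.timesSquared(v) and measure how far the two are from being parallel:
+
+    import org.apache.mahout.math.Vector;
+    
+    public final class EigenErrorSketch {
+    
+      // error = 1 - cos(angle between v and (A^t * A) v); small values mean v is
+      // (close to) an eigenvector. This is the quantity compared against --maxError.
+      public static double eigenError(Vector v, Vector aTimesSquaredV) {
+        double cosAngle = v.dot(aTimesSquaredV) / (v.norm(2) * aTimesSquaredV.norm(2));
+        return 1.0 - cosAngle;
+      }
+    }
+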
+After execution, the --output directory will have a file named
+"cleanEigenvectors" containing the clean eigenvectors. 
+
+These two steps can also be invoked together by the svd command by using
+the long form svd invocation:
+
+      <MAHOUT_HOME>/bin/mahout svd \
+      --input (-i) <Path to input matrix> \
+      --output (-o) <The directory pathname for output> \
+      --numRows (-nr) <Number of rows of the input matrix> \
+      --numCols (-nc) <Number of columns of the input matrix> \
+      --rank (-r) <Desired decomposition rank> \
+      --symmetric (-sym) <Is the input matrix square and symmetric> \
+      --cleansvd "true" \
+      --maxError <maximum allowed error. Default is 0.5> \
+      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
+      --inMemory <true if the eigenvectors can all fit into memory. Default false>
+
+
+After execution, the --output directory will contain two files: the
+"rawEigenvectors" and the "cleanEigenvectors".
+
+TODO: also allow exclusion based on improper orthogonality (currently
+computed, but not checked against constraints).
+
+<a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a>
+#### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce
+
+This section walks you through a complete example of running the Mahout SVD
+job on Amazon Elastic MapReduce cluster and then preparing the output to be
+used for clustering. This example was developed as part of the effort to
+benchmark Mahout's clustering algorithms using a large document set (see [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)
+). Specifically, we use the ASF mail archives located at
+http://aws.amazon.com/datasets/7791434387204566.  You will need to likely
+run seq2sparse on these first.	See
+$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of
+processing this data.
+
+At a high level, the steps we're going to perform are:
+
+    bin/mahout svd (original -> svdOut)
+    bin/mahout cleansvd ...
+    bin/mahout transpose svdOut -> svdT
+    bin/mahout transpose original -> originalT
+    bin/mahout matrixmult originalT svdT -> newMatrix
+    bin/mahout kmeans newMatrix
+
+The bulk of the content for this section was extracted from the Mahout user
+mailing list, see: [Using SVD with Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8)
+ and [Need a little help with using SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928)
+
+Note: Some of this work is due in part to credits donated by the Amazon
+Elastic MapReduce team.
+
+<a name="DimensionalReduction-1.LaunchEMRCluster"></a>
+##### 1. Launch EMR Cluster
+
+For a detailed explanation of the steps involved in launching an Amazon
+Elastic MapReduce cluster for running Mahout jobs, please read the
+"Building Vectors for Large Document Sets" section of [Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce)
+.
+
+In the remaining steps below, remember to replace JOB_ID with the Job ID of
+your EMR cluster.
+
+<a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a>
+##### 2. Load Mahout 0.5+ JAR into S3
+
+These steps were created with a mahout-0.5-SNAPSHOT build because they rely
+on the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
+
+<a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a>
+##### 3. Copy TFIDF Vectors into HDFS
+
+Before running your SVD job on the vectors, you need to copy them from S3
+to your EMR cluster's HDFS.
+
+
+    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
+      --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
+      --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      -j JOB_ID
+
+
+<a name="DimensionalReduction-4.RuntheSVDJob"></a>
+##### 4. Run the SVD Job
+
+Now you're ready to run the SVD job on the vectors stored in HDFS:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg svd \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      --arg -o --arg /asf-mail-archives/mahout/svd \
+      --arg --rank --arg 100 \
+      --arg --numCols --arg 20444 \
+      --arg --numRows --arg 6076937 \
+      --arg --cleansvd --arg "true" \
+      -j JOB_ID
+
+
+This will run 100 iterations of the LanczosSolver SVD job to produce 87
+eigenvectors in:
+
+
+    /asf-mail-archives/mahout/svd/cleanEigenvectors
+
+
+Only 87 eigenvectors were produced because of the cleanup step, which
+removes any duplicate eigenvectors caused by convergence issues or numeric
+overflow, as well as any that don't appear to be "eigen" enough (i.e., they
+don't satisfy the eigenvector criterion with high enough fidelity). - Jake
+Mannix
+
+<a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a>
+##### 5. Transform your TFIDF Vectors into Mahout Matrix
+
+The tfidf vectors created by the seq2sparse job are stored as a
+SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
+vectors into matrix form: a SequenceFile<IntWritable,VectorWritable> plus a
+SequenceFile<IntWritable,Text> (the original data is the join of these two
+new files on the new int key).
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg rowid \
+      --arg -Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
+      --arg -Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix \
+      -j JOB_ID
+
+
+This is not a distributed job and will only run on the master server in
+your EMR cluster. The job produces the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
+
+
+where docIndex is the SequenceFile<IntWritable,Text> and matrix is
+SequenceFile<IntWritable,VectorWritable>.
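+
+If you want to spot-check these outputs (this is not part of the original
+walkthrough), a few rows can be read back with a small Java program. The
+sketch below is an assumption-laden example: it uses the Hadoop 0.20-era
+SequenceFile.Reader API, Mahout's VectorWritable and
+Vector#getNumNondefaultElements() (which may differ between versions),
+assumes it is run on the cluster's master node so the default FileSystem is
+HDFS, and the class name MatrixPeek is made up for illustration.
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.FileSystem;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.IntWritable;
+    import org.apache.hadoop.io.SequenceFile;
+    import org.apache.mahout.math.VectorWritable;
+
+    public class MatrixPeek {
+      public static void main(String[] args) throws Exception {
+        Configuration conf = new Configuration();
+        Path path = new Path(
+            "/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix");
+        FileSystem fs = FileSystem.get(conf);
+        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
+        try {
+          IntWritable key = new IntWritable();         // row id assigned by the rowid job
+          VectorWritable value = new VectorWritable(); // TFIDF vector for that row
+          int rows = 0;
+          // print the first five rows and their number of non-zero entries
+          while (rows < 5 && reader.next(key, value)) {
+            System.out.println(key.get() + " => "
+                + value.get().getNumNondefaultElements() + " non-zero entries");
+            rows++;
+          }
+        } finally {
+          reader.close();
+        }
+      }
+    }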
+
+<a name="DimensionalReduction-6.TransposetheMatrix"></a>
+##### 6. Transpose the Matrix
+
+Our ultimate goal is to multiply the TFIDF vector matrix times our SVD
+eigenvectors. For the mathematically inclined, from the rowid job, we now
+have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
+is p x n (p=87, n=20444). To multiply these two matrices, we need to
+transpose E so that the number of columns in T equals the number of rows in
+E (i.e., E^T is n x p); the result of the matrixmult is then an m x p
+matrix (m=6076937, p=87).
+
+However, in practice, computing the matrix product of two matrices as a
+map-reduce job is efficiently done as a map-side join on two row-based
+matrices with the same number of rows (only their numbers of columns
+differ). In particular, if you take a matrix X represented as a set of
+numRowsX rows, each with numColsX columns, and another matrix Y with
+numRowsY == numRowsX rows, each with numColsY (!= numColsX) columns, then
+by summing the outer products of each of the numRowsX pairs of rows you
+get a matrix with numRowsZ == numColsX and numColsZ == numColsY (if you
+instead take the reverse outer product of the vector pairs, you end up
+with the transpose of this result, with numRowsZ == numColsY and
+numColsZ == numColsX). - Jake Mannix
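+
+To make the outer-product idea concrete, here is a minimal, non-distributed
+sketch in plain Java (no Hadoop or Mahout types); each map task in the real
+job effectively contributes the outer products for its share of the row
+pairs, and the reducers sum them:
+
+    /**
+     * Computes Z = X' * Y by summing outer products of corresponding rows.
+     * X is numRows x numColsX, Y is numRows x numColsY, so Z is
+     * numColsX x numColsY, matching the description above.
+     */
+    static double[][] outerProductSum(double[][] x, double[][] y) {
+      int numRows = x.length;
+      int numColsX = x[0].length;
+      int numColsY = y[0].length;
+      double[][] z = new double[numColsX][numColsY];
+      for (int row = 0; row < numRows; row++) {   // one outer product per row pair
+        for (int i = 0; i < numColsX; i++) {
+          for (int j = 0; j < numColsY; j++) {
+            z[i][j] += x[row][i] * y[row][j];
+          }
+        }
+      }
+      return z;
+    }
+
+This is exactly why both inputs to the matrixmult step below are transposed
+first: the two matrices handed to the job must agree on their row count.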
+
+Thus, we need to transpose the matrix using Mahout's Transpose Job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \
+      --arg --numRows --arg 6076937 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      -j JOB_ID
+
+
+This job requires the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose
+
+
+<a name="DimensionalReduction-7.TransposeEigenvectors"></a>
+##### 7. Transpose Eigenvectors
+
+If you followed Jake's explanation in step 6 above, then you know that we
+also need to transpose the eigenvectors:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg transpose \
+      --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \
+      --arg --numRows --arg 87 \
+      --arg --numCols --arg 20444 \
+      --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+Note: You need to use the same number of reducers that was used for
+transposing the matrix you are multiplying the vectors with.
+
+The job creates the following output:
+
+
+    /asf-mail-archives/mahout/svd/transpose
+
+
+<a name="DimensionalReduction-8.MatrixMultiplication"></a>
+##### 8. Matrix Multiplication
+
+Lastly, we multiply the transposed TFIDF matrix by the transposed
+eigenvectors using Mahout's matrixmult job:
+
+
+    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
+      --main-class org.apache.mahout.driver.MahoutDriver \
+      --arg matrixmult \
+      --arg --numRowsA --arg 20444 \
+      --arg --numColsA --arg 6076937 \
+      --arg --numRowsB --arg 20444 \
+      --arg --numColsB --arg 87 \
+      --arg --inputPathA --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
+      --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \
+      -j JOB_ID
+
+
+This job produces output such as:
+
+
+    /user/hadoop/productWith-189
+
+
+<a name="DimensionalReduction-Resources"></a>
+# Resources
+
+* http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm
+* http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

Added: mahout/site/mahout_cms/content/users/basics/gaussian-discriminative-analysis.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/gaussian-discriminative-analysis.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/gaussian-discriminative-analysis.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/gaussian-discriminative-analysis.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,14 @@
+Title: Gaussian Discriminative Analysis
+<a name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a>
+# Gaussian Discriminative Analysis
+
+Gaussian Discriminative Analysis is a tool for multigroup classification
+based on extending linear discriminant analysis. The paper on the approach
+is located at http://citeseer.ist.psu.edu/4617.html (note: for some reason
+the pages of the paper are in reverse order, with page 1 at the end).
+
+<a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="GaussianDiscriminativeAnalysis-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/content/users/basics/independent-component-analysis.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/independent-component-analysis.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/independent-component-analysis.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/independent-component-analysis.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,11 @@
+Title: Independent Component Analysis
+<a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a>
+# Independent Component Analysis
+
+See also: Principal Component Analysis.
+
+<a name="IndependentComponentAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="IndependentComponentAnalysis-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/content/users/basics/mahout-collections.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/mahout-collections.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/mahout-collections.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/mahout-collections.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,52 @@
+Title: mahout-collections
+<a name="mahout-collections-Introduction"></a>
+# Introduction
+
+The Mahout Collections library is a set of container classes that address
+some limitations of the standard collections in Java. [This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf)
+ describes a number of performance problems with the standard collections. 
+
+Mahout collections addresses two of the more glaring ones: the lack of
+support for primitive types and the lack of open hashing.
+
+<a name="mahout-collections-PrimitiveTypes"></a>
+# Primitive Types
+
+The most visible feature of Mahout Collections is the large collection of
+primitive type collections. Given Java's asymmetrical support for the
+primitive types, the only efficient way to handle them is with many
+classes. So, there are ArrayList-like containers for all of the primitive
+types, and hash maps for all the useful combinations of primitive type and
+object keys and values.
+
+These classes do not, in general, implement interfaces from *java.util*.
+Even when the *java.util* interfaces could be type-compatible, they tend
+to include requirements that are not consistent with efficient use of
+primitive types.
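+
+As a small illustration, the sketch below uses a primitive int-to-double
+map and a primitive double list. The class and package names shown
+(org.apache.mahout.math.map.OpenIntDoubleHashMap and
+org.apache.mahout.math.list.DoubleArrayList, inherited from the Colt code
+base) are assumptions that may differ between Mahout versions, so check the
+JavaDocs of your release:
+
+    import org.apache.mahout.math.list.DoubleArrayList;
+    import org.apache.mahout.math.map.OpenIntDoubleHashMap;
+
+    public class CollectionsExample {
+      public static void main(String[] args) {
+        // int -> double map with no boxing of Integer or Double objects
+        OpenIntDoubleHashMap weights = new OpenIntDoubleHashMap();
+        weights.put(7, 0.25);
+        weights.put(42, 1.5);
+        System.out.println("weight of 42: " + weights.get(42));
+
+        // growable list of primitive doubles, analogous to ArrayList<Double>
+        DoubleArrayList values = new DoubleArrayList();
+        values.add(3.14);
+        values.add(2.71);
+        System.out.println("size: " + values.size() + ", first: " + values.get(0));
+      }
+    }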
+
+<a name="mahout-collections-OpenAddressing"></a>
+# Open Addressing
+
+All of the sets and maps in Mahout Collections are open-addressed hash
+tables. Open addressing has a much smaller memory footprint than chaining.
+Since the purpose of these collections is to avoid the memory cost of
+autoboxing, open addressing is a consistent design choice.
+
+<a name="mahout-collections-Sets"></a>
+# Sets
+
+Mahout Collections includes open hash sets. Unlike *java.util*, a set is
+not a recycled hash table; the sets are separately implemented and do not
+waste any storage on unused map values.
+
+<a name="mahout-collections-CreditwhereCreditisdue"></a>
+# Credit where Credit is due
+
+The implementation of Mahout Collections is derived from [Cern Colt](http://acs.lbl.gov/~hoschek/colt/).
+
+
+
+
+
+

Added: mahout/site/mahout_cms/content/users/basics/mahout.ga.tutorial.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/mahout.ga.tutorial.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/mahout.ga.tutorial.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/mahout.ga.tutorial.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,36 @@
+Title: Mahout.GA.Tutorial
+<a name="Mahout.GA.Tutorial-HowtodistributethefitnessevaluationusingMahout.GA"></a>
+# How to distribute the fitness evaluation using Mahout.GA
+
+In any Watchmaker program, you'll have to create an instance of a
+StandaloneEvolutionEngine. For the TSP example this is done in the
+EvolutionaryTravellingSalesman class:
+
+    private EvolutionEngine<List<String>> getEngine(
+        CandidateFactory<List<String>> candidateFactory,
+        EvolutionaryOperator<List<?>> pipeline, Random rng) {
+      return new StandaloneEvolutionEngine<List<String>>(candidateFactory,
+          pipeline, new RouteEvaluator(distances), selectionStrategy, rng);
+    }
+
+The RouteEvaluator class is where the fitness of each individual is
+evaluated. If we want to distribute the evaluation over a Hadoop cluster,
+all we have to do is wrap the evaluator in a MahoutFitnessEvaluator and,
+instead of a StandaloneEvolutionEngine, use an STEvolutionEngine:
+
+    private EvolutionEngine<List<String>> getEngine(
+        CandidateFactory<List<String>> candidateFactory,
+        EvolutionaryOperator<List<?>> pipeline, Random rng) {
+      MahoutFitnessEvaluator<List<String>> evaluator =
+          new MahoutFitnessEvaluator<List<String>>(new RouteEvaluator(distances));
+      return new STEvolutionEngine<List<String>>(candidateFactory, pipeline,
+          evaluator, selectionStrategy, rng);
+    }
+
+And voila! Your code is ready to run on Hadoop. The complete running
+example is available with the examples in the
+org/apache/mahout/ga/watchmaker/travellingsalesman directory.

Added: mahout/site/mahout_cms/content/users/basics/mahoutintegration.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/mahoutintegration.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/mahoutintegration.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/mahoutintegration.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1 @@
+Title: MahoutIntegration

Added: mahout/site/mahout_cms/content/users/basics/matrix-and-vector-needs.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/matrix-and-vector-needs.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/matrix-and-vector-needs.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/matrix-and-vector-needs.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,82 @@
+Title: Matrix and Vector Needs
+<a name="MatrixandVectorNeeds-Intro"></a>
+# Intro
+
+Most ML algorithms require the ability to represent multidimensional data
+concisely and to be able to easily perform common operations on that data.
+MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
+along with a set of common operations on their instances. Vectors and
+matrices are provided with sparse and dense implementations that are memory
+resident and are suitable for manipulating intermediate results within
+mapper, combiner and reducer implementations. They are not intended for
+applications requiring vectors or matrices that exceed the size of a single
+JVM, though such applications might be able to utilize them within a larger
+organizing framework.
+
+<a name="MatrixandVectorNeeds-Background"></a>
+## Background
+
+See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)
+
+<a name="MatrixandVectorNeeds-Vectors"></a>
+## Vectors
+
+Mahout supports a Vector interface that defines the following operations
+over all implementation classes: assign, cardinality, copy, divide, dot,
+get, haveSharedCells, like, minus, normalize, plus, set, size, times,
+toArray, viewPart, zSum and cross. The class DenseVector implements vectors
+as a double[] that is storage and access efficient. The class SparseVector
+implements vectors as a HashMap<Integer, Double> that is surprisingly fast
+and efficient. For sparse vectors, the size() method returns the current
+number of elements whereas the cardinality() method returns the number of
+dimensions it holds. An additional VectorView class allows views of an
+underlying vector to be specified by the viewPart() method. See the
+JavaDocs for more complete definitions.
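+
+A rough usage sketch against the interface described above follows; the
+package and constructor used here are assumptions (this page predates
+several API revisions), so verify them against the JavaDocs of your
+release:
+
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+
+    public class VectorExample {
+      public static void main(String[] args) {
+        Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
+        Vector b = new DenseVector(new double[] {4.0, 5.0, 6.0});
+
+        double dot = a.dot(b);        // 1*4 + 2*5 + 3*6 = 32
+        Vector sum = a.plus(b);       // element-wise sum
+        Vector scaled = a.times(2.0); // element-wise scaling
+        double total = scaled.zSum(); // sum of all elements: 2 + 4 + 6 = 12
+
+        System.out.println("dot=" + dot + " zSum=" + total + " sum(0)=" + sum.get(0));
+      }
+    }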
+
+<a name="MatrixandVectorNeeds-Matrices"></a>
+## Matrices
+
+Mahout also supports a Matrix interface that defines a similar set of
+operations over all implementation classes: assign, assignColumn,
+assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus,
+plus, set, size, times, transpose, toArray, viewPart and zSum. The class
+DenseMatrix implements matrices as a double[][] that is storage and access
+efficient. The class SparseRowMatrix implements matrices as a Vector[]
+holding the rows of the matrix in a SparseVector, and the symmetric class
+SparseColumnMatrix implements matrices as a Vector[] holding the columns in
+a SparseVector. Each of these classes can quickly produce a given row or
+column, respectively. A fourth class, SparseMatrix, uses a
+HashMap<Integer, Vector> where each value is also a SparseVector. For
+sparse matrices, the size() method returns an int[2] containing the actual
+row and column sizes whereas the cardinality() method returns an int[2]
+with the number of dimensions of each. An additional MatrixView class
+allows views of an underlying matrix to be specified by the viewPart()
+method. See the JavaDocs for more complete definitions.
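+
+Again as a sketch only, with the same caveat that class locations and
+constructors are assumptions to be checked against the JavaDocs:
+
+    import org.apache.mahout.math.DenseMatrix;
+    import org.apache.mahout.math.Matrix;
+
+    public class MatrixExample {
+      public static void main(String[] args) {
+        Matrix m = new DenseMatrix(new double[][] {{1.0, 2.0}, {3.0, 4.0}});
+
+        Matrix t = m.transpose();  // 2 x 2 transpose
+        Matrix p = m.times(t);     // matrix product m * m'
+        Matrix twice = m.plus(m);  // element-wise sum
+
+        System.out.println("p(0,0)=" + p.get(0, 0) + ", twice(1,1)=" + twice.get(1, 1));
+      }
+    }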
+
+The Matrix interface does not currently provide invert or determinant
+methods, though these are desirable. It is arguable that the
+implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
+HashMap<Integer, Vector> implementations and that SparseMatrix should
+instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
+sparse matrices can also be envisioned that support different storage and
+access characteristics. Because the arguments of assignColumn and assignRow
+operations accept all forms of Vector, it is possible to construct
+instances of sparse matrices containing dense rows or columns. See the
+JavaDocs for more complete definitions.
+
+For applications like PageRank/TextRank, iterative approaches to calculate
+eigenvectors would also be useful. Batching of row/column operations would
+also be useful, such as having assignRow or assignColumn accept
+UnaryFunction and BinaryFunction arguments.
+
+
+<a name="MatrixandVectorNeeds-Ideas"></a>
+## Ideas
+
+As Vector and Matrix implementations are currently memory-resident,
+instances larger than available memory are not supported. An extended set
+of implementations that use HBase (BigTable) in Hadoop to represent their
+instances would facilitate applications requiring such large collections.
+See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6) and
+[Hama](http://wiki.apache.org/hadoop/Hama).
+
+
+<a name="MatrixandVectorNeeds-References"></a>
+## References
+
+Have a look at established parallel computing libraries such as [ScaLAPACK](http://www.netlib.org/scalapack/),
+among others.

Added: mahout/site/mahout_cms/content/users/basics/principal-components-analysis.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/principal-components-analysis.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/principal-components-analysis.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/principal-components-analysis.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,23 @@
+Title: Principal Components Analysis
+<a name="PrincipalComponentsAnalysis-PrincipalComponentsAnalysis"></a>
+# Principal Components Analysis
+
+PCA is used to reduce a high dimensional data set to lower dimensions. PCA
+can be used to identify patterns in data and to express the data in a
+lower dimensional space so that similarities and differences are
+highlighted. It is commonly used in face recognition and image compression.
+There are several assumptions and limitations one has to be aware of when
+working with PCA:
+
+* Linearity assumption - data is assumed to be linear combinations of some
+basis. There exist non-linear methods such as kernel PCA that alleviate
+that problem.
+* Principal components are assumed to be orthogonal. ICA tries to cope with
+this limitation.
+* Mean and covariance are assumed to be statistically important.
+* Large variances are assumed to have important dynamics.
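+
+A useful way to connect this to the rest of the dimensional reduction
+documentation: for a mean-centered data matrix X, the principal components
+are the right singular vectors of X (equivalently, the eigenvectors of its
+covariance matrix), so in practice PCA can be computed with the same SVD
+machinery described on the SVD and dimensional reduction pages.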
+
+<a name="PrincipalComponentsAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="PrincipalComponentsAnalysis-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/content/users/basics/quickstart.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/quickstart.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/quickstart.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/quickstart.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,64 @@
+Title: Quickstart
+<a name="Quickstart-DownloadandInstallation"></a>
+# Download and Installation
+
+* [Download and installation](buildingmahout.html)
+
+<a name="Quickstart-RunningtheExamples"></a>
+# Running the Examples
+
+Mahout runs on [Apache Hadoop](http://hadoop.apache.org). So, to run these
+examples, install the latest compatible [Hadoop Common release](http://www.apache.org/dyn/closer.cgi/hadoop/common/)
+(when using a Mahout release, see the release notes for the supported
+Hadoop version; when using trunk, see _parent/pom.xml_).
+
+<a name="Quickstart-Clustering"></a>
+## Clustering
+
+* [Clustering of synthetic control data](clustering-of-synthetic-control-data.html)
+* [Visualizing Sample Clusters](visualizing-sample-clusters.html)
+* [Clustering Seinfeld Episodes](clustering-seinfeld-episodes.html)
+
+
+<a name="Quickstart-Classification"></a>
+## Classification
+
+* [Twenty Newsgroups](twenty-newsgroups.html)
+* [Wikipedia Bayes Example](wikipedia-bayes-example.html)
+
+<a name="Quickstart-GeneticProgramming"></a>
+## Genetic Programming
+
+* [Watchmaker](mahout.ga.tutorial.html)
+
+<a name="Quickstart-DecisionForest"></a>
+## Decision Forest
+
+* [Breiman Example](breiman-example.html)
+* [Partial Implementation](partial-implementation.html)
+
+<a name="Quickstart-Recommendationmining"></a>
+## Recommendation mining
+
+This package comes with four examples, based on the Netflix, BookCrossing,
+GroupLens and Jester data sets.
+
+* [RecommendationExamples](recommendationexamples.html)
+
+<a name="Quickstart-WheretoGoNext"></a>
+# Where to Go Next
+
+* If you are working with text, read more on [Creating Vectors from Text](creating-vectors-from-text.html)
+* To learn more on grouping items by similarity and identifying clusters
+read more on [Clustering your data](clusteringyourdata.html)
+* If you want to classify incoming items into predefined categories, read
+more on [Classifying your data](classifyingyourdata.html)
+* To learn how to mine frequent patterns from your data, read more on [Parallel Frequent Pattern Mining](parallel-frequent-pattern-mining.html)
+* To read more on building recommendation engines, have a look at the [Recommender (Taste) documentation](recommender-documentation.html)
+ and [Taste hadoop commandline](tastecommandline.html)
+
+Go back to the [Main Wiki Page](mahout-wiki.html)
+ for more information. 

Added: mahout/site/mahout_cms/content/users/basics/svd---singular-value-decomposition.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/svd---singular-value-decomposition.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/svd---singular-value-decomposition.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/svd---singular-value-decomposition.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,46 @@
+Title: SVD - Singular Value Decomposition
+Singular Value Decomposition is a form of product decomposition of a
+matrix in which a rectangular matrix A is decomposed into a product U s V'
+where U and V are orthonormal and s is a diagonal matrix. The values of A
+can be real or complex, but the real case dominates applications in machine
+learning. The most prominent properties of the SVD are:
+
+  * The decomposition of any real matrix has only real values
+  * The SVD is unique except for column permutations of U, s and V
+  * If you take only the largest n values of s and set the rest to zero,
+you have a least squares approximation of A with rank n.  This allows SVD
+to be used very effectively in least squares regression and makes partial
+SVD useful.
+  * The SVD can be computed accurately for singular or nearly singular
+matrices.  For a matrix of rank n, only the first n singular values will be
+non-zero.  This allows SVD to be used for solution of singular linear
+systems.  The columns of U and V corresponding to zero singular values
+define the null space of A.
+  * The partial SVD of very large matrices can be computed very quickly
+using stochastic decompositions.  See http://arxiv.org/abs/0909.4061v1 for
+details.  Gradient descent can also be used to compute partial SVD's and is
+very useful where some values of the matrix being decomposed are not known.
+
+In collaborative filtering and text retrieval, it is common to compute the
+partial decomposition of the user x item interaction matrix or the document
+x term matrix.	This allows the projection of users and items (or documents
+and terms) into a common vector space representation that is often referred
+to as the latent semantic representation.  This process is sometimes called
+Latent Semantic Analysis and has been very effective in the analysis of the
+Netflix dataset.
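+
+In symbols, with the partial decomposition A ~ U_k s_k V_k', one common
+convention (for a documents-by-terms or users-by-items matrix A) is to take
+the rows of U_k s_k as the reduced-dimensional coordinates of the documents
+(users) and the rows of V_k s_k as the coordinates of the terms (items);
+other scalings of the singular values are also used in practice.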
+
+Dimension Reduction in Mahout:
+ * https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
+
+See Also:
+ * http://www.kwon3d.com/theory/jkinem/svd.html
+ * http://en.wikipedia.org/wiki/Singular_value_decomposition
+ * http://en.wikipedia.org/wiki/Latent_semantic_analysis
+ * http://en.wikipedia.org/wiki/Netflix_Prize
+ * http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326
+ * http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
+ * http://www.quora.com/What-s-the-best-parallelized-sparse-SVD-code-publicly-available
+ * [understanding Mahout Hadoop SVD thread](http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E)

Added: mahout/site/mahout_cms/content/users/basics/system-requirements.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/system-requirements.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/system-requirements.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/system-requirements.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,11 @@
+Title: System Requirements
+* Java 1.6.x or greater.
+* Maven 3.x to build the source code.
+
+CPU, disk and memory requirements depend on the many choices made in
+implementing your application with Mahout (document size, number of
+documents, and number of hits retrieved, to name a few).
+
+Several of the Mahout algorithms are implemented to work on Hadoop
+clusters. Unless stated otherwise, those implementations work with Hadoop
+0.20.0 or greater.

Added: mahout/site/mahout_cms/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.mdtext (added)
+++ mahout/site/mahout_cms/content/users/basics/tf-idf---term-frequency-inverse-document-frequency.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,15 @@
+Title: TF-IDF - Term Frequency-Inverse Document Frequency
+TF-IDF is a weight measure often used in information retrieval and text
+mining. This weight is a statistical measure used to evaluate how important
+a word is to a document in a collection or corpus. The importance increases
+proportionally to the number of times a word appears in the document but is
+offset by the frequency of the word in the corpus. In other words, if a
+term appears often in a document but also appears often in the corpus as a
+whole, it will get a lower score. Examples are "the", "and" and "it", but
+depending on your source material there may be other words that are very
+common in that domain.
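+
+As a rough illustration of one common formulation, the weight of term t in
+document d is tf(t, d) * log(N / df(t)), where tf(t, d) is the number of
+times t occurs in d, N is the number of documents in the corpus and df(t)
+is the number of documents containing t. For example, a term occurring 3
+times in a document and present in 10 of 1,000 documents gets a weight of
+3 * log(1000/10) = 3 * log(100), whereas a term like "the" that appears in
+nearly every document has log(N/df) close to 0 and hence a weight close to
+0. Exact formulas vary; see the references below.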
+
+
+ See Also:
+ * http://en.wikipedia.org/wiki/Tf%E2%80%93idf
+ * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

Added: mahout/site/mahout_cms/content/users/classification/bayesian-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/bayesian-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/bayesian-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/bayesian-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,53 @@
+Title: bayesian-commandline
+<a name="bayesian-commandline-Introduction"></a>
+# Introduction
+
+This quick start page describes how to run the naive Bayes and
+complementary naive Bayes classification algorithms on a Hadoop cluster.
+
+<a name="bayesian-commandline-Steps"></a>
+# Steps
+
+<a name="bayesian-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+In the examples directory type:
+
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.bayes.<JOB>" \
+      -Dexec.args="<OPTIONS>"
+    mvn -q exec:java \
+      -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.cbayes.<JOB>" \
+      -Dexec.args="<OPTIONS>"
+
+
+<a name="bayesian-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.1 release,
+the job will be mahout-core-0.1.jar.
+* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data into HDFS: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the job: $HADOOP_HOME/bin/hadoop jar
+$MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job
+org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver <OPTIONS>
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="bayesian-commandline-Commandlineoptions"></a>
+# Command line options
+
+    BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver,
+    CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver,
+    BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver
+
+    Usage:
+      [--input <input> --output <output> --help]
+
+    Options
+
+      --input (-i) input        The Path for input Vectors. Must be a
+                                SequenceFile of Writable, Vector.
+      --output (-o) output      The directory pathname for output points.
+      --help (-h)               Print out help.
+