Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/03 22:36:27 UTC

svn commit: r1538467 [17/20] - in /mahout/site/mahout_cms: ./ cgi-bin/ content/ content/css/ content/developers/ content/general/ content/images/ content/js/ content/users/ content/users/basics/ content/users/classification/ content/users/clustering/ c...

Added: mahout/site/mahout_cms/content/users/classification/bayesian.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/bayesian.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/bayesian.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/bayesian.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,66 @@
+Title: Bayesian
+<a name="Bayesian-Intro"></a>
+# Intro
+
+Mahout currently has two implementations of Bayesian classifiers.  One is
+the traditional Naive Bayes approach, and the other is called Complementary
+Naive Bayes.
+
+<a name="Bayesian-Implementations"></a>
+# Implementations
+
+[NaiveBayes](naivebayes.html)
+ ([MAHOUT-9](http://issues.apache.org/jira/browse/MAHOUT-9))
+
+[Complementary Naive Bayes](complementary-naive-bayes.html)
+ ([MAHOUT-60](http://issues.apache.org/jira/browse/MAHOUT-60))
+
+The Naive Bayes implementations in Mahout follow the paper [http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
+Before we get to the actual algorithm, let's discuss the terminology.
+
+Given an input set of classified documents:
+1. j = 0 to N features 
+1. k = 0 to L labels
+
+Then:
+
+1. The Normalized Frequency for a term (feature) in a document is calculated by
+dividing the term frequency by the root mean square of the term frequencies in
+that document.
+1. The Weight Normalized Tf for a given feature in a given label is the sum of
+the Normalized Frequency of the feature across all the documents in the label.
+1. The Weight Normalized Tf-Idf for a given feature in a label is the Tf-Idf
+calculated using the standard Idf multiplied by the Weight Normalized Tf.
+
+Once the Weight Normalized Tf-Idf (W-N-Tf-Idf) is calculated, the final weight
+matrices for Bayes and CBayes are calculated as follows.
+
+We calculate the sum of W-N-Tf-Idf for all the features in a label, called
+Sigma_k or sumLabelWeight.
+
+For Bayes
+
+    Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N  ) ]
+
+For CBayes
+
+We calculate the sum of W-N-Tf-Idf across all labels for a given feature.
+We call this sumFeatureWeight or Sigma_j.
+We also sum the W-N-Tf-Idf weights over every feature,label pair in the
+training set. Call this Sigma_jSigma_k.
+
+Final Weight is calculated as
+
+    Weight = Log [ ( Sigma_j - W-N-Tf-Idf + alpha_i ) / ( Sigma_jSigma_k - Sigma_k + N  ) ]
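+
+As a toy illustration of the two formulas (the values below are made up, and
+Log is taken here as the natural logarithm): suppose that for some feature j
+in label k we have W-N-Tf-Idf = 2.0, Sigma_k = 10.0, Sigma_j = 6.0,
+Sigma_jSigma_k = 40.0, alpha_i = 1.0 and N = 5. Then
+
+    W_{Bayes}  = \log\frac{2.0 + 1.0}{10.0 + 5} = \log 0.2 \approx -1.61
+    W_{CBayes} = \log\frac{6.0 - 2.0 + 1.0}{40.0 - 10.0 + 5} = \log\tfrac{1}{7} \approx -1.95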
+
+
+<a name="Bayesian-Examples"></a>
+# Examples
+
+In Mahout's example code, there are two samples that can be used:
+
+1. [Wikipedia Bayes Example](wikipedia-bayes-example.html)
+ - Classify Wikipedia data.
+
+1. [Twenty Newsgroups](twenty-newsgroups.html)
+ - Classify the classic Twenty Newsgroups data.

Added: mahout/site/mahout_cms/content/users/classification/breiman-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/breiman-example.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/breiman-example.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/breiman-example.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,79 @@
+Title: Breiman Example
+<a name="BreimanExample-Introduction"></a>
+# Introduction
+
+This quick start page shows how to run the Breiman example. It implements
+the test procedure described in Breiman's paper [1].
+The basic algorithm is as follows:
+* repeat I iterations; in each iteration:
+  * keep 10% of the dataset apart as a testing set
+  * build two forests using the training set, one with m=int(log2(M)+1)
+(called Random-Input) and one with m=1 (called Single-Input)
+  * choose the forest that gave the lowest oob error estimate to compute
+the test set error
+  * compute the test set error using the Single-Input forest (test error);
+this demonstrates that even with m=1, Decision Forests give results
+comparable to larger values of m
+  * compute the mean test set error using every tree of the chosen forest
+(tree error); this indicates how well a single Decision Tree performs
+* compute the mean test error over all iterations
+* compute the mean tree error over all iterations
+
+<a name="BreimanExample-Steps"></a>
+# Steps
+<a name="BreimanExample-Downloadthedata"></a>
+## Download the data
+* The current implementation is compatible with the UCI repository file
+format. Here are links to some of the datasets used in Breiman's paper:
+  * glass : http://archive.ics.uci.edu/ml/datasets/Glass+Identification
+  * breast cancer : http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
+  * diabetes : http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
+  * sonar : http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
+  * ionosphere : http://archive.ics.uci.edu/ml/datasets/Ionosphere
+  * vehicle : [http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)](http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes))
+  * german : [http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)](http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
+* Put the data in HDFS:
+
+        $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+<a name="BreimanExample-BuildtheJobfiles"></a>
+## Build the Job files
+* In $MAHOUT_HOME/, run:
+
+        mvn install -DskipTests
+
+<a name="BreimanExample-Generateafiledescriptorforthedataset:"></a>
+## Generate a file descriptor for the dataset: 
+For the glass dataset (glass.data), run:
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/glass.data -f testdata/glass.info -d I 9 N L
+
+The "I 9 N L" string indicates the nature of the variables. which means 1
+ignored(I) attribute, followed by 9 numerical(N) attributes, followed by
+the label(L)
+* you can also use C for categorical (nominal) attributes
+
+<a name="BreimanExample-Runtheexample"></a>
+## Run the example
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>-job.jar org.apache.mahout.classifier.df.BreimanExample -d testdata/glass.data -ds testdata/glass.info -i 10 -t 100
+
+which builds 100 trees (the -t argument) and repeats the test for 10
+iterations (the -i argument).
+* The example outputs the following results:
+  * Selection error : mean test error for the selected forest over all iterations
+  * Single Input error : mean test error for the single-input forest over all iterations
+  * One Tree error : mean single-tree error over all iterations
+  * Mean Random Input Time : mean build time for random-input forests over all iterations
+  * Mean Single Input Time : mean build time for single-input forests over all iterations

Added: mahout/site/mahout_cms/content/users/classification/class-discovery.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/class-discovery.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/class-discovery.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/class-discovery.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,150 @@
+Title: Class Discovery
+<a name="ClassDiscovery-ClassDiscovery"></a>
+# Class Discovery
+
+See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
+
+CDGA uses a Genetic Algorithm to discover a classification rule for a given
+dataset. 
+A dataset can be seen as a table:
+
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>row 1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>row 2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
+<tr><td>row M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+An attribute can be numerical, for example a "temperature" attribute, or
+categorical, for example a "color" attribute. For classification purposes,
+one of the categorical attributes is designated as a *label*, which means
+that its value defines the *class* of the rows.
+A classification rule can be represented as follows:
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr>
+<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr>
+<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+For a given *target* class and a weight *threshold*, the classification
+rule can be read :
+
+
+    for each row of the dataset
+      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
+         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
+         ...
+         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
+        row is part of the target class
+
+
+*Important:* The label attribute is not evaluated by the rule.
+
+The threshold parameter allows some conditions of the rule to be skipped if
+their weight is too small. The operators available depend on the attribute
+types:
+* for numerical attributes, the available operators are '<' and '>='
+* for categorical attributes, the available operators are '!=' and '=='
+
+The "threshold" and "target" are user defined parameters, and because the
+label is always a categorical attribute, the target is the (zero based)
+index of the class label value in all the possible values of the label. For
+example, if the label attribute can have the following values (blue, brown,
+green), then a target of 1 means the "brown" class.
+
+For example, we have the following dataset (the label attribute is "Eyes
+Color"):
+<table>
+<tr><th> </th><th>Age</th><th>Eyes Color</th><th>Hair Color</th></tr>
+<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr>
+<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr>
+<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr>
+</table>
+
+and a classification rule (the label attribute is not part of the rule):
+<table>
+<tr><th> </th><th>Age</th><th>Hair Color</th></tr>
+<tr><td>weight</td><td>0</td><td>1</td></tr>
+<tr><td>operator</td><td>&lt;</td><td>!=</td></tr>
+<tr><td>value</td><td>20</td><td>light</td></tr>
+</table>
+
+and the following parameters: threshold = 1 and target = 0 (brown).
+
+This rule can be read as follows:
+
+    for each row of the dataset
+      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
+         (1 < 1 || (1 >= 1 && row.value2 != light)) then
+        row is part of the "brown Eye Color" class
+
+
+Please note how the rule skipped the label attribute (Eye Color), and how
+the first condition is ignored because its weight is < threshold.
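+
+For illustration only, here is a minimal Java sketch of the rule predicate
+described above; the class, method and variable names are hypothetical and do
+not reflect CDGA's actual API:
+
+    public class RuleSketch {
+
+      /** One condition per attribute: a weight, an operator ("<", ">=", "==", "!=") and a value. */
+      static boolean matches(double[] weights, String[] operators, String[] values,
+                             String[] row, double threshold) {
+        for (int j = 0; j < weights.length; j++) {
+          if (weights[j] < threshold) {
+            continue;                 // condition skipped: its weight is below the threshold
+          }
+          if (!evaluate(row[j], operators[j], values[j])) {
+            return false;             // an active condition failed: row is not in the target class
+          }
+        }
+        return true;                  // every active condition held
+      }
+
+      static boolean evaluate(String rowValue, String op, String ruleValue) {
+        if ("<".equals(op))  return Double.parseDouble(rowValue) <  Double.parseDouble(ruleValue);
+        if (">=".equals(op)) return Double.parseDouble(rowValue) >= Double.parseDouble(ruleValue);
+        if ("==".equals(op)) return rowValue.equals(ruleValue);
+        if ("!=".equals(op)) return !rowValue.equals(ruleValue);
+        throw new IllegalArgumentException("unknown operator " + op);
+      }
+
+      public static void main(String[] args) {
+        // The example rule above: threshold = 1; the label attribute (Eyes Color) is excluded.
+        double[] weights   = {0, 1};
+        String[] operators = {"<", "!="};
+        String[] values    = {"20", "light"};
+        String[] row1      = {"16", "dark"};    // Age and Hair Color of row 1
+        System.out.println(matches(weights, operators, values, row1, 1.0));   // prints true
+      }
+    }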
+
+<a name="ClassDiscovery-Runningtheexample:"></a>
+# Running the example:
+NOTE: Substitute in the appropriate version for the Mahout JOB jar
+
+1. cd <MAHOUT_HOME>/examples
+1. ant job
+1. Put the dataset and its infos file in HDFS:
+
+        <HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc
+        <HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos
+
+1. Run CDGA:
+
+        <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10
+
+CDGA needs 9 parameters:
+* param 1 : path of the directory that contains the dataset and its infos file
+* param 2 : target class
+* param 3 : threshold
+* param 4 : number of crossover points for the multi-point crossover
+* param 5 : mutation rate
+* param 6 : mutation range
+* param 7 : mutation precision
+* param 8 : population size
+* param 9 : number of generations before the program stops
+
+For more information about the 4th parameter, please see [Multi-point Crossover](http://www.geatbx.com/docu/algindex-03.html#P616_36571).
+For a detailed explanation of the 5th, 6th and 7th parameters, please see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386).
+
+*TODO*: Fill in where to find the output and what it means.
+
+# The info file
+
+To run properly, CDGA needs some information about the dataset. Each
+dataset should be accompanied by an .infos file that contains the needed
+information. For each attribute, a corresponding line in the info file
+describes it; it can be one of the following:
+* IGNORED : if the attribute is ignored
+* LABEL, val1, val2,... : if the attribute is the label (class), followed by its possible values
+* CATEGORICAL, val1, val2,... : if the attribute is categorical (nominal), followed by its possible values
+* NUMERICAL, min, max : if the attribute is numerical, followed by its min and max values
+
+This file can be generated automatically using a special tool available with CDGA:
+
+
+* The tool searches for an existing infos file (*must be filled by the
+user*), in the same directory as the dataset and with the same name plus
+the ".infos" extension, that contains the type of each attribute, one
+attribute per line:
+  * 'N' numerical attribute
+  * 'C' categorical attribute
+  * 'L' label (this is also a categorical attribute)
+  * 'I' ignore the attribute
+* A Hadoop job is used to parse the dataset and collect the information.
+This means that *the dataset can be distributed over HDFS*.
+* The results are written back to the same .infos file, with the correct
+format needed by CDGA.

Added: mahout/site/mahout_cms/content/users/classification/classifyingyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/classifyingyourdata.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/classifyingyourdata.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/classifyingyourdata.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,47 @@
+Title: ClassifyingYourData
+*Mahout_0.2*
+
+After you've done the [Quickstart](quickstart.html)
+ and are familiar with the basics of Mahout, it is time to build a
+classifier from your own data. 
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClassifyingYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format
+(which has changed since Mahout 0.1)
+
+* See [Creating Vectors](creating-vectors.html)
+
+<a name="ClassifyingYourData-TextPreparation"></a>
+## Text Preparation
+
+* See [Creating Vectors from Text](creating-vectors-from-text.html)
+* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
+
+<a name="ClassifyingYourData-RunningtheProcess"></a>
+# Running the Process
+
+<a name="ClassifyingYourData-NaiveBayes"></a>
+## Naive Bayes
+
+Background: [Naive Bayes Classification](bayesian.html)
+
+Documentation of running naive bayes from the command line: [bayesian-commandline](bayesian-commandline.html)
+
+<a name="ClassifyingYourData-C-Bayes"></a>
+## C-Bayes
+
+Background: [C-Bayes Classification](https://issues.apache.org/jira/browse/MAHOUT-60)
+
+Documentation of running c-bayes from the command line: [c-bayes-commandline](c-bayes-commandline.html)
+
+<a name="ClassifyingYourData-RandomForests"></a>
+## Random Forests
+
+Background: [Random Forests Classification](random-forests.html)
+
+Documentation of running random forests from the command line: [Breiman Example](breiman-example.html)

Added: mahout/site/mahout_cms/content/users/classification/complementary-naive-bayes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/complementary-naive-bayes.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/complementary-naive-bayes.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/complementary-naive-bayes.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,16 @@
+Title: Complementary Naive Bayes
+<a name="ComplementaryNaiveBayes-Introduction"></a>
+# Introduction
+
+See [MAHOUT-60](http://issues.apache.org/jira/browse/MAHOUT-60).
+
+
+
+
+
+<a name="ComplementaryNaiveBayes-OtherResources"></a>
+# Other Resources
+
+See [NaiveBayes](naivebayes.html)
+ ([MAHOUT-9](http://issues.apache.org/jira/browse/MAHOUT-9))

Added: mahout/site/mahout_cms/content/users/classification/locally-weighted-linear-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/locally-weighted-linear-regression.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/locally-weighted-linear-regression.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/locally-weighted-linear-regression.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,20 @@
+Title: Locally Weighted Linear Regression
+
+<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a>
+# Locally Weighted Linear Regression
+
+Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
+use the data to build a parameterized model. After training, the model is
+used for predictions and the data are generally discarded. In contrast,
+"memory-based" methods are non-parametric approaches that explicitly retain
+the training data, and use it each time a prediction needs to be made.
+Locally weighted regression (LWR) is a memory-based method that performs a
+regression around a point of interest using only training data that are
+"local" to that point. Source:
+http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html
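+
+A common formulation of the idea (stated here as general background rather
+than a description of a specific Mahout implementation): to predict at a query
+point x_q, LWR fits a weighted least-squares model over the stored training
+examples,
+
+    \hat{\beta}(x_q) = \arg\min_{\beta} \sum_i w_i(x_q)\,\bigl(y_i - \beta^\top x_i\bigr)^2,
+    \qquad
+    w_i(x_q) = \exp\!\left(-\frac{\lVert x_i - x_q \rVert^2}{2\tau^2}\right)
+
+where the Gaussian kernel is one common choice of weighting function and the
+bandwidth \tau controls how "local" the fit is.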
+
+<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a>
+## Strategy for parallel regression
+
+<a name="LocallyWeightedLinearRegression-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/content/users/classification/logistic-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/logistic-regression.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/logistic-regression.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/logistic-regression.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,108 @@
+Title: Logistic Regression
+<a name="LogisticRegression-LogisticRegression(SGD)"></a>
+# Logistic Regression (SGD)
+
+Logistic regression is a model used for prediction of the probability of
+occurrence of an event. It makes use of several predictor variables that
+may be either numerical or categorical.
+
+Logistic regression is the standard industry workhorse that underlies many
+production fraud detection and advertising quality and targeting products.
+The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow
+large training sets to be used.
+
+For a more detailed analysis of the approach, have a look at the thesis of
+Paul Komarek:
+
+http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en
+
+See MAHOUT-228 for the main JIRA issue for SGD.
+
+
+<a name="LogisticRegression-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+The bad news is that SGD is an inherently sequential algorithm.  The good
+news is that it is blazingly fast and thus it is not a problem for Mahout's
+implementation to handle training sets of tens of millions of examples. 
+With the down-sampling typical in many data-sets, this is equivalent to a
+dataset with billions of raw training examples.
+
+The SGD system in Mahout is an online learning algorithm which means that
+you can learn models in an incremental fashion and that you can do
+performance testing as your system runs.  Often this means that you can
+stop training when a model reaches a target level of performance.  The SGD
+framework includes classes to do on-line evaluation using cross validation
+(the CrossFoldLearner) and an evolutionary system to do learning
+hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). 
+The AdaptiveLogisticRegression system makes heavy use of threads to
+increase machine utilization.  The way it works is that it runs 20
+CrossFoldLearners in separate threads, each with slightly different
+learning parameters.  As better settings are found, these new settings are
+propagated to the other learners.
+
+<a name="LogisticRegression-Designofpackages"></a>
+## Design of packages
+
+There are three packages that are used in Mahout's SGD system.	These
+include
+
+* The vector encoding package (found in
+org.apache.mahout.vectorizer.encoders)
+
+* The SGD learning package (found in org.apache.mahout.classifier.sgd)
+
+* The evolutionary optimization system (found in org.apache.mahout.ep)
+
+<a name="LogisticRegression-Featurevectorencoding"></a>
+### Feature vector encoding
+
+Because the SGD algorithms need to have fixed length feature vectors and
+because it is a pain to build a dictionary ahead of time, most SGD
+applications use the hashed feature vector encoding system that is rooted
+at FeatureVectorEncoder.
+
+The basic idea is that you create a vector, typically a
+RandomAccessSparseVector, and then you use various feature encoders to
+progressively add features to that vector.  The size of the vector should
+be large enough to avoid feature collisions as features are hashed.
+
+There are specialized encoders for a variety of data types.  You can
+normally encode either a string representation of the value you want to
+encode or you can encode a byte level representation to avoid string
+conversion.  In the case of ContinuousValueEncoder and
+ConstantValueEncoder, it is also possible to encode a null value and pass
+the real value in as a weight.	This avoids numerical parsing entirely in
+case you are getting your training data from a system like Avro.
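+
+A minimal sketch of that flow (the encoder and vector classes are the ones
+named in this section, plus StaticWordValueEncoder, which is assumed here to
+exist in the same package; the feature names and vector size are arbitrary):
+
+    import org.apache.mahout.math.RandomAccessSparseVector;
+    import org.apache.mahout.math.Vector;
+    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
+    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
+    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
+
+    public class EncodingSketch {
+      public static void main(String[] args) {
+        // Fixed-size vector; make it large enough to keep hash collisions rare.
+        Vector v = new RandomAccessSparseVector(10000);
+
+        // Hashed word features plus a constant bias term.
+        FeatureVectorEncoder words = new StaticWordValueEncoder("body");
+        FeatureVectorEncoder bias = new ConstantValueEncoder("intercept");
+
+        for (String token : "the quick brown fox".split(" ")) {
+          words.addToVector(token, v);   // each token is hashed into vector positions
+        }
+        bias.addToVector("", 1.0, v);    // constant feature; its weight is passed in directly
+
+        System.out.println("non-zero features: " + v.getNumNondefaultElements());
+      }
+    }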
+
+Here is a class diagram for the encoders package:
+
+!vector-class-hierarchy.png|border=1!
+
+<a name="LogisticRegression-SGDLearning"></a>
+### SGD Learning
+
+For the simplest applications, you can construct an
+OnlineLogisticRegression and be off and running.  Typically, though, it is
+nice to have running estimates of performance on held out data.  To do
+that, you should use a CrossFoldLearner which keeps a stable of five (by
+default) OnlineLogisticRegression objects.  Each time you pass a training
+example to a CrossFoldLearner, it passes this example to all but one of its
+children as training and passes the example to the last child to evaluate
+current performance.  The children are used for evaluation in a round-robin
+fashion so, if you are using the default 5 way split, all of the children
+get 80% of the training data for training and get 20% of the data for
+evaluation.
+
+To avoid the pesky need to configure learning rates, regularization
+parameters and annealing schedules, you can use the
+AdaptiveLogisticRegression.  This class maintains a pool of
+CrossFoldLearners and adapts learning rates and regularization on the fly
+so that you don't have to.
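+
+A minimal training sketch using the classes named above (the constructor
+arguments, tuning values and toy data are purely illustrative, not
+recommended settings):
+
+    import org.apache.mahout.classifier.sgd.L1;
+    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
+    import org.apache.mahout.math.DenseVector;
+
+    public class SgdSketch {
+      public static void main(String[] args) {
+        // Two categories, three features (position 0 acts as a bias), L1 prior.
+        OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 3, new L1())
+            .learningRate(1.0)
+            .lambda(1.0e-5);
+
+        // Toy data: label is 1 when the second feature is larger than the third.
+        double[][] x = {{1, 2, 0}, {1, 0, 2}, {1, 3, 1}, {1, 1, 3}};
+        int[] y = {1, 0, 1, 0};
+
+        for (int pass = 0; pass < 100; pass++) {
+          for (int i = 0; i < x.length; i++) {
+            learner.train(y[i], new DenseVector(x[i]));   // one SGD update per example
+          }
+        }
+
+        // classifyScalar returns the estimated probability of category 1.
+        System.out.println(learner.classifyScalar(new DenseVector(new double[]{1, 4, 0})));
+      }
+    }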
+
+Here is a class diagram for the classifiers.sgd package.  As you can see,
+the number of twiddlable knobs is pretty large.  For some examples, see the
+TrainNewsGroups example code.
+
+!sgd-class-hierarchy.png|border=1!
+

Added: mahout/site/mahout_cms/content/users/classification/naivebayes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/naivebayes.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/naivebayes.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/naivebayes.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,39 @@
+Title: NaiveBayes
+<a name="NaiveBayes-NaiveBayes"></a>
+# Naive Bayes
+
+Naive Bayes is an algorithm that can be used to classify objects into
+(usually binary) categories. It is one of the most common learning algorithms
+in spam filters. Despite its simplicity and rather naive assumptions it has
+proven to work surprisingly well in practice.
+
+Before applying the algorithm, the objects to be classified need to be
+represented by numerical features. In the case of e-mail spam each feature
+might indicate whether some specific word is present or absent in the mail
+to classify. The algorithm has two phases: learning and application.
+During learning, a set of feature vectors is given to the algorithm, each
+vector labeled with the class of the object it represents. From
+that it is deduced which combinations of features appear with high
+probability in spam messages. Given this information, during application
+one can easily compute the probability of a new message being either spam
+or not.
+
+The algorithm makes several assumptions that are not true for most
+datasets, but that make computations easier. The worst is probably that all
+features of an object are considered independent. In practice, that means
+that having already found the phrase "Statue of Liberty" in a text does not
+influence the probability of seeing the phrase "New York" as well.
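+
+Under that independence assumption, the application phase reduces to the
+familiar Naive Bayes rule; for the spam example above (a standard statement
+of the model, not anything Mahout-specific):
+
+    P(\mathrm{spam} \mid f_1, \dots, f_n) \;\propto\; P(\mathrm{spam}) \prod_{i=1}^{n} P(f_i \mid \mathrm{spam})
+
+where the f_i are the features of the message; the message is labelled spam
+when this quantity exceeds the corresponding one computed for non-spam.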
+
+<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a>
+## Strategy for a parallel Naive Bayes
+
+See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)
+.
+
+
+<a name="NaiveBayes-Examples"></a>
+## Examples
+
+[20Newsgroups](20newsgroups.html)
+ - Example code showing how to train and use the Naive Bayes classifier
+using the 20 Newsgroups data available at [http://people.csail.mit.edu/jrennie/20Newsgroups/](http://people.csail.mit.edu/jrennie/20Newsgroups/)

Added: mahout/site/mahout_cms/content/users/classification/neural-network.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/neural-network.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/neural-network.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/neural-network.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,16 @@
+Title: Neural Network
+<a name="NeuralNetwork-NeuralNetworks"></a>
+# Neural Networks
+
+Neural Networks are a means for classifying multidimensional objects. We
+concentrate on implementing back propagation networks with one hidden layer,
+as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf).
+Those networks are capable of learning not only linear separating hyperplanes
+but arbitrary decision boundaries.
+
+<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a>
+## Strategy for parallel backpropagation network
+
+
+<a name="NeuralNetwork-Designofimplementation"></a>
+## Design of implementation

Added: mahout/site/mahout_cms/content/users/classification/random-forests.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/random-forests.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/random-forests.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/random-forests.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,228 @@
+Title: Random Forests
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+source: [3]
+
+LearnUnprunedTree(*X*,*Y*)
+
+Input: *X*, a matrix of *R* rows and *M* columns, where *X_ij* = the value of
+the *j*'th attribute in the *i*'th input datapoint. Each column consists of
+either all real values or all categorical values.
+Input: *Y*, a vector of *R* elements, where *Y_i* = the output class of the
+*i*'th datapoint. The *Y_i* values are categorical.
+Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes (this
+    includes the case where R < 2), return a Leaf Node predicting the majority
+    output, breaking ties randomly.
+    If all values in Y are the same, return a Leaf Node predicting this value
+    as the output
+    Else
+        select m variables at random out of the M variables
+        For j = 1 .. m
+            If the j'th attribute is categorical
+                IG_j = IG(Y|X_j) (see Information Gain)
+            Else (the j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j) (see Information Gain)
+        Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = subset of rows of X in which X_ij* = v.
+                Let Y^v = corresponding subset of Y
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node, splitting on the j*'th attribute. The
+            number of children equals the number of values of the j*'th
+            attribute, and the v'th child is Child^v
+        Else j* is real-valued; let t be the best split threshold
+            Let X^LO = subset of rows of X in which X_ij* <= t.
+            Let Y^LO = corresponding subset of Y
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = subset of rows of X in which X_ij* > t.
+            Let Y^HI = corresponding subset of Y
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node, splitting on the j*'th attribute. It
+            has two children corresponding to whether the j*'th attribute is
+            above or below the given threshold.
+
+*Note*: there are alternatives to Information Gain for splitting nodes.
+
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+
+source: [3]
+
+#### Nominal attributes
+
+Suppose X can take one of m values V_1, V_2, ..., V_m, with
+P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_m)=p_m. Then:
+
+    H(X) = -sum_{j=1..m} p_j log2 p_j      (the entropy of X)
+    H(Y|X=v) = the entropy of Y among only those records in which X has value v
+    H(Y|X) = sum_j p_j H(Y|X=v_j)
+    IG(Y|X) = H(Y) - H(Y|X)
+
+#### Real-valued attributes
+
+Suppose X is real-valued. Then:
+
+    IG(Y|X:t) = H(Y) - H(Y|X:t)
+    H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
+    IG*(Y|X) = max_t IG(Y|X:t)
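+
+A small worked example of these definitions (numbers chosen purely for
+illustration): suppose Y is binary with 8 positive and 8 negative records, so
+H(Y) = 1 bit, and a nominal attribute X splits them into X=a with 6 positive /
+2 negative and X=b with 2 positive / 6 negative. Then
+
+    H(Y|X=a) = H(Y|X=b) = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.811
+    H(Y|X) = 0.5 \cdot 0.811 + 0.5 \cdot 0.811 = 0.811
+    IG(Y|X) = H(Y) - H(Y|X) = 1 - 0.811 \approx 0.189 \text{ bits}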
+
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+source: [1]
+
+Each tree is grown as follows:
+1. if the number of cases in the training set is *N*, sample *N* cases at
+random, but with replacement, from the original data. This sample will be
+the training set for growing the tree.
+1. if there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M* and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant while the forest is grown.
+1. each tree is grown to the largest extent possible. There is no pruning.
+
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+
+source: [2]
+
+Random Forests are easy to use; the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendations are to pick a large number of trees, as well as
+the square root of the number of variables for *m*.
+
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+    Classify(node, V)
+
+    Input: node, a node of the decision tree; if node.attribute = j then the
+           split at this node is done on the j'th attribute
+    Input: V, a vector of M columns where V_j = the value of the j'th attribute
+    Output: the label of V
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If j is categorical then
+            Let v = V_j
+            Let child^v = the child node corresponding to the attribute's value v
+            Return Classify(child^v, V)
+        Else (j is real-valued)
+            Let t = node.threshold (split threshold)
+            If V_j < t then
+                Let child^LO = the child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = the child node corresponding to (>= t)
+                Return Classify(child^HI, V)
+
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+source: [1]
+
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* put each case left out of the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n*, averaged over all cases, is the _oob error estimate_.
+This has proven to be unbiased in many tests.
+
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+
+source: [1]
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+Please refer to [1] for a detailed description.
+
+<a name="RandomForests-References"></a>
+### References
+
+[1] Random Forests - Classification Description.
+[http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+
+[2] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests Techniques,"
+Working Papers of Faculty of Economics and Business Administration, Ghent
+University, Belgium 04/282, Ghent University, Faculty of Economics and
+Business Administration.
+Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
+
+[3] Decision Trees - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+
+[4] Information Gain - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

Added: mahout/site/mahout_cms/content/users/classification/restricted-boltzmann-machines.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/restricted-boltzmann-machines.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/restricted-boltzmann-machines.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/restricted-boltzmann-machines.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,43 @@
+Title: Restricted Boltzmann Machines
+NOTE: This implementation is a Work-In-Progress, at least until September
+2010.
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375).
+
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an overall
+energy that is evolved over a period of time, until it reaches thermal
+equilibrium. 
+
+However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBM's is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features renders RBM's highly usable for
+parallelization. 
+
+In the Netflix Prize, RBM's offered distinctly orthogonal predictions to
+SVD and k-NN approaches, and contributed immensely to the final solution.
+
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBM's in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - should replicate known results, including those of the Netflix
+Prize
+1. Fast - The implementation uses Map-Reduce, hence, it should be fast
+1. Scale - Should scale to large datasets, with a design whose critical
+parts don't need a dependency between the amount of memory on your cluster
+systems and the size of your dataset
+
+You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm).

Added: mahout/site/mahout_cms/content/users/classification/support-vector-machines.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/support-vector-machines.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/support-vector-machines.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/support-vector-machines.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,37 @@
+Title: Support Vector Machines
+<a name="SupportVectorMachines-SupportVectorMachines"></a>
+# Support Vector Machines
+
+As with Naive Bayes, Support Vector Machines (or SVMs in short) can be used
+to solve the task of assigning objects to classes. However, the way this
+task is solved is completely different from the setting in Naive Bayes.
+
+Each object is considered to be a point in _n_ dimensional feature space,
+_n_ being the number of features used to describe the objects numerically.
+In addition each object is assigned a binary label, let us assume the
+labels are "positive" and "negative". During learning, the algorithm tries
+to find a hyperplane in that space, that perfectly separates positive from
+negative objects.
+It is trivial to think of settings where this might very well be
+impossible. To remedy this situation, objects can be assigned so-called
+slack terms that penalize mistakes made during learning appropriately. That
+way, the algorithm is forced to find the hyperplane that causes the least
+number of mistakes.
+
+Another way to overcome the problem of there being no linear hyperplane to
+separate positive from negative objects is to simply project each feature
+vector into a higher dimensional feature space and search for a linear
+separating hyperplane in that new space. Usually the main problem with
+learning in high dimensional feature spaces is the so-called curse of
+dimensionality. That is, there are fewer learning examples available than
+free parameters to tune. In the case of SVMs this problem is less
+detrimental, as SVMs impose additional structural constraints on their
+solutions: each separating hyperplane needs to have a maximal margin to all
+training examples. In addition, that way, the solution may be based on the
+information encoded in only very few examples.
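+
+The maximal-margin idea with slack terms is usually written as the following
+optimization problem (the standard soft-margin formulation, included here as
+background):
+
+    \min_{w,\,b,\,\xi} \;\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
+    \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0
+
+where the slack variables \xi_i absorb the mistakes described above and C
+trades off margin width against training errors.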
+
+<a name="SupportVectorMachines-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+<a name="SupportVectorMachines-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/content/users/classification/wikipedia-bayes-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/classification/wikipedia-bayes-example.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/classification/wikipedia-bayes-example.mdtext (added)
+++ mahout/site/mahout_cms/content/users/classification/wikipedia-bayes-example.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,40 @@
+Title: Wikipedia Bayes Example
+<a name="WikipediaBayesExample-Intro"></a>
+# Intro
+
+The Mahout Examples source comes with tools for classifying a Wikipedia
+data dump using either the Naive Bayes or Complementary Naive Bayes
+implementations in Mahout.  The example (described below) gets a Wikipedia
+dump and then splits it up into chunks.  These chunks are then further
+split by country.  From these splits, a classifier is trained to predict
+what country an unseen article should be categorized into.
+
+
+<a name="WikipediaBayesExample-Runningtheexample"></a>
+# Running the example
+
+1. Download the Wikipedia data set from [http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2](http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)
+1. Unzip the bz2 file to get enwiki-latest-pages-articles.xml.
+1. Create the directory $MAHOUT_HOME/examples/temp and copy the xml file into
+this directory.
+1. Chunk the data into pieces:
+
+        $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o wikipedia/chunks -c 64
+
+    *We strongly suggest you back up the results to some other place so that
+you don't have to do this step again in case they get accidentally erased.*
+1. This creates the chunks in HDFS. Verify this by executing
+
+        hadoop fs -ls wikipedia/chunks
+
+    which will list all the xml files as chunk-0001.xml and so on.
+1. Create the country-based split of the Wikipedia dataset:
+
+        $MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
+
+1. Verify the creation of the input data set by executing
+
+        hadoop fs -ls wikipediainput
+
+    which will show a part-r-00000 file inside the wikipediainput directory.
+1. Train the classifier:
+
+        $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
+
+    The model file will be available in the wikipediamodel folder in HDFS.
+1. Test the classifier:
+
+        $MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput

Added: mahout/site/mahout_cms/content/users/clustering/20newsgroups.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/20newsgroups.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/20newsgroups.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/20newsgroups.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,5 @@
+Title: 20Newsgroups
+<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a>
+# Naive Bayes using 20 Newsgroups Data
+
+See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)

Added: mahout/site/mahout_cms/content/users/clustering/canopy-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/canopy-clustering.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/canopy-clustering.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/canopy-clustering.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,181 @@
+Title: Canopy Clustering
+<a name="CanopyClustering-CanopyClustering"></a>
+# Canopy Clustering
+
+[Canopy Clustering](http://www.kamalnigam.com/papers/canopy-kdd00.pdf)
+ is a very simple, fast and surprisingly accurate method for grouping
+objects into clusters. All objects are represented as a point in a
+multidimensional feature space. The algorithm uses a fast approximate
+distance metric and two distance thresholds T1 > T2 for processing. The
+basic algorithm is to begin with a set of points and remove one at random.
+Create a Canopy containing this point and iterate through the remainder of
+the point set. At each point, if its distance from the first point is < T1,
+then add the point to the cluster. If, in addition, the distance is < T2,
+then remove the point from the set. This way points that are very close to
+the original will avoid all further processing. The algorithm loops until
+the initial set is empty, accumulating a set of Canopies, each containing
+one or more points. A given point may occur in more than one Canopy.
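+
+A minimal in-memory sketch of the canopy-generation loop just described
+(plain Java with Euclidean distance; it illustrates the algorithm itself,
+not Mahout's CanopyDriver API, and the class and method names are made up):
+
+    import java.util.ArrayList;
+    import java.util.Iterator;
+    import java.util.List;
+
+    public class CanopySketch {
+
+      static double distance(double[] a, double[] b) {
+        double sum = 0;
+        for (int i = 0; i < a.length; i++) {
+          double d = a[i] - b[i];
+          sum += d * d;
+        }
+        return Math.sqrt(sum);
+      }
+
+      /** Each returned canopy is the list of points it contains; element 0 is its seed point. */
+      static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
+        List<List<double[]>> result = new ArrayList<List<double[]>>();
+        List<double[]> remaining = new ArrayList<double[]>(points);
+        while (!remaining.isEmpty()) {
+          double[] seed = remaining.remove(0);      // "remove one at random" (here simply the first)
+          List<double[]> canopy = new ArrayList<double[]>();
+          canopy.add(seed);
+          Iterator<double[]> it = remaining.iterator();
+          while (it.hasNext()) {
+            double[] p = it.next();
+            double d = distance(seed, p);
+            if (d < t1) {
+              canopy.add(p);                        // within T1: the point joins this canopy
+            }
+            if (d < t2) {
+              it.remove();                          // within T2: the point takes no further part
+            }
+          }
+          result.add(canopy);
+        }
+        return result;
+      }
+
+      public static void main(String[] args) {
+        List<double[]> points = new ArrayList<double[]>();
+        points.add(new double[]{0, 0});
+        points.add(new double[]{0.5, 0});
+        points.add(new double[]{4, 4});
+        System.out.println(canopies(points, 3.0, 1.0).size());   // prints 2 for these points
+      }
+    }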
+
+Canopy Clustering is often used as an initial step in more rigorous
+clustering techniques, such as [K-Means Clustering](k-means-clustering.html)
+. By starting with an initial clustering the number of more expensive
+distance measurements can be significantly reduced by ignoring points
+outside of the initial canopies.
+
+<a name="CanopyClustering-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+Looking at the sample Hadoop implementation in [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/)
+ the processing is done in 3 M/R steps:
+1. The data is massaged into suitable input format
+1. Each mapper performs canopy clustering on the points in its input set and
+outputs its canopies' centers
+1. The reducer clusters the canopy centers to produce the final canopy
+centers
+1. The points are then clustered into these final canopies
+
+Some ideas can be found in the [Cluster computing and MapReduce](http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html)
+lecture video series \[by Google(r)\]; Canopy Clustering is discussed in [lecture #4](http://www.youtube.com/watch?v=1ZDybXl212Q).
+Slides can be found [here](https://code.google.com/edu/submissions/mapreduce-minilecture/lec4-clustering.ppt).
+Finally, here is the [Wikipedia page](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm).
+
+<a name="CanopyClustering-Designofimplementation"></a>
+## Design of implementation
+
+The implementation accepts as input Hadoop SequenceFiles containing
+multidimensional points (VectorWritable). Points may be expressed either as
+dense or sparse Vectors and processing is done in two phases: Canopy
+generation and, optionally, Clustering.
+
+<a name="CanopyClustering-Canopygenerationphase"></a>
+### Canopy generation phase
+
+During the map step, each mapper processes a subset of the total points and
+applies the chosen distance measure and thresholds to generate canopies. In
+the mapper, each point which is found to be within an existing canopy will
+be added to an internal list of Canopies. After observing all its input
+vectors, the mapper updates all of its Canopies and normalizes their totals
+to produce canopy centroids which are output, using a constant key
+("centroid") to a single reducer. The reducer receives all of the initial
+centroids and again applies the canopy measure and thresholds to produce a
+final set of canopy centroids which is output (i.e. clustering the cluster
+centroids). The reducer output format is: SequenceFile(Text, Canopy) with
+the _key_ encoding the canopy identifier. 
+
+<a name="CanopyClustering-Clusteringphase"></a>
+### Clustering phase
+
+During the clustering phase, each mapper reads the Canopies produced by the
+first phase. Since all mappers have the same canopy definitions, their
+outputs will be combined during the shuffle so that each reducer (many are
+allowed here) will see all of the points assigned to one or more canopies.
+The output format will then be: SequenceFile(IntWritable,
+WeightedVectorWritable) with the _key_ encoding the canopyId. The
+WeightedVectorWritable has two fields: a double weight and a VectorWritable
+vector. Together they encode the probability that each vector is a member
+of the given canopy.
+
+<a name="CanopyClustering-RunningCanopyClustering"></a>
+## Running Canopy Clustering
+
+The canopy clustering algorithm may be run using a command-line invocation
+on CanopyDriver.main or by making a Java call to CanopyDriver.run(...).
+Both require several arguments:
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout canopy \
+        -i <input vectors directory> \
+        -o <output working directory> \
+        -dm <DistanceMeasure> \
+        -t1 <T1 threshold> \
+        -t2 <T2 threshold> \
+        -t3 <optional reducer T1 threshold> \
+        -t4 <optional reducer T2 threshold> \
+        -cf <optional cluster filter size (default: 0)> \
+        -ow <overwrite output directory if present>
+        -cl <run input vector clustering after computing Canopies>
+        -xm <execution method: sequential or mapreduce>
+
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set as a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.
+1. t1: the T1 distance threshold used for clustering.
+1. t2: the T2 distance threshold used for clustering.
+1. t3: the optional T1 distance threshold used by the reducer for
+clustering. If not specified, T1 is used by the reducer.
+1. t4: the optional T2 distance threshold used by the reducer for
+clustering. If not specified, T2 is used by the reducer.
+1. clusterFilter: the minimum size for canopies to be output by the
+algorithm. Affects both sequential and mapreduce execution modes, and
+mapper and reducer outputs.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the computation is to
+be run in memory using the reference Canopy implementation. Note that the
+sequential implementation performs a single pass through the input vectors
+whereas the MapReduce implementation performs two passes (once in the
+mapper and again in the reducer). The MapReduce implementation will
+typically produce fewer clusters than the sequential implementation as a
+result.
+
+After running the algorithm, the output directory will contain:
+1. clusters-0: a directory containing SequenceFiles(Text, Canopy) produced
+by the algorithm. The Text _key_ contains the cluster identifier of the
+Canopy.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the canopyId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the canopy. For canopy
+clustering, the weights are computed as 1/(1+distance) where the distance
+is between the cluster center and the vector using the chosen
+DistanceMeasure.
+
+<a name="CanopyClustering-Examples"></a>
+# Examples
+
+The following images illustrate Canopy clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](http://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+ for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=[1.0, 1.0] sd=3.0
+* 300 samples m=[1.0, 0.0] sd=0.5
+* 300 samples m=[0.0, 2.0] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. 
+
+!SampleData.png!
+
+In the second image, the resulting canopies are shown superimposed upon the
+sample data. Each canopy is represented by two circles, with radius T1 and
+radius T2.
+
+!Canopy.png!
+
+The third image uses the same values of T1 and T2 but only superimposes
+canopies covering more than 10% of the population. This is a bit better
+representation of the data but it still has lots of room for improvement.
+The advantage of Canopy clustering is that it is single-pass and fast
+enough to iterate runs using different T1, T2 parameters and display
+thresholds.
+
+!Canopy10.png!
+

Added: mahout/site/mahout_cms/content/users/clustering/canopy-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/canopy-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/canopy-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/canopy-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,69 @@
+Title: canopy-commandline
+<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a>
+# Running Canopy Clustering from the Command Line
+Mahout's Canopy clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Canopy on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout canopy <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job
+
+
+<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout canopy -i testdata -o output \
+      -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+
+<a name="canopy-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout canopy -i testdata -o output \
+      -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs, or dump the canopies as sketched below.
+
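+To dump the resulting canopies in a human-readable form, the clusterdump
+utility can be pointed at the output directory. The clusters-0 directory
+name below is an assumption and depends on your run:
+
+    ./bin/mahout clusterdump -i output/clusters-0 -o canopies.txt
+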
+<a name="canopy-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                        Path to job input directory.
+                                                Must be a SequenceFile of
+                                                VectorWritable
+      --output (-o) output                      The directory pathname for output.
+      --overwrite (-ow)                         If present, overwrite the output
+                                                directory before running job
+      --distanceMeasure (-dm) distanceMeasure   The classname of the
+                                                DistanceMeasure. Default is
+                                                SquaredEuclidean
+      --t1 (-t1) t1                             T1 threshold value
+      --t2 (-t2) t2                             T2 threshold value
+      --clustering (-cl)                        If present, run clustering after
+                                                the iterations have taken place
+      --help (-h)                               Print out help
+

Added: mahout/site/mahout_cms/content/users/clustering/cluster-dumper.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/cluster-dumper.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/cluster-dumper.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/cluster-dumper.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,79 @@
+Title: Cluster Dumper
+<a name="ClusterDumper-Introduction"></a>
+# Introduction
+
+Clustering tasks in Mahout output their results as SequenceFiles of (Text,
+Cluster) pairs, where the Text key is a cluster identifier string. To
+analyze this output we need to convert the sequence files to a
+human-readable format, which is achieved using the clusterdump utility.
+
+<a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a>
+# Steps for analyzing cluster output using clusterdump utility
+
+After you've executed a clustering task (either an example or a real-world
+one), you can run clusterdump in two modes:
+1. [Hadoop Environment](#HadoopEnvironment)
+1. [Standalone Java Program](#StandaloneJavaProgram)
+
+<a name="HadoopEnvironment"></a>
+### Hadoop Environment
+
+If you have set up your HADOOP_HOME environment variable, you can use the
+command line utility "mahout" to execute the ClusterDumper on Hadoop. In
+this case we won't need to copy the output clusters to our local machine.
+The utility will read the output clusters present in HDFS and write the
+human-readable cluster values into our local file system. Say you've just
+executed the [synthetic control example](clustering-of-synthetic-control-data.html)
+and want to analyze the output; you can execute a command like the one sketched below.
+
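+A minimal sketch is shown below; the clusters-10 directory name depends on
+how many iterations your run actually performed, so adjust it to match your
+output:
+
+    ./bin/mahout clusterdump --seqFileDir output/clusters-10 \
+      --pointsDir output/clusteredPoints --output clusteranalyze.txt
+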
+<a name="StandaloneJavaProgram"></a>
+### Standalone Java Program
+
+ClusterDumper can also be run from the CLI on your local machine. If your
+HADOOP_HOME environment variable is not set, you can still execute
+ClusterDumper using the "mahout" command line utility.
+1. Get the output data from Hadoop onto your local machine. For example, in
+the case where you've executed a clustering example, use a command like the
+one shown below.
+
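+The command below is a sketch based on the synthetic control example
+elsewhere in this documentation; adjust the paths to your own run:
+
+    $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
+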
+This will create a folder called output inside $MAHOUT_HOME/examples, with
+sub-folders for each set of cluster output and for the clustered points.
+1. Run the clusterdump utility as sketched below.
+
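+A minimal sketch, using the same arguments as the Eclipse launch
+configuration described in the next section (adjust clusters-10 to match
+your run):
+
+    ./bin/mahout clusterdump --seqFileDir $MAHOUT_HOME/examples/output/clusters-10 \
+      --pointsDir $MAHOUT_HOME/examples/output/clusteredPoints \
+      --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
+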
+##### Standalone Java Program through Eclipse
+
+If you are using Eclipse, set up mahout-utils as a project as specified in
+[Working with Maven in Eclipse](buildingmahout.html#mahout_maven_eclipse).
+To execute ClusterDumper.java:
+
+* Under mahout-utils, right-click on ClusterDumper.java
+* Choose Run-As, Run Configurations
+* On the left menu, click on Java Application
+* On the top bar click on "New Launch Configuration"
+* A new launch should be automatically created with project as
+"mahout-utils" and Main Class as
+"org.apache.mahout.utils.clustering.ClusterDumper"
+* In the arguments tab, specify the arguments below, replacing
+<MAHOUT_HOME> with the actual path of your $MAHOUT_HOME:
+
+    --seqFileDir <MAHOUT_HOME>/examples/output/clusters-10
+    --pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints
+    --output <MAHOUT_HOME>/examples/output/clusteranalyze.txt
+
+* Hit run to execute the ClusterDumper using Eclipse.
+Setting breakpoints etc. should just work fine.
+    
+### Reading the output file
+
+This will output the clusters into a file called clusteranalyze.txt inside
+$MAHOUT_HOME/examples/output. Sample data will look like the following:
+
+    CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751,
+    30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605,
+    29.379, 30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867,
+    29.736, 29.760, 30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594,
+    29.493, 29.761, 29.183, 29.517, 29.273, 29.161, 29.215, 29.731, 29.154,
+    29.113, 29.348, 28.981, 29.543, 29.192, 29.479, 29.406, 29.715, 29.344,
+    29.628, 29.074, 29.347, 29.812, 29.058, 29.177, 29.063, 29.607]
+    r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 3.439, 3.472,
+    3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511,
+    3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483,
+    3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731,
+    4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607,
+    4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] }
+
+and so on, where CL-0 is Cluster 0, n=116 is the number of points observed
+by this cluster, c=[29.922, ...] is the center of the cluster as a vector
+and r=[3.463, ...] is the radius of the cluster as a vector.

Added: mahout/site/mahout_cms/content/users/clustering/clustering-of-synthetic-control-data.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/clustering-of-synthetic-control-data.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/clustering-of-synthetic-control-data.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/clustering-of-synthetic-control-data.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,118 @@
+Title: Clustering of synthetic control data
+* [Introduction](#Clusteringofsyntheticcontroldata-Introduction)
+* [Problem description](#Clusteringofsyntheticcontroldata-Problemdescription)
+* [Pre-Prep](#Clusteringofsyntheticcontroldata-Pre-Prep)
+* [Perform Clustering](#Clusteringofsyntheticcontroldata-PerformClustering)
+* [Read / Analyze Output](#Clusteringofsyntheticcontroldata-Read/AnalyzeOutput)
+
+<a name="Clusteringofsyntheticcontroldata-Introduction"></a>
+# Introduction
+
+This example demonstrates clustering of control charts that exhibit a time
+series. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are
+tools used to determine whether a manufacturing or business process is in a
+state of statistical control. The control charts used here are generated /
+simulated over equal time intervals and are available in the UCI machine
+learning database. The data is described
+[here](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html).
+
+<a name="Clusteringofsyntheticcontroldata-Problemdescription"></a>
+# Problem description
+
+A time series of control charts needs to be clustered into close-knit
+groups. The data set we use is synthetic; it resembles real-world
+information in an anonymized format. It contains six different classes
+(Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward
+shift). With these trends occurring in the input data set, the Mahout
+clustering algorithm will cluster the data into the corresponding class
+buckets. By the end of this example, you will have learned how to perform
+clustering using Mahout.
+
+<a name="Clusteringofsyntheticcontroldata-Pre-Prep"></a>
+# Pre-Prep
+
+Make sure you have the following covered before you work through the example.
+1. Input data set. Download it
+[here](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data).
+The input consists of 600 rows and 60 columns. Rows 1 - 100 contain Normal
+data, rows 101 - 200 contain cyclic data, and so on. More info
+[here](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html).
+A sample of the data looks like the following.
+<table>
+<tr><th> _time </th><th> _time+x </th><th> _time+2x </th><th> _time+3x </th><th> .. </th><th> _time+60x </th></tr>
+<tr><td> 28.7812 </td><td> 34.4632 </td><td> 31.3381 </td><td> .. </td><td> .. </td><td> 31.2834 </td></tr>
+<tr><td> 24.8923 </td><td> 25.741 </td><td> 27.5532 </td><td> .. </td><td> .. </td><td> 32.8217 </td></tr>
+<tr><td> .. </td><td> .. </td><td> .. </td><td> .. </td><td> .. </td><td> .. </td></tr>
+<tr><td> 35.5351 </td><td> 41.7067 </td><td> 39.1705 </td><td> 48.3964 </td><td> .. </td><td> 38.6103 </td></tr>
+<tr><td> 24.2104 </td><td> 41.7679 </td><td> 45.2228 </td><td> 43.7762 </td><td> .. </td><td> 48.8175 </td></tr>
+<tr><td> .. </td><td> .. </td><td> .. </td><td> .. </td><td> .. </td><td> .. </td></tr>
+</table>
+1. Set up Hadoop.
+Assuming that you have installed the latest compatible Hadoop, start the
+daemons using `$HADOOP_HOME/bin/start-all.sh`. If you have issues starting
+Hadoop, please reference the [Hadoop quick start guide](http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html).
+Then copy the input to HDFS using:
+
+    $HADOOP_HOME/bin/hadoop fs -mkdir testdata
+    $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata
+
+(HDFS input directory name should be testdata)
+1. Build the Mahout example job.
+Mahout's mahout-examples-$MAHOUT_VERSION.job does the actual clustering
+task, so it needs to be created first. In $MAHOUT_HOME, run one of:
+
+    mvn clean install		   // full build including all unit tests
+    mvn clean install -DskipTests=true // fast build without running unit tests
+
+You will see BUILD SUCCESSFUL once all the corresponding tasks are through.
+The job will be generated in $MAHOUT_HOME/examples/target/ and its name
+will contain the $MAHOUT_VERSION number. For example, when using the Mahout
+0.4 release, the job will be mahout-examples-0.4.job.jar.
+This completes the prerequisites for performing the clustering process
+using Mahout.
+
+<a name="Clusteringofsyntheticcontroldata-PerformClustering"></a>
+# Perform Clustering
+
+With all the pre-work done, clustering the control data is straightforward.
+1. Depending on which clustering technique you want to use, invoke the
+corresponding job, as sketched below. For example, for
+[meanshift](mean-shift-clustering.html):
+
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
+
+The same pattern applies to [canopy](canopy-clustering.html),
+[kmeans](k-means-clustering.html), [fuzzykmeans](fuzzy-k-means.html) and
+[dirichlet](dirichlet-process-clustering.html).
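+The following invocations are sketches by analogy with the meanshift
+command above; the driver class names are assumptions and may differ
+between Mahout versions, so verify them against your release before
+running:
+
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
+    $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
+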
+1. Get the data out of HDFS and have a look by following the steps in the
+next section. (See the [HDFS Shell](http://hadoop.apache.org/core/docs/current/hdfs_shell.html)
+documentation for the relevant commands. Note that the output directory is
+cleared when a new run starts, so retrieve the results before starting a
+new run. Also, all jobs run ClusterDump after clustering, with output data
+sent to the console.)
+
+<a name="Clusteringofsyntheticcontroldata-Read/AnalyzeOutput"></a>
+# Read / Analyze Output
+In order to read/analyze the output, you can use the [clusterdump](cluster-dumper.html)
+utility provided by Mahout. If you just want to read the output, follow
+the steps below.
+1. Use `$HADOOP_HOME/bin/hadoop fs -lsr output` to view all outputs.
+1. Use `$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples` to
+copy them all to your local machine; the output data points are in vector
+format. This creates an output folder inside the examples directory.
+1. Computed clusters are contained in _output/clusters-i_.
+1. All clustered result points are placed into _output/clusteredPoints_.
+

Added: mahout/site/mahout_cms/content/users/clustering/clustering-seinfeld-episodes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/clustering-seinfeld-episodes.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/clustering-seinfeld-episodes.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/clustering-seinfeld-episodes.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,5 @@
+Title: Clustering Seinfeld Episodes
+Below is short tutorial on how to cluster Seinfeld episode transcripts with
+Mahout.
+
+http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/

Added: mahout/site/mahout_cms/content/users/clustering/clusteringyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/clusteringyourdata.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/clusteringyourdata.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/clusteringyourdata.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,156 @@
+Title: ClusteringYourData
+*Mahout 0.8*
+
+After you've done the [Quickstart](quickstart.html)
+ and are familiar with the basics of Mahout, it is time to cluster your own
+data. 
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClusteringYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format
+(which has changed since Mahout 0.1)
+
+* See [Creating Vectors](creating-vectors.html)
+
+<a name="ClusteringYourData-TextPreparation"></a>
+## Text Preparation
+
+* See [Creating Vectors from Text](creating-vectors-from-text.html)
+*
+http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
+
+<a name="ClusteringYourData-RunningtheProcess"></a>
+# Running the Process
+
+<a name="ClusteringYourData-Canopy"></a>
+## Canopy
+
+Background: [Canopy Clustering](canopy-clustering.html)
+
+Documentation of running canopy from the command line: [canopy-commandline](canopy-commandline.html)
+
+<a name="ClusteringYourData-kMeans"></a>
+## kMeans
+
+Background: [K-Means Clustering](k-means-clustering.html)
+
+Documentation of running kMeans from the command line: [k-means-commandline](k-means-commandline.html)
+
+Documentation of running fuzzy kMeans from the command line: [fuzzy-k-means-commandline](fuzzy-k-means-commandline.html)
+
+<a name="ClusteringYourData-Dirichlet"></a>
+## Dirichlet
+
+Background: [Dirichlet Process Clustering](dirichlet-process-clustering.html)
+
+Documentation of running dirichlet from the command line: [dirichlet-commandline](dirichlet-commandline.html)
+
+<a name="ClusteringYourData-Mean-shift"></a>
+## Mean-shift
+
+Background: [Mean Shift Clustering](mean-shift-clustering.html)
+
+Documentation of running mean shift from the command line: [mean-shift-commandline](mean-shift-commandline.html)
+
+<a name="ClusteringYourData-LatentDirichletAllocation"></a>
+## Latent Dirichlet Allocation
+
+Background and documentation: [Latent Dirichlet Allocation](latent-dirichlet-allocation.html)
+
+Documentation of running LDA from the command line: [lda-commandline](lda-commandline.html)
+
+<a name="ClusteringYourData-RetrievingtheOutput"></a>
+# Retrieving the Output
+
+Mahout has a cluster dumper utility that can be used to retrieve and
+evaluate your clustering data.
+
+    ./bin/mahout clusterdump <OPTIONS>
+
+
+<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
+## The cluster dumper options are:
+
+      --help (-h)                              Print out help
+      --input (-i) input                       The directory containing Sequence
+                                               Files for the Clusters
+      --output (-o) output                     The output file. If not specified,
+                                               dumps to the console.
+      --outputFormat (-of) outputFormat        The optional output format to write
+                                               the results as. Options: TEXT,
+                                               CSV, or GRAPH_ML
+      --substring (-b) substring               The number of chars of the
+                                               asFormatString() to print
+      --pointsDir (-p) pointsDir               The directory containing points
+                                               sequence files mapping input vectors
+                                               to their cluster. If specified,
+                                               then the program will output the
+                                               points associated with a cluster
+      --dictionary (-d) dictionary             The dictionary file.
+      --dictionaryType (-dt) dictionaryType    The dictionary file type
+                                               (text|sequencefile)
+      --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
+                                               Default is SquaredEuclidean.
+      --numWords (-n) numWords                 The number of top terms to print
+      --tempDir tempDir                        Intermediate output directory
+      --startPhase startPhase                  First phase to run
+      --endPhase endPhase                      Last phase to run
+      --evaluate (-e)                          Run ClusterEvaluator and CDbwEvaluator
+                                               over the input. The output will be
+                                               appended to the rest of the output
+                                               at the end.
+
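+As a concrete sketch of how a few of these options combine (the directory
+names are assumptions taken from the synthetic control example and should
+be adjusted to your own run):
+
+    ./bin/mahout clusterdump -i output/clusters-10 -p output/clusteredPoints \
+      -o clusteranalyze.txt
+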
+
+More information on using clusterdump utility can be found [here](cluster-dumper.html)
+
+<a name="ClusteringYourData-ValidatingtheOutput"></a>
+# Validating the Output
+
+From Ted Dunning's response at
+http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output:
+
+A principled approach to cluster evaluation is to measure how well the
+cluster membership captures the structure of unseen data.  A natural
+measure for this is to measure how much of the entropy of the data is
+captured by cluster membership.  For k-means and its natural L_2 metric,
+the natural cluster quality metric is the squared distance from the nearest
+centroid adjusted by the log_2 of the number of clusters.  This can be
+compared to the squared magnitude of the original data or the squared
+deviation from the centroid for all of the data.  The idea is that you are
+changing the representation of the data by allocating some of the bits in
+your original representation to represent which cluster each point is in. 
+If those bits aren't made up by the residue being small then your
+clustering is making a bad trade-off.
+
+In the past, I have used other more heuristic measures as well.  One of the
+key characteristics that I would like to see out of a clustering is a
+degree of stability.  Thus, I look at the fractions of points that are
+assigned to each cluster or the distribution of distances from the cluster
+centroid. These values should be relatively stable when applied to held-out
+data.
+
+For text, you can actually compute perplexity which measures how well
+cluster membership predicts what words are used.  This is nice because you
+don't have to worry about the entropy of real valued numbers.
+
+Manual inspection and the so-called laugh test is also important.  The idea
+is that the results should not be so ludicrous as to make you laugh.
+Unfortunately, it is pretty easy to kid yourself into thinking your system
+is working using this kind of inspection.  The problem is that we are too
+good at seeing (making up) patterns.
+
+
+<a name="ClusteringYourData-References"></a>
+# References
+
+* [Mahout archive references](http://www.lucidimagination.com/search/p:mahout?q=clustering)

Added: mahout/site/mahout_cms/content/users/clustering/dirichlet-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/content/users/clustering/dirichlet-commandline.mdtext?rev=1538467&view=auto
==============================================================================
--- mahout/site/mahout_cms/content/users/clustering/dirichlet-commandline.mdtext (added)
+++ mahout/site/mahout_cms/content/users/clustering/dirichlet-commandline.mdtext Sun Nov  3 21:36:23 2013
@@ -0,0 +1,91 @@
+Title: dirichlet-commandline
+<a name="dirichlet-commandline-RunningDirichletProcessClusteringfromtheCommandLine"></a>
+# Running Dirichlet Process Clustering from the Command Line
+Mahout's Dirichlet clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Dirichlet on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout dirichlet <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job
+
+
+<a name="dirichlet-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
+
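+For instance, a concrete invocation might look like the sketch below; the
+values chosen for -k, -x and -m are purely illustrative (see the command
+line options section below for what each flag controls):
+
+    ./bin/mahout dirichlet -i testdata -o output -k 10 -x 5 -m 1.0 -ow
+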
+
+<a name="dirichlet-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout dirichlet -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="dirichlet-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                             Path to job input directory.
+                                                     Must be a SequenceFile of
+                                                     VectorWritable
+      --output (-o) output                           The directory pathname for
+                                                     output.
+      --overwrite (-ow)                              If present, overwrite the
+                                                     output directory before
+                                                     running job
+      --modelDistClass (-md) modelDistClass          The ModelDistribution class
+                                                     name. Defaults to
+                                                     NormalModelDistribution
+      --modelPrototypeClass (-mp) prototypeClass     The ModelDistribution prototype
+                                                     Vector class name. Defaults to
+                                                     RandomAccessSparseVector
+      --maxIter (-x) maxIter                         The maximum number of
+                                                     iterations.
+      --alpha (-m) alpha                             The alpha0 value for the
+                                                     DirichletDistribution.
+                                                     Defaults to 1.0
+      --k (-k) k                                     The number of clusters to
+                                                     create
+      --help (-h)                                    Print out help
+      --maxRed (-r) maxRed                           The number of reduce tasks.
+                                                     Defaults to 2
+      --clustering (-cl)                             If present, run clustering
+                                                     after the iterations have
+                                                     taken place
+      --emitMostLikely (-e) emitMostLikely           True if clustering should emit
+                                                     the most likely point only,
+                                                     false for threshold clustering.
+                                                     Default is true
+      --threshold (-t) threshold                     The pdf threshold used for
+                                                     cluster determination.
+                                                     Default is 0
+