Posted to commits@mahout.apache.org by sr...@apache.org on 2012/07/12 11:26:03 UTC

svn commit: r1360593 [17/17] - in /mahout/site/trunk: ./ cgi-bin/ content/ content/attachments/ content/attachments/101992/ content/attachments/116559/ content/attachments/22872433/ content/attachments/22872443/ content/attachments/23335706/ content/at...

Added: mahout/site/trunk/content/support-vector-machines.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/support-vector-machines.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/support-vector-machines.mdtext (added)
+++ mahout/site/trunk/content/support-vector-machines.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,37 @@
+Title: Support Vector Machines
+<a name="SupportVectorMachines-SupportVectorMachines"></a>
+# Support Vector Machines
+
+As with Naive Bayes, Support Vector Machines (SVMs for short) can be used
+to solve the task of assigning objects to classes. However, the way this
+task is solved is completely different from the setting in Naive Bayes.
+
+Each object is considered to be a point in _n_-dimensional feature space,
+_n_ being the number of features used to describe the objects numerically.
+In addition, each object is assigned a binary label; let us assume the
+labels are "positive" and "negative". During learning, the algorithm tries
+to find a hyperplane in that space that perfectly separates positive from
+negative objects.
+It is easy to think of settings where such a perfect separation is
+impossible. To remedy this, objects can be assigned so-called
+slack terms that penalize mistakes made during learning. That
+way, the algorithm is forced to find the hyperplane that causes the
+fewest mistakes.
+
+Another way to overcome the problem of there being no linear hyperplane
+that separates positive from negative objects is to project each feature
+vector into a higher-dimensional feature space and search for a linear
+separating hyperplane in that new space. Usually the main problem with
+learning in high-dimensional feature spaces is the so-called curse of
+dimensionality: there are fewer training examples available than
+free parameters to tune. For SVMs this problem is less
+detrimental, as SVMs impose an additional structural constraint on their
+solution: the separating hyperplane must have a maximal margin to all
+training examples. As a side effect, the solution may be based on the
+information encoded in only very few examples, the so-called support
+vectors.
+
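+The sketch below (a hypothetical helper class, not part of the Mahout API)
+illustrates the two quantities discussed above: the linear decision function
+defined by a hyperplane, and the hinge loss that the slack terms correspond
+to. An example on the wrong side of the margin incurs a positive loss, and
+the learner trades the total loss against the width of the margin.
+
+    // Illustration only: a linear decision function and its hinge loss.
+    public class LinearSvmSketch {
+
+      /** w and b define the separating hyperplane; x is a feature vector. */
+      static double decisionValue(double[] w, double b, double[] x) {
+        double dot = 0.0;
+        for (int i = 0; i < w.length; i++) {
+          dot += w[i] * x[i];
+        }
+        return dot + b; // the sign of this value is the predicted class
+      }
+
+      /** Hinge loss for a label y in {-1, +1}; zero for examples outside the margin. */
+      static double hingeLoss(double y, double decisionValue) {
+        return Math.max(0.0, 1.0 - y * decisionValue);
+      }
+    }
+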
+<a name="SupportVectorMachines-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+<a name="SupportVectorMachines-Designofpackages"></a>
+## Design of packages

Added: mahout/site/trunk/content/svd---singular-value-decomposition.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/svd---singular-value-decomposition.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/svd---singular-value-decomposition.mdtext (added)
+++ mahout/site/trunk/content/svd---singular-value-decomposition.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,46 @@
+Title: SVD - Singular Value Decomposition
+{excerpt}Singular Value Decomposition is a form of product decomposition of
+a matrix in which a rectangular matrix A is decomposed into a product U s
+V' where U and V are orthonormal and s is a diagonal matrix.{excerpt}  The
+values of A can be real or complex, but the real case dominates
+applications in machine learning.  The most prominent properties of the SVD
+are:
+
+  * The decomposition of any real matrix has only real values
+  * The SVD is unique except for column permutations of U, s and V
+  * If you take only the largest n values of s and set the rest to zero,
+you have a least squares approximation of A with rank n.  This allows SVD
+to be used very effectively in least squares regression and makes partial
+SVD useful.
+  * The SVD can be computed accurately for singular or nearly singular
+matrices.  For a matrix of rank n, only the first n singular values will be
+non-zero.  This allows SVD to be used for solution of singular linear
+systems.  The columns of U and V corresponding to zero singular values
+define the null space of A.
+  * The partial SVD of very large matrices can be computed very quickly
+using stochastic decompositions.  See http://arxiv.org/abs/0909.4061v1 for
+details.  Gradient descent can also be used to compute partial SVDs and is
+very useful where some values of the matrix being decomposed are not known.
+
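+As a point of reference, the rank-n truncation mentioned above can be
+written as follows (the standard Eckart-Young statement of the least
+squares property):
+
+    A \approx A_n = U_n \Sigma_n V_n^{T}, \qquad
+    A_n = \underset{\mathrm{rank}(B) \le n}{\arg\min} \, \lVert A - B \rVert_F
+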
+In collaborative filtering and text retrieval, it is common to compute the
+partial decomposition of the user x item interaction matrix or the document
+x term matrix.	This allows the projection of users and items (or documents
+and terms) into a common vector space representation that is often referred
+to as the latent semantic representation.  This process is sometimes called
+Latent Semantic Analysis and has been very effective in the analysis of the
+Netflix dataset.
+
+Dimension Reduction in Mahout:
+ * [Dimensional Reduction](dimensional-reduction.html)
+
+ See Also:
+ * http://www.kwon3d.com/theory/jkinem/svd.html
+ * http://en.wikipedia.org/wiki/Singular_value_decomposition
+ * http://en.wikipedia.org/wiki/Latent_semantic_analysis
+ * http://en.wikipedia.org/wiki/Netflix_Prize
+ *
+http://www.amazon.com/Understanding-Complex-Datasets-Decompositions-Knowledge/dp/1584888326
+ * http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
+ *
+http://www.quora.com/What-s-the-best-parallelized-sparse-SVD-code-publicly-available
+ * [understanding Mahout Hadoop SVD thread](http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTinQ5K4XrM7naBWn8qoBXZGVobBot2RtjZSV4yOd@mail.gmail.com%3E)

Added: mahout/site/trunk/content/system-requirements.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/system-requirements.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/system-requirements.mdtext (added)
+++ mahout/site/trunk/content/system-requirements.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,11 @@
+Title: System Requirements
+* Java 1.6.x or greater.
+* Maven 2.x to build the source code.
+
+CPU, disk and memory requirements depend on the many choices made in
+implementing your application with Mahout (document size, number of
+documents, and number of hits retrieved, to name a few).
+
+Several of the Mahout algorithms are implemented to work on Hadoop
+clusters. Unless noted otherwise, those implementations work with
+Hadoop 0.20.0 or greater.

Added: mahout/site/trunk/content/tastecommandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/tastecommandline.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/tastecommandline.mdtext (added)
+++ mahout/site/trunk/content/tastecommandline.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,41 @@
+Title: TasteCommandLine
+<a name="TasteCommandLine-Introduction"></a>
+# Introduction 
+
+This quick start page describes how to run the Hadoop-based recommendation
+jobs of Mahout Taste on a Hadoop cluster.
+
+<a name="TasteCommandLine-Steps"></a>
+# Steps 
+
+<a name="TasteCommandLine-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster 
+
+In the examples directory type, for example: 
+
+    mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob" -Dexec.args="<OPTIONS>"
+
+
+<a name="TasteCommandLine-Runningitonthecluster"></a>
+## Running it on the cluster 
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.jar
+* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data into HDFS: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job org.apache.mahout.cf.taste.hadoop.<JOB> <OPTIONS>
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="TasteCommandLine-Commandlineoptions"></a>
+# Command line options 
+
+Specify only the command line option "--help" for a complete summary of
+available command line options. Or, refer to the javadoc for the "Job"
+class being run.

Added: mahout/site/trunk/content/testing.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/testing.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/testing.mdtext (added)
+++ mahout/site/trunk/content/testing.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,41 @@
+Title: Testing
+<a name="Testing-Intro"></a>
+# Intro
+
+As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.
+
+<a name="Testing-TestPlans"></a>
+# Test Plans
+
+* [0.6](0.6.html) - Test Plans for the 0.6 release.
+There are no special plans except for unit tests and user testing of the
+Hadoop jobs.
+
+<a name="Testing-TestIdeas"></a>
+# Test Ideas
+
+<a name="Testing-Regressions/Benchmarks/Integrations"></a>
+## Regressions/Benchmarks/Integrations
+* Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (in the megabytes). 
+* Standardized speed tests are difficult to compare across different hardware.
+* Unit tests of external integrations require access to externals: HDFS,
+S3, JDBC, Cassandra, etc. 
+
+Apache Jenkins is not able to support these environments. Commercial
+donations would help. 
+
+<a name="Testing-UnitTests"></a>
+## Unit Tests
+Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally supply a few numbers to code paths and verify that expected
+numbers come out. 'mvn test' runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percent include
+"negative" coverage (bogus setups, inputs, combinations).
+
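+A minimal sketch of that pattern (the class and the expected values here are
+hypothetical, shown only to illustrate the style of test described above):
+
+    import org.junit.Test;
+    import static org.junit.Assert.assertEquals;
+
+    // Illustration of the "supply a few numbers, verify the output" pattern.
+    public class EuclideanDistanceSketchTest {
+
+      // Plain helper standing in for the code path under test.
+      static double distance(double[] a, double[] b) {
+        double sum = 0.0;
+        for (int i = 0; i < a.length; i++) {
+          double d = a[i] - b[i];
+          sum += d * d;
+        }
+        return Math.sqrt(sum);
+      }
+
+      @Test
+      public void testKnownDistance() {
+        // 3-4-5 triangle: the expected number is known in advance.
+        assertEquals(5.0, distance(new double[] {0, 0}, new double[] {3, 4}), 1e-9);
+      }
+    }
+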
+<a name="Testing-Other"></a>
+## Other
+

Added: mahout/site/trunk/content/tf-idf---term-frequency-inverse-document-frequency.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/tf-idf---term-frequency-inverse-document-frequency.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/tf-idf---term-frequency-inverse-document-frequency.mdtext (added)
+++ mahout/site/trunk/content/tf-idf---term-frequency-inverse-document-frequency.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,15 @@
+Title: TF-IDF - Term Frequency-Inverse Document Frequency
+{excerpt}TF-IDF is a weighting scheme often used in information retrieval and text
+mining. This weight is a statistical measure used to evaluate how important
+a word is to a document in a collection or corpus. The importance increases
+proportionally to the number of times a word appears in the document but is
+offset by the frequency of the word in the corpus.{excerpt} In other words,
+if a term appears frequently in a document but also appears frequently in the
+corpus as a whole, it will get a lower score. Typical examples are
+"the", "and" and "it", but depending on your source material there may be
+other words that are very common in that domain.
+
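+As a rough sketch, one common variant of the weighting (Mahout's own
+implementation may differ in its normalization choices) looks like this:
+
+    // One common tf-idf variant: raw term frequency times log inverse document frequency.
+    public class TfIdfSketch {
+
+      /**
+       * @param termFreqInDoc number of times the term occurs in the document
+       * @param numDocs       total number of documents in the corpus
+       * @param docsWithTerm  number of documents that contain the term
+       */
+      static double tfIdf(int termFreqInDoc, int numDocs, int docsWithTerm) {
+        double idf = Math.log((double) numDocs / docsWithTerm);
+        return termFreqInDoc * idf;
+      }
+    }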
+
+ See Also:
+ * http://en.wikipedia.org/wiki/Tf%E2%80%93idf
+ * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

Added: mahout/site/trunk/content/thirdparty-dependencies.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/thirdparty-dependencies.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/thirdparty-dependencies.mdtext (added)
+++ mahout/site/trunk/content/thirdparty-dependencies.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,23 @@
+Title: Thirdparty Dependencies
+If you have a dependency on a third party artifact that is not in Maven,
+you should:
+
+1. Ask the project to add it if at all possible.  Most open source projects
+want wider adoption, so this kind of request is often well received.
+1. If they won't add it, we may be able to add it to our Maven repo,
+assuming it can be published at the ASF at all (no GPL code, for instance).
+ Please ask on the mailing list first.
+1. Assuming it can be, then you need to sign and deploy the artifacts, as
+described below:
+  mvn gpg:sign-and-deploy-file \
+    -Durl=https://repository.apache.org/service/local/staging/deploy/maven2 \
+    -DrepositoryId=apache.releases.https -DgroupId=org.apache.mahout.foobar \
+    -DartifactId=foobar -Dversion=x.y -Dpackaging=jar -Dfile=foobar-x.y.jar
+1. Once it is deployed, go into http://repository.apache.org/ (use your SVN
+credentials to log in)
+1. Select Staging
+1. Find your repository artifacts
+1. Close them (this makes them publicly available, since you are closing the
+staging repo)
+1. Promote them. This adds them to the public Maven repo.
+

Added: mahout/site/trunk/content/top-down-clustering.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/top-down-clustering.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/top-down-clustering.mdtext (added)
+++ mahout/site/trunk/content/top-down-clustering.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,112 @@
+Title: Top Down Clustering
+<a name="TopDownClustering-TopDownClustering"></a>
+## Top Down Clustering
+
+Top-down clustering is a type of hierarchical clustering. It tries to find
+bigger clusters first and then does fine-grained clustering on these
+clusters, hence the name "top down".
+
+Any clustering algorithm can be used to perform the top-level clustering
+(finding bigger clusters) and the bottom-level clustering (fine-grained
+clustering on each of the top-level clusters). So all clustering
+algorithms available in Mahout, other than the MinHash clustering algorithm
+(which is a "bottom up" clustering algorithm), are suitable for
+top-down clustering, on both the top level and the bottom level.
+
+The top-level clustering output needs to be post-processed in order to
+identify all top-level clusters and to group vectors into their respective
+top-level clusters, so that the bottom-level clustering can run on
+each of them.
+
+The first step in top-down clustering is to run any clustering algorithm
+of your choice, preferably with clustering parameters that will produce
+bigger clusters. This is the top-level clustering.
+
+Then the output of this clustering should be post-processed to group the
+vectors into their respective top-level clusters. This can be done using
+*ClusterOutputPostProcessorDriver*.
+
+<a name="TopDownClustering-Designofimplementation"></a>
+## Design of implementation
+
+When any clustering algorithm runs, the output path stores data in two
+directories:
+
+*clusteredPoints*
+
+*clusters-0-final*
+
+The clusteredPoints directory contains information in the form of
+_(clusterId, vector)_ pairs.
+
+The clusters-*-final directory will hold the cluster centroids.
+
+Now, to run clustering further on the clusters found, the vectors belonging
+to different clusters need to be stored in separate directories. This can
+be done using *ClusterOutputPostProcessorDriver*, as explained in the
+_Usage_ section.
+
+*ClusterOutputPostProcessorDriver* will need this output path as its input,
+and it will segregate it into separate clusters.
+
+After post-processing, if you check the output path provided to
+*ClusterOutputPostProcessorDriver*, you will find directories named after
+the cluster IDs, i.e. 0, 1, 2, ..., 20, 21, 22, 23, 24, 25, ...
+
+Each of these directories stores files containing the vectors for that
+particular cluster. Each directory can then be provided as input
+to the bottom-level clustering algorithm, one by one. The bottom-level
+clustering algorithm then clusters each of the top-level clusters as per
+the algorithm used. A sketch of the resulting directory layout is shown
+below.
+
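+For illustration only (the actual directory and file names depend on your
+output paths and on the clustering algorithm used), the layout after
+post-processing looks roughly like this:
+
+    topLevelOutput/
+      clusteredPoints/
+      clusters-0-final/
+    postProcessedOutput/
+      0/    <- vectors belonging to cluster 0
+      1/    <- vectors belonging to cluster 1
+      2/    <- ...
+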
+<a name="TopDownClustering-Running"></a>
+## Running
+
+<a name="TopDownClustering-JavaAPI"></a>
+### Java API
+
+*ClusterOutputPostProcessorDriver* has a run method
+
+*run(Path input, Path output, boolean runSequential)*
+
+The input parameter of the run method is _the output path that was provided
+to the clustering algorithm_, which is to be post-processed. It is the
+path of the directory containing clusters-*-final and clusteredPoints.
+
+The output parameter of the run method is _the path where the post-processed
+data will be stored_.
+
+The runSequential parameter of the run method controls execution: _if set to
+true, the post-processing runs sequentially; otherwise it uses MapReduce_.
+Hint: if the clustering was run sequentially, run the post-processing
+sequentially as well, and vice versa.
+
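+A minimal sketch of calling this API from Java (the package name and exact
+import shown here are assumptions; check your Mahout version for the actual
+location of the class):
+
+    import org.apache.hadoop.fs.Path;
+    // assumed location of the driver class; adjust to your Mahout version
+    import org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver;
+
+    public class PostProcessTopLevelClusters {
+      public static void main(String[] args) throws Exception {
+        Path clusteringOutput = new Path("top");            // output of the top-level clustering
+        Path postProcessedOutput = new Path("bottom/data"); // where grouped vectors will be written
+        boolean runSequential = true;                       // match how the clustering itself was run
+        ClusterOutputPostProcessorDriver.run(clusteringOutput, postProcessedOutput, runSequential);
+      }
+    }
+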
+<a name="TopDownClustering-CommandLineInvocation"></a>
+### Command Line Invocation
+
+The following script illustrates the use of the CLI interface to run a
+top-down clustering example based upon the Synthetic Control dataset. This
+example uses K-means to cluster the data set. Other algorithms can also be
+used as described above. To use this script, download the dataset into a
+<testdata> directory or run the examples/bin/cluster-syntheticcontrol.sh.
+
+
+    unset HADOOP_HOME
+    unset HADOOP_CONF_DIR
+    rm -rf data top bottom
+    ./bin/mahout org.apache.mahout.clustering.conversion.InputDriver -i <testdata> -o data
+    ./bin/mahout kmeans -i data -o top -c top/clusters-0 -k 5 -xm sequential -ow -cl -x 5
+    ./bin/mahout clusterdump -s top/clusters-*-final/
+    mkdir bottom
+    ./bin/mahout clusterpp -i top -o bottom/data -xm sequential
+    for x in `ls bottom/data $1`; do
+    echo
+    ./bin/mahout kmeans -i bottom/data/$x -o bottom/$x -c bottom/$x/clusters-0 -k 5 -xm sequential -ow -cl -x 5;
+    ./bin/mahout clusterdump -s bottom/$x/clusters-*-final/;
+    done
+
+
+
+![Top Down Clustering](attachments/27832740/28016887.jpg)

Added: mahout/site/trunk/content/traveling-salesman.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/traveling-salesman.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/traveling-salesman.mdtext (added)
+++ mahout/site/trunk/content/traveling-salesman.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,24 @@
+Title: Traveling Salesman
+<a name="TravelingSalesman-Intro"></a>
+# Intro
+
+The Traveling Salesman Problem (TSP) is a classic computer science question
+classified as NP-Hard.	See
+http://en.wikipedia.org/wiki/Travelling_salesman_problem for background
+information.
+
+<a name="TravelingSalesman-EvolutionaryExample"></a>
+# Evolutionary Example
+
+As an example of evolutionary programming, Mahout has an example
+implementation that attempts to solve TSP.  It should be noted that the
+implementation of evolutionary programming in Mahout has not been
+maintained for some time and may be removed in the future due to lack of
+interest.
+
+To run the example, do:
+
+1. cd <MAHOUT_HOME>
+1. mvn install
+1. Run the Job: <pre><code>./bin/mahout org.apache.mahout.ga.watchmaker.travellingsalesman.TravellingSalesman</code></pre>

Added: mahout/site/trunk/content/twenty-newsgroups.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/twenty-newsgroups.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/twenty-newsgroups.mdtext (added)
+++ mahout/site/trunk/content/twenty-newsgroups.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,135 @@
+Title: Twenty Newsgroups
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 Newsgroups data set is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the Mahout Bayes
+classifier to create a model that classifies a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+## Prerequisites
+
+* Mahout has been downloaded ([instructions here](http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup))
+* Maven is available
+* Your environment has the following variables:
+<table>
+<tr><td> *HADOOP_HOME* </td><td> Environment variable that refers to where Hadoop lives </td></tr>
+<tr><td> *MAHOUT_HOME* </td><td> Environment variable that refers to where Mahout lives </td></tr>
+</table>
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+## Instructions for running the example
+
+1. Start the hadoop daemons by executing the following commands
+
+    $ cd $HADOOP_HOME/bin
+    $ ./start-all.sh
+
+1. In the trunk directory of mahout, compile everything and create the
+mahout job:
+
+    $ cd $MAHOUT_HOME
+    $ mvn install
+
+1. Run the 20 newsgroup example by executing the script as below
+
+    $ ./examples/bin/build-20news-bayes.sh
+
+After MAHOUT-857 is committed (available when 0.6 is released), the command
+will be:
+
+    $ ./examples/bin/classify-20newsgroups.sh
+
+This later version allows you to also try out running Stochastic Gradient
+Descent (SGD) on the same data.
+
+The script performs the following:
+1. Downloads *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz)
+1. Extracts dataset
+1. Generates input dataset for training classifier
+1. Generates input dataset for testing classifier
+1. Trains the classifier
+1. Tests the classifier
+
+Output might look like:
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   <--Classified as
+    381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3   0   0   |  398  a = rec.motorcycles
+    1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4   9   0   |  395  b = comp.windows.x
+    2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2   0   0   |  376  c = talk.politics.mideast
+    4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2   0   0   |  364  d = talk.politics.guns
+    7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0   0   0   |  251  e = talk.religion.misc
+    10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11  0   0   |  396  f = rec.autos
+    0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3   0   0   |  397  g = rec.sport.baseball
+    1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2   0   0   |  399  h = rec.sport.hockey
+    2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12  7   0   |  385  i = comp.sys.mac.hardware
+    0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2   0   0   |  394  j = sci.space
+    0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13  18  0   |  392  k = comp.sys.ibm.pc.hardware
+    8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1   0   0   |  310  l = talk.politics.misc
+    0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7   8   0   |  389  m = comp.graphics
+    6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8   6   0   |  393  n = sci.electronics
+    2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1   2   0   |  398  o = soc.religion.christian
+    4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9   0   0   |  396  p = sci.med
+    0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1   1   0   |  396  q = sci.crypt
+    10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0   1   0   |  319  r = alt.atheism
+    4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341 2   0   |  390  s = misc.forsale
+    8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3   256 0   |  394  t = comp.os.ms-windows.misc
+    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   |  0    u = unknown
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## Complementary Naive Bayes
+
+To Train a CBayes Classifier using bi-grams
+
+    $> $MAHOUT_HOME/bin/mahout trainclassifier \
+      -i 20news-input \
+      -o newsmodel \
+      -type cbayes \
+      -ng 2 \
+      -source hdfs
+
+
+To Test a CBayes Classifier using bi-grams
+
+    $> $MAHOUT_HOME/bin/mahout testclassifier \
+      -m newsmodel \
+      -d 20news-input \
+      -type cbayes \
+      -ng 2 \
+      -source hdfs \
+      -method mapreduce
+

Added: mahout/site/trunk/content/use-an-existing-hadoop-ami.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/use-an-existing-hadoop-ami.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/use-an-existing-hadoop-ami.mdtext (added)
+++ mahout/site/trunk/content/use-an-existing-hadoop-ami.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,292 @@
+Title: Use an Existing Hadoop AMI
+The following process was developed for launching Hadoop clusters in EC2 in
+order to benchmark Mahout's clustering algorithms using a large document
+set (see Mahout-588). Specifically, we used the ASF mail archives that have
+been parsed and converted to the Hadoop SequenceFile format
+(block-compressed) and saved to a public S3 folder:
+s3://asf-mail-archives/mahout-0.4/sequence-files. Overall, there are
+6,094,444 key-value pairs in 283 files taking around 5.7GB of disk.
+
+You can also use Amazon's Elastic MapReduce, see [Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce).
+However, using EC2 directly is slightly less expensive and provides
+greater visibility into the state of running jobs via the JobTracker Web
+UI. You can launch the EC2 cluster from your development machine; the
+following instructions were generated on an Ubuntu workstation. We assume that
+you have successfully completed the Amazon EC2 Getting Started Guide, see the
+[EC2 Getting Started Guide](http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/).
+
+Note, this work was supported in part by the Amazon Web Services Apache
+Projects Testing Program.
+
+<a name="UseanExistingHadoopAMI-LaunchHadoopCluster"></a>
+## Launch Hadoop Cluster
+
+<a name="UseanExistingHadoopAMI-GatherAmazonEC2keys/securitycredentials"></a>
+#### Gather Amazon EC2 keys / security credentials
+
+You will need the following:
+* AWS Account ID
+* Access Key ID
+* Secret Access Key
+* X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
+* EC2 Key-Pair (ssh public and private keys) for the US-EAST region.
+
+Please make sure the file permissions are "-rw-------" (e.g. chmod 600
+gsg-keypair.pem). You can create a key-pair for the US-East region using
+the Amazon console. If you are confused about any of these terms, please
+see: [Understanding Access Credentials for AWS/EC2](http://alestic.com/2009/11/ec2-credentials)
+.
+
+You should also export the EC2_PRIVATE_KEY and EC2_CERT environment
+variables to point to your AWS Certificate and Private Key files, for
+example:
+
+
+    export EC2_PRIVATE_KEY=$DEV/aws/pk-aws.pem
+    export EC2_CERT=$DEV/aws/cert-aws.pem
+
+
+These are used by the ec2-api-tools command to interact with Amazon Web
+Services.
+
+<a name="UseanExistingHadoopAMI-InstallandConfiguretheAmazonEC2APITools:"></a>
+#### Install and Configure the Amazon EC2 API Tools:
+
+On Ubuntu, you'll need to enable the multiverse repository in
+/etc/apt/sources.list to find the ec2-api-tools package.
+
+
+    apt-get update
+    apt-get install ec2-api-tools
+
+
+Once installed, verify you have access to EC2 by executing:
+
+
+    ec2-describe-images -x all | grep hadoop
+
+
+<a name="UseanExistingHadoopAMI-InstallHadoop0.20.2Locally"></a>
+#### Install Hadoop 0.20.2 Locally
+
+You need to install Hadoop locally in order to get access to the EC2
+cluster deployment scripts. We use */mnt/dev* as the base working
+directory because this process was originally conducted on an EC2 instance;
+be sure to replace this path with the correct path for your environment as
+you work through these steps.
+
+
+    sudo mkdir -p /mnt/dev/downloads
+    sudo chown -R ubuntu:ubuntu /mnt/dev
+    cd /mnt/dev/downloads
+    wget http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz && cd /mnt/dev && tar zxvf downloads/hadoop-0.20.2.tar.gz
+    ln -s hadoop-0.20.2 hadoop
+
+
+The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other
+approaches to deploying a Hadoop cluster on EC2, such as Cloudera's
+[CDH3](https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page).
+We chose to use the contrib/ec2 scripts because they are very easy to use
+provided there is an existing Hadoop AMI available.
+
+<a name="UseanExistingHadoopAMI-Edithadoop-ec2-env.sh"></a>
+#### Edit hadoop-ec2-env.sh 
+
+Open hadoop/src/contrib/ec2/bin/hadoop-ec2-env.sh in your editor and set
+the Amazon security variables to match your environment, for example:
+
+
+    AWS_ACCOUNT_ID=####-####-####
+    AWS_ACCESS_KEY_ID=???
+    AWS_SECRET_ACCESS_KEY=???
+    EC2_KEYDIR=/mnt/dev/aws
+    KEY_NAME=gsg-keypair
+    PRIVATE_KEY_PATH=/mnt/dev/aws/gsg-keypair.pem
+
+
+The value of PRIVATE_KEY_PATH should be your EC2 key-pair pem file, such as
+/mnt/dev/aws/gsg-keypair.pem. This key-pair must be created in the US-East
+region.
+
+For Mahout, we recommended the following settings:
+
+
+    HADOOP_VERSION=0.20.2
+    S3_BUCKET=453820947548/bixolabs-public-amis
+    ENABLE_WEB_PORTS=true
+    INSTANCE_TYPE="m1.xlarge"
+
+
+You do not need to worry about changing any variables below the comment
+that reads "The following variables are only used when creating an AMI.".
+
+These settings will create a cluster of EC2 xlarge instances using the
+Hadoop 0.20.2 AMI provided by Bixo Labs.
+
+<a name="UseanExistingHadoopAMI-LaunchHadoopCluster"></a>
+#### Launch Hadoop Cluster
+
+
+    cd $HADOOP_HOME/src/contrib/ec2
+    bin/hadoop-ec2 launch-cluster mahout-clustering 2
+
+
+This will launch 3 xlarge instances (two workers + one for the NameNode aka
+"master"). It may take up to 5 minutes to launch a cluster named
+"mahout-clustering"; watch the console for errors. The cluster will launch
+in the US-East region so you won't incur any data transfer fees to/from
+US-Standard S3 buckets. You can re-use the cluster name for launching other
+clusters of different sizes. Behind the scenes, the Hadoop scripts will
+create two EC2 security groups that configure the firewall for accessing
+your Hadoop cluster.
+
+<a name="UseanExistingHadoopAMI-Launchproxy"></a>
+#### Launch proxy
+
+Assuming your cluster launched successfully, establish a SOCKS tunnel to
+your master node to access the JobTracker Web UI from your local browser.
+
+
+    bin/hadoop-ec2 proxy mahout-clustering &
+
+
+This command will output the URLs for the JobTracker and NameNode Web UI,
+such as:
+
+
+    JobTracker http://ec2-???-???-???-???.compute-1.amazonaws.com:50030
+
+
+<a name="UseanExistingHadoopAMI-SetupFoxyProxy(FireFoxplug-in)"></a>
+#### Setup FoxyProxy (FireFox plug-in)
+
+Once the FoxyProxy plug-in is installed in FireFox, go to Options >
+FoxyProxy Standard > Options to setup a proxy on localhost:6666 for the
+JobTracker and NameNode Web UI URLs from the previous step. For more
+information about FoxyProxy, please see: [FoxyProxy](http://getfoxyproxy.org/downloads.html)
+
+Now you are ready to run Mahout jobs in your cluster.
+
+<a name="UseanExistingHadoopAMI-LaunchClusteringJobfromMasterserver"></a>
+## Launch Clustering Job from Master server
+
+<a name="UseanExistingHadoopAMI-Logintothemasterserver:"></a>
+#### Login to the master server:
+
+
+    bin/hadoop-ec2 login mahout-clustering
+
+
+Hadoop does not start until all EC2 instances are running; look for Java
+processes on the master server using: ps waux | grep java
+
+<a name="UseanExistingHadoopAMI-InstallMahout"></a>
+#### Install Mahout
+
+Since this is EC2, you have the most disk space on the master node in /mnt.
+
+<a name="UseanExistingHadoopAMI-Fromadistribution"></a>
+##### From a distribution
+
+NOTE: Substitute in the appropriate version number/URLs as necessary.  0.4
+is not the latest version of Mahout.
+
+    mkdir -p /mnt/dev/downloads
+    cd /mnt/dev/downloads
+    wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz && cd /mnt/dev && tar zxvf downloads/mahout-distribution-0.4.tar.gz
+    ln -s mahout-distribution-0.4 mahout
+
+
+<a name="UseanExistingHadoopAMI-FromSource"></a>
+##### From Source
+
+
+    # Install Subversion (you can also use Git; substitute the appropriate URL)
+    > yum install subversion
+    > svn co http://svn.apache.org/repos/asf/mahout/trunk mahout/trunk
+    # Install Maven 3.x and put it in the path
+    > cd mahout/trunk
+    > mvn install     # optionally add -DskipTests
+
+
+<a name="UseanExistingHadoopAMI-ConfigureHadoop"></a>
+#### Configure Hadoop
+
+You'll want to increase the Max Heap Size for the data nodes
+(mapred.child.java.opts) and set the correct number of reduce tasks based
+on the size of your cluster. 
+
+
+    vi $HADOOP_HOME/conf/hadoop-site.xml
+
+
+(NOTE: if this file doesn't exist yet, then the cluster nodes are still
+starting up. Wait a few minutes and then try again.)
+
+Add the following properties:
+
+
+    <!-- Change 6 to the correct number for your cluster -->
+    <property>
+      <name>mapred.reduce.tasks</name>
+      <value>6</value>
+    </property>
+    
+    <property>
+      <name>mapred.child.java.opts</name>
+      <value>-Xmx4096m</value>
+    </property>
+
+
+You can safely run 3 reducers per node on EC2 xlarge instances with 4GB of
+max heap each. If you are using large instances, then you may be able to
+have 2 per node or only 1 if your jobs are CPU intensive.
+
+<a name="UseanExistingHadoopAMI-CopythevectorsfromS3toHDFS"></a>
+#### Copy the vectors from S3 to HDFS
+
+Use Hadoop's distcp command to copy the vectors from S3 to HDFS.
+
+
+    hadoop distcp -Dmapred.task.timeout=1800000 \
+    s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
+    /asf-mail-archives/mahout-0.4/tfidf-vectors
+
+
+The files are stored in the US-Standard S3 bucket so there is no charge for
+data transfer to your EC2 cluster, as it is running in the US-EAST region.
+
+<a name="UseanExistingHadoopAMI-Launchtheclusteringjob(fromthemasterserver)"></a>
+#### Launch the clustering job (from the master server)
+
+
+    cd /mnt/dev/mahout
+    bin/mahout kmeans -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
+      -c /asf-mail-archives/mahout-0.4/initial-clusters/ \
+      -o /asf-mail-archives/mahout-0.4/kmeans-clusters/ \
+      --numClusters 100 \
+      --maxIter 10 \
+      --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
+      --convergenceDelta 0.01 &
+
+  
+You can monitor the job using the JobTracker Web UI through FoxyProxy.
+
+<a name="UseanExistingHadoopAMI-DumpClusters"></a>
+#### Dump Clusters
+
+Once completed, you can view the results using Mahout's cluster dumper
+
+
+    bin/mahout clusterdump \
+      --seqFileDir /asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
+      --numWords 20 \
+      --dictionary s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0 \
+      --dictionaryType sequencefile --output clusters.txt --substring 100
+

Added: mahout/site/trunk/content/using-mahout-with-python-via-jpype.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/using-mahout-with-python-via-jpype.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/using-mahout-with-python-via-jpype.mdtext (added)
+++ mahout/site/trunk/content/using-mahout-with-python-via-jpype.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,228 @@
+Title: Using Mahout with Python via JPype
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# overview
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython as opposed to Jython.
+Jython wasn't an option for me, because (to the best of my knowledge)
+Jython doesn't work with the Python extensions numpy, matplotlib, or h5py,
+which I rely on heavily.
+
+The instructions below explain how to setup a python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for python.
+
+The first step in setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your python script, create a global variable with the path to this dll
+
+
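+For example (this path is only an illustration; the actual location depends
+on your operating system and JVM installation):
+
+    # path to the JVM dynamic library; adjust for your system
+    jvmlib="/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/server/libjvm.so"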
+
+Next we need to figure out how to set the classpath for Mahout. The
+easiest way to do this is to edit the script in "bin/mahout" to print out
+the classpath. Add the line "echo $CLASSPATH" to the script somewhere after
+the comment "run it" (this is line 195 or so). Execute the script to print
+out the classpath.  Copy this output and paste it into a variable named
+classpath in your Python script. The result for me looks like the following
+
+
+
+
+Now we can create a function to start the JVM in Python using JPype:
+
+    import numpy as np
+    from jpype import startJVM, JPackage, JArray, JDouble, JObject
+
+    jvm=None
+    def start_jpype():
+     global jvm
+     if (jvm is None):
+      cpopt="-Djava.class.path={cp}".format(cp=classpath)
+      startJVM(jvmlib,"-ea",cpopt)
+      jvm="started"
+
+
+
+<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files which will contain vectors to
+be used by Mahout for kmeans. The example below is a function which creates
+vectors from two Gaussian distributions with unit variance.
+
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed
+    	ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1],[-1,-1]],np.int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(np.str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((np.str,4))+"_"+pts[row,1].astype((np.str,4))
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path, io.Text,VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly we can use JPype to easily read the clustered points outputted by
+mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):	
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[1]=="NamedVector"):
+       print "cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally we can create a function to print out the actual cluster centers
+found by mahout,
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters outputted by kmenas
+    	   ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):	
+      center=vecwritable.getCenter()
+      
+      print "id={cid}
+center={center}".format(cid=vecwritable.getId(),center=center.values)
+      pass
+

Added: mahout/site/trunk/content/version-control.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/version-control.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/version-control.mdtext (added)
+++ mahout/site/trunk/content/version-control.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,36 @@
+Title: Version Control
+The Mahout source code resides in the [Apache Subversion (SVN)](http://subversion.tigris.org/)
+repository. The command-line SVN client can be obtained
+[here](http://subversion.tigris.org/project_packages.html). The TortoiseSVN GUI
+client for Windows can be obtained [here](http://tortoisesvn.tigris.org/).
+There are also SVN plugins available for both
+[Eclipse](http://subclipse.tigris.org/) and
+[IntelliJ IDEA](http://svnup.tigris.org/).
+
+There is also a [git](http://git-scm.com/) repository for Mahout available
+[at Apache](http://git.apache.org/).
+
+<a name="VersionControl-WebAccess(read-only)"></a>
+## Web Access (read-only)
+
+The source code can be browsed via the Web at
+[http://svn.apache.org/viewvc/mahout/](http://svn.apache.org/viewvc/mahout/).
+No SVN client software is required.
+
+<a name="VersionControl-AnonymousAccess(read-only)"></a>
+## Anonymous Access (read-only)
+
+The SVN URL for anonymous users is
+[http://svn.apache.org/repos/asf/mahout/trunk](http://svn.apache.org/repos/asf/mahout/trunk).
+Instructions for anonymous SVN access are here.
+
+<a name="VersionControl-CommitterAccess(read-write)"></a>
+## Committer Access (read-write)
+
+The SVN URL for committers is
+[https://svn.apache.org/repos/asf/mahout/trunk](https://svn.apache.org/repos/asf/mahout/trunk).
+Instructions for committer SVN access are
+[here](https://cwiki.apache.org/confluence/display/MAHOUT/IssueTracker).
+
+<a name="VersionControl-Issues"></a>
+## Issues
+
+All bugs, improvements, patches, etc. should be logged in
+[JIRA](http://issues.apache.org/jira/browse/MAHOUT).

Added: mahout/site/trunk/content/viewing-result.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/viewing-result.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/viewing-result.mdtext (added)
+++ mahout/site/trunk/content/viewing-result.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,10 @@
+Title: Viewing Result
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages
+{pagetree:root=@self|excerpt=true|expandCollapseAll=true}

Added: mahout/site/trunk/content/viewing-results.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/viewing-results.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/viewing-results.mdtext (added)
+++ mahout/site/trunk/content/viewing-results.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,44 @@
+Title: Viewing Results
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of various jobs.
+ The page is organized by algorithms.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
+          --seqFileDir ./solr-clust-n2/out/clusters-2 \
+          --dictionary ./solr-clust-n2/dictionary.txt \
+          --substring 100 --pointsDir ./solr-clust-n2/out/points/
+    
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

Added: mahout/site/trunk/content/visualize-classification-results.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/visualize-classification-results.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/visualize-classification-results.mdtext (added)
+++ mahout/site/trunk/content/visualize-classification-results.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,2 @@
+Title: Visualize Classification Results
+Lorem whatsit.

Added: mahout/site/trunk/content/visualizing-sample-clusters.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/visualizing-sample-clusters.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/visualizing-sample-clusters.mdtext (added)
+++ mahout/site/trunk/content/visualizing-sample-clusters.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,76 @@
+Title: Visualizing Sample Clusters
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples to visualize the sample clusters that get created by
+various clustering algorithms such as:
+* Canopy Clustering
+* Dirichlet Process
+* KMeans
+* Fuzzy KMeans
+* MeanShift Canopy
+* Spectral KMeans
+* MinHash
+
+<a name="VisualizingSampleClusters-Note"></a>
+##### Note
+These are Swing programs. You have to be in a window system on the same
+machine you run these, or logged in via a "remote desktop" or VNC program.
+
+<a name="VisualizingSampleClusters-Pre-Prep"></a>
+# Pre - Prep
+
+For visualizing the clusters, you would just have to execute the Java
+classes under org.apache.mahout.clustering.display package in
+mahout-examples module. If you are using eclipse, setup mahout-examples as
+a project as specified in [Working with Maven in Eclipse](buildingmahout#mahout_maven_eclipse.html)
+.
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in org.apache.mahout.clustering.display can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over them:
+1. DisplayClustering - generates 1000 samples from three symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the generateSamples() method to change the sample points used by these
+programs.
+1. DisplayClustering - displays initial areas of generated points
+1. DisplayDirichlet - uses Dirichlet Process clustering
+1. DisplayCanopy - uses Canopy clustering
+1. DisplayKMeans - uses k-Means clustering
+1. DisplayFuzzyKMeans - uses Fuzzy k-Means clustering
+1. DisplayMeanShift - uses MeanShift clustering
+1. DisplaySpectralKMeans - uses Spectral KMeans via map-reduce algorithm
+
+If you are using Eclipse and have set it up as specified in Pre-Prep, just
+right-click on each of the classes mentioned above and choose "Run As -
+Java Application". To run these directly from the command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+    # substitute other names above for DisplayClustering
+    # Note: the DisplaySpectralKMeans program runs a Hadoop job that takes 3
+    # minutes on a laptop. Set MAVEN_OPTS (e.g. -Xmx300m) to give the program
+    # enough memory. You may find that some of the other programs also need more memory.
+
+
+Note:
+* Some of these programs display the sample points and then superimpose all
+of the clusters from each iteration. The last iteration's clusters are in
+bold red and the previous several are colored (orange, yellow, green, blue,
+magenta) in order after which all earlier clusters are in light grey. This
+helps to visualize how the clusters converge upon a solution over multiple
+iterations.
+
+* By changing the parameter values (k, ALPHA_0, numIterations) and the
+display SIGNIFICANCE you can obtain different results.
+
+<a name="VisualizingSampleClusters-ScreenCaptureAnimation"></a>
+# Screen Capture Animation
+See [Sample Clusters Animation](sample-clusters-animation.html)
+for screen captures of all the above programs, and an animated gif.

Added: mahout/site/trunk/content/what-it-is-the-decision-forest-?-it-is-same-as-random-forest?.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/what-it-is-the-decision-forest-%3F-it-is-same-as-random-forest%3F.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/what-it-is-the-decision-forest-?-it-is-same-as-random-forest?.mdtext (added)
+++ mahout/site/trunk/content/what-it-is-the-decision-forest-?-it-is-same-as-random-forest?.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,2 @@
+Title: what it is the decision forest ? It is same as random forest?
+What is the decision forest? Is it the same as a random forest?

Added: mahout/site/trunk/content/who-we-are.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/who-we-are.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/who-we-are.mdtext (added)
+++ mahout/site/trunk/content/who-we-are.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,58 @@
+Title: Who We Are
+<a name="WhoWeAre-Whoweare"></a>
+# Who we are
+
+Apache Mahout is maintained by a team of volunteer developers.
+
+<a name="WhoWeAre-CoreCommitters"></a>
+## Core Committers
+
+<table>
+<tr><th> Name </th><th> Mail </th><th> Webpage </th><th> PMC </th><th> A short summary of yourself </th></tr>
+<tr><td> Isabel Drost </td><td> isabel@... </td><td> [Homepage](http://isabel-drost.de), [Blog](http://blog.isabel-drost.de) </td><td> Yes </td><td> Passion for free software (development, but to some extent also
+the political and economic implications), interested in agile development
+and project management, lives in Germany. Follow me on Twitter @MaineC </td></tr>
+<tr><td> Ted Dunning </td><td> tdunning@... </td><td> [MapR Technologies](http://www.mapr.com)
+ </td><td> Yes </td><td> </td></tr>
+<tr><td> Jeff Eastman </td><td> jeastman@... </td><td> [Windward Solutions](http://www.windwardsolutions.com/)
+ </td><td> Yes </td><td> </td></tr>
+<tr><td> Drew Farris </td><td> drew@... </td><td> </td><td> Yes </td><td> </td></tr>
+  
+  
+<tr><td> Sean Owen </td><td> srowen@... </td><td> </td><td> Yes </td><td> </td></tr>
+  
+  
+  
+  
+  
+  
+  
+  
+  
+  
+<tr><td> Dmitriy Lyubimov </td><td> dlyubimov@... </td><td> [LinkedIn](http://www.linkedin.com/in/dlyubimov)
+ </td><td> Yes </td><td> Twitter: @dlieuOfTwit </td></tr>
+</table>
+
+
+<a name="WhoWeAre-EmeritusCommitters"></a>
+## Emeritus Committers
+
+* Niranjan Balasubramanian (nbalasub@...)
+* Otis Gospodnetic (otis@...)
+* David Hall (dlwh@...)
+* Erik Hatcher (ehatcher@...)
+* Ozgur Yilmazel (oyilmazel@...)
+* Dawid Weiss (dweiss@...)
+* Karl Wettin (kalle@...)
+* AbdelHakim Deneche (adeneche@...)
+
+Note that the email addresses above end with @apache.org.
+
+<a name="WhoWeAre-Contributors"></a>
+## Contributors
+
+Apache Mahout contributors and their contributions are listed at Apache [JIRA](http://issues.apache.org/jira/secure/ConfigureReport.jspa?versionId=-1&issueStatus=all&selectedProjectId=12310751&reportKey=com.sourcelabs.jira.plugin.report.contributions%3Acontributionreport&Next=Next)
+.

Added: mahout/site/trunk/content/wikipedia-bayes-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/trunk/content/wikipedia-bayes-example.mdtext?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/content/wikipedia-bayes-example.mdtext (added)
+++ mahout/site/trunk/content/wikipedia-bayes-example.mdtext Thu Jul 12 09:25:54 2012
@@ -0,0 +1,40 @@
+Title: Wikipedia Bayes Example
+<a name="WikipediaBayesExample-Intro"></a>
+# Intro
+
+The Mahout Examples source comes with tools for classifying a Wikipedia
+data dump using either the Naive Bayes or Complementary Naive Bayes
+implementations in Mahout.  The example (described below) gets a Wikipedia
+dump and then splits it up into chunks.  These chunks are then further
+split by country.  From these splits, a classifier is trained to predict
+what country an unseen article should be categorized into.
+
+
+<a name="WikipediaBayesExample-Runningtheexample"></a>
+# Running the example
+
+1. Download the Wikipedia dataset [here](http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.html)
+1. Unzip the bz2 file to get enwiki-latest-pages-articles.xml.
+1. Create the directory $MAHOUT_HOME/examples/temp and copy the xml file
+into this directory.
+1. Chunk the data into pieces: <pre><code>$MAHOUT_HOME/bin/mahout
+wikipediaXMLSplitter -d
+$MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o
+wikipedia/chunks -c 64</code></pre>
+> We strongly suggest you back up the results to some other place so that
+> you don't have to repeat this step in case they get accidentally erased.
+1. The previous step creates the chunks in HDFS. Verify this by executing
+<pre><code>hadoop fs -ls wikipedia/chunks</code></pre> which lists the xml chunks as
+chunk-0001.xml and so on.
+1. Create the country-based split of the Wikipedia dataset:
+<pre><code>$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks
+-o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt</code></pre>
+1. Verify the creation of the input dataset by executing <pre><code>hadoop fs -ls
+wikipediainput</code></pre> and you should see a part-r-00000 file inside the
+wikipediainput directory.
+1. Train the classifier: <pre><code>$MAHOUT_HOME/bin/mahout trainclassifier -i
+wikipediainput -o wikipediamodel</code></pre> The model files will be available in
+the wikipediamodel folder in HDFS.
+1. Test the classifier: <pre><code>$MAHOUT_HOME/bin/mahout testclassifier -m
+wikipediamodel -d wikipediainput</code></pre> The complete command sequence is
+summarized in the sketch below.
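+
+For convenience, the commands above can be strung together into a single
+shell script. The following is only a rough sketch, assuming MAHOUT_HOME is
+set, Hadoop is available on the PATH, and the unzipped dump has already been
+copied to $MAHOUT_HOME/examples/temp; adjust paths and options to your setup.
+
+<pre><code>#!/bin/bash
+# Rough sketch of the workflow described above; paths are assumptions.
+set -e
+
+DUMP=$MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml
+
+# Split the dump into chunks in HDFS (-c 64 as used in the steps above)
+$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d "$DUMP" -o wikipedia/chunks -c 64
+
+# Create the country-based training input from the chunks
+$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks \
+  -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
+
+# Train the Naive Bayes model and evaluate it on the same input
+$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
+$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput
+</code></pre>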

Added: mahout/site/trunk/lib/path.pm
URL: http://svn.apache.org/viewvc/mahout/site/trunk/lib/path.pm?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/lib/path.pm (added)
+++ mahout/site/trunk/lib/path.pm Thu Jul 12 09:25:54 2012
@@ -0,0 +1,39 @@
+package path;
+
+# taken from django's urls.py
+
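+# Each entry is [ source-path regex, view name, view arguments ]; matching
+# paths are presumably dispatched to the corresponding view in view.pm, so
+# every .mdtext page is rendered with the single_narrative.html template.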
+our @patterns = (
+	[qr!\.mdtext$!, single_narrative => { template => "single_narrative.html" }],
+
+	[qr!/sitemap\.html$!, sitemap => { headers => { title => "Sitemap" }} ],
+
+) ;
+
+# for specifying interdependencies between files
+
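+# sitemap.html depends on every content page, so it is regenerated whenever
+# one of them changes; grep with s!^content!! strips the leading "content"
+# directory from each glob result and keeps entries where the substitution
+# matched.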
+our %dependencies = (
+    "/sitemap.html" => [ grep s!^content!!, glob "content/*.mdtext" ],
+);
+
+1;
+
+=head1 LICENSE
+
+           Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+
+             http://www.apache.org/licenses/LICENSE-2.0
+
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+

Added: mahout/site/trunk/lib/view.pm
URL: http://svn.apache.org/viewvc/mahout/site/trunk/lib/view.pm?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/lib/view.pm (added)
+++ mahout/site/trunk/lib/view.pm Thu Jul 12 09:25:54 2012
@@ -0,0 +1,23 @@
+package view;
+use base 'ASF::View'; # see https://svn.apache.org/repos/infra/websites/cms/build/lib/ASF/View.pm
+
+1;
+
+=head1 LICENSE
+
+           Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+
+             http://www.apache.org/licenses/LICENSE-2.0
+
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.

Added: mahout/site/trunk/templates/single_narrative.html
URL: http://svn.apache.org/viewvc/mahout/site/trunk/templates/single_narrative.html?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/templates/single_narrative.html (added)
+++ mahout/site/trunk/templates/single_narrative.html Thu Jul 12 09:25:54 2012
@@ -0,0 +1,3 @@
+{% extends "skeleton.html" %}
+{% block title %}{{ headers.title }}{% endblock %}
+{% block content %}{{ content|markdown }}{% endblock %}

Added: mahout/site/trunk/templates/skeleton.html
URL: http://svn.apache.org/viewvc/mahout/site/trunk/templates/skeleton.html?rev=1360593&view=auto
==============================================================================
--- mahout/site/trunk/templates/skeleton.html (added)
+++ mahout/site/trunk/templates/skeleton.html Thu Jul 12 09:25:54 2012
@@ -0,0 +1,77 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+  <head>
+ <link rel="stylesheet" href="styles/site.css" type="text/css" />
+        <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+    
+    <script type="text/javascript" language="javascript">
+      var hide = null;
+      var show = null;
+      var children = null;
+
+      function init() {
+        /* Search form initialization */
+        var form = document.forms['search'];
+        if (form != null) {
+          form.elements['domains'].value = location.hostname;
+          form.elements['sitesearch'].value = location.hostname;
+        }
+
+        /* Children initialization */
+        hide = document.getElementById('hide');
+        show = document.getElementById('show');
+        children = document.all != null ?
+                   document.all['children'] :
+                   document.getElementById('children');
+        if (children != null) {
+          children.style.display = 'none';
+          show.style.display = 'inline';
+          hide.style.display = 'none';
+        }
+      }
+
+      function showChildren() {
+        children.style.display = 'block';
+        show.style.display = 'none';
+        hide.style.display = 'inline';
+      }
+
+      function hideChildren() {
+        children.style.display = 'none';
+        show.style.display = 'inline';
+        hide.style.display = 'none';
+      }
+    </script>
+    <title>{% block title %}{% endblock %}</title>
+  </head>
+  <body>
+      <div id="PageContent">
+      <div class="pageheader">
+         <span class="pagetitle">Apache Mahout : {% block title %}{% endblock %}</span>      
+      </div>
+
+      <div class="pagecontent">
+        <div class="wiki-content">
+          {% block content %}{% endblock %}
+        </div>
+
+     </div>
+    </div>   
+<script type="text/javascript">
+
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-17359171-1']);
+  _gaq.push(['_setDomainName', 'none']);
+  _gaq.push(['_setAllowLinker', true]);
+  _gaq.push(['_trackPageview']);
+
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+  })();
+
+</script>
+  </body>
+</html>
+