Posted to commits@mahout.apache.org by ss...@apache.org on 2014/05/05 06:08:35 UTC
svn commit: r1592443 -
/mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext
Author: ssc
Date: Mon May 5 04:08:34 2014
New Revision: 1592443
URL: http://svn.apache.org/r1592443
Log:
MAHOUT-1480 Clean up website on 20 newsgroups
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext?rev=1592443&r1=1592442&r2=1592443&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/twenty-newsgroups.mdtext Mon May 5 04:08:34 2014
@@ -10,122 +10,162 @@ The 20 Newsgroups data set is a collecti
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. The 20 newsgroups collection has become a popular data set for
experiments in text applications of machine learning techniques, such as
-text classification and text clustering. We will use Mahout Bayes
-Classifier to create a model that would classify a new document into one of
+text classification and text clustering. We will use the [Mahout CBayes](http://mahout.apache.org/users/classification/bayesian.html)
+classifier to create a model that would classify a new document into one of
the 20 newsgroups.
<a name="TwentyNewsgroups-Prerequisites"></a>
-## Prerequisites
+### Prerequisites
* Mahout has been downloaded ([instructions here](http://apache.osuosl.org/mahout/))
* Maven is available
* Your environment has the following variables:
-<table>
-<tr><td> *HADOOP_HOME* </td><td> Environment variables refers to where Hadoop lives </td></tr>
-<tr><td> *MAHOUT_HOME* </td><td> Environment variables refers to where Mahout lives </td></tr>
-</table>
+ - **HADOOP_HOME** Environment variable pointing to where Hadoop lives
+ - **MAHOUT_HOME** Environment variable pointing to where Mahout lives
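As a minimal sketch, the two variables can be set like this — the install paths below are hypothetical; substitute wherever you unpacked each release:

```shell
# Hypothetical install locations -- adjust to wherever you unpacked each release.
export HADOOP_HOME=/usr/local/hadoop
export MAHOUT_HOME=/usr/local/mahout
# Putting both bin directories on the PATH lets you call hadoop and mahout directly.
export PATH="$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$PATH"
```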
<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
-## Instructions for running the example
+### Instructions for running the example
-1. Start the hadoop daemons by executing the following commands
+1. If running Hadoop in cluster mode, start the Hadoop daemons by executing the following commands:
- $ cd $HADOOP_HOME/bin
- $ ./start-all.sh
+ $ cd $HADOOP_HOME/bin
+ $ ./start-all.sh
+
+ Otherwise:
-1. In the trunk directory of mahout, compile everything and create the
-mahout job:
+ $ export MAHOUT_LOCAL=true
- $ cd $MAHOUT_HOME
- $ mvn install
+2. In the trunk directory of Mahout, compile and install Mahout:
-1. Run the 20 newsgroup example by executing:
+ $ cd $MAHOUT_HOME
+ $ mvn install
- $ ./examples/bin/classify-20newsgroups.sh
+3. Run the [20 newsgroup example script](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) by executing:
-The script performs the following
+ $ ./examples/bin/classify-20newsgroups.sh
-1. Asks you to select an classification algorithm: Complementary Naive Bayes, Naive Bayes or Stochastic Gradient Descent.
-2. Downloads *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz)
-3. Extracts dataset
-4. Generates input dataset for training classifier
-5. Generates input dataset for testing classifier
-6. Trains the classifier
-7. Tests the classifier
+4. You will be prompted to select a classification method:
+
+ 1. Complement Naive Bayes
+ 2. Naive Bayes
+ 3. Stochastic Gradient Descent
+Select 1 and the script will perform the following:
+
+1. Create a working directory for the dataset and all input/output.
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+3. Convert the full 20newsgroups dataset into a < Text, Text > sequence file.
+4. Convert and preprocess the dataset into a < Text, VectorWritable > sequence file containing term frequencies for each document.
+5. Split the preprocessed dataset into training and testing sets.
+6. Train the classifier.
+7. Test the classifier.
Output might look like:
+
=======================================================
Confusion Matrix
-------------------------------------------------------
-  a b c d e f g h i j k l m n o p q r s t u <--Classified as
-  381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 0 | 398 a = rec.motorcycles
-  1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 1 6 0 4 9 0 | 395 b = comp.windows.x
-  2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 0 | 376 c = talk.politics.mideast
-  4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 0 | 364 d = talk.politics.guns
-  7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 0 | 251 e = talk.religion.misc
-  10 0 0 0 0 359 2 2 0 1 3 0 1 6 0 1 0 0 11 0 0 | 396 f = rec.autos
-  0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 0 3 0 0 | 397 g = rec.sport.baseball
-  1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 0 | 399 h = rec.sport.hockey
-  2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 0 | 385 i = comp.sys.mac.hardware
-  0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 0 | 394 j = sci.space
-  0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18 0 | 392 k = comp.sys.ibm.pc.hardware
-  8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 0 | 310 l = talk.politics.misc
-  0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 0 | 389 m = comp.graphics
-  6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 0 | 393 n = sci.electronics
-  2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 0 | 398 o = soc.religion.christian
-  4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 0 | 396 p = sci.med
-  0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 0 | 396 q = sci.crypt
-  10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 0 | 319 r = alt.atheism
-  4 0 0 0 0 9 1 1 8 1 12 0 3 6 0 2 0 0 341 2 0 | 390 s = misc.forsale
-  8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256 0 | 394 t = comp.os.ms-windows.misc
-  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 u = unknown
+ a b c d e f g h i j k l m n o p q r s t <--Classified as
+ 381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 | 398 a = rec.motorcycles
+ 1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 1 6 0 4 9 | 395 b = comp.windows.x
+ 2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 | 376 c = talk.politics.mideast
+ 4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 | 364 d = talk.politics.guns
+ 7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 | 251 e = talk.religion.misc
+ 10 0 0 0 0 359 2 2 0 1 3 0 1 6 0 1 0 0 11 0 | 396 f = rec.autos
+ 0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 0 3 0 | 397 g = rec.sport.baseball
+ 1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 | 399 h = rec.sport.hockey
+ 2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 | 385 i = comp.sys.mac.hardware
+ 0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 | 394 j = sci.space
+ 0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18 | 392 k = comp.sys.ibm.pc.hardware
+ 8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 | 310 l = talk.politics.misc
+ 0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 | 389 m = comp.graphics
+ 6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 | 393 n = sci.electronics
+ 2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 | 398 o = soc.religion.christian
+ 4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 | 396 p = sci.med
+ 0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 | 396 q = sci.crypt
+ 10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 | 319 r = alt.atheism
+ 4 0 0 0 0 9 1 1 8 1 12 0 3 6 0 2 0 0 341 2 | 390 s = misc.forsale
+ 8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256 | 394 t = comp.os.ms-windows.misc
+ =======================================================
+ Statistics
+ -------------------------------------------------------
+ Kappa 0.8808
+ Accuracy 90.8596%
+ Reliability 86.3632%
+ Reliability (standard deviation) 0.2131
+
+
+
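The Accuracy figure in the statistics block is just the confusion matrix's diagonal (correct classifications) divided by its total. A small sketch of that arithmetic, using a hypothetical 3x3 matrix in place of the 20x20 one above:

```shell
# Accuracy = sum of diagonal entries / sum of all entries.
# The 3x3 matrix here is made up purely for illustration.
acc=$(printf '8 1 1\n0 9 1\n2 0 8\n' | awk '
  { for (i = 1; i <= NF; i++) { total += $i; if (i == NR) correct += $i } }
  END { printf "accuracy %.2f%%\n", 100 * correct / total }')
echo "$acc"
```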
<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
-## Complementary Naive Bayes
+## End-to-end commands to build a CBayes model for 20 Newsgroups
+The [20 newsgroup example script](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script:
-To Train a CBayes Classifier using bi-grams
+*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH***
- $> $MAHOUT_HOME/bin/mahout trainclassifier \
- -i 20news-input \
- -o newsmodel \
- -type cbayes \
- -ng 2 \
- -source hdfs
-
-
-To Test a CBayes Classifier using bi-grams
-
- $> $MAHOUT_HOME/bin/mahout testclassifier \
- -m newsmodel \
- -d 20news-input \
- -type cbayes \
- -ng 2 \
- -source hdfs \
- -method mapreduce
+1. Create a working directory for the dataset and all input/output.
+
+ $ export WORK_DIR=/tmp/mahout-work-${USER}
+ $ mkdir -p ${WORK_DIR}
+
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory.
+
+        $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
+            -o ${WORK_DIR}/20news-bydate.tar.gz
+ $ mkdir -p ${WORK_DIR}/20news-bydate
+ $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
+ $ mkdir ${WORK_DIR}/20news-all
+ $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
+    * If you're running on a Hadoop cluster:
+
+ $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
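The `cp -R ${WORK_DIR}/20news-bydate/*/*` line above pools the archive's pre-made train/test halves into one directory per newsgroup (the script makes its own random split later). A toy illustration of that merge, using a temp directory and fabricated file names:

```shell
# Toy stand-in for the 20news-bydate layout: one category, one doc per half.
WORK=$(mktemp -d)
mkdir -p "$WORK/20news-bydate/20news-bydate-train/rec.motorcycles" \
         "$WORK/20news-bydate/20news-bydate-test/rec.motorcycles"
echo "train doc" > "$WORK/20news-bydate/20news-bydate-train/rec.motorcycles/0001"
echo "test doc"  > "$WORK/20news-bydate/20news-bydate-test/rec.motorcycles/0002"
# The glob crosses both the -train and -test directories, so each category
# directory in 20news-all ends up holding documents from both halves.
mkdir -p "$WORK/20news-all"
cp -R "$WORK"/20news-bydate/*/* "$WORK/20news-all"
ls "$WORK/20news-all/rec.motorcycles"
```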
+
+3. Convert the full 20newsgroups dataset into a < Text, Text > sequence file.
+
+        $ mahout seqdirectory \
+            -i ${WORK_DIR}/20news-all \
+            -o ${WORK_DIR}/20news-seq -ow
+
+4. Convert and preprocess the dataset into a < Text, VectorWritable > sequence file containing term frequencies for each document.
+
+        $ mahout seq2sparse \
+            -i ${WORK_DIR}/20news-seq \
+            -o ${WORK_DIR}/20news-vectors \
+            -lnorm \
+            -nv \
+            -wt tfidf
+
+If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bi-grams or -n 2 for L2 length normalization. See [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) for a list of all seq2sparse options.
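To illustrate what the -ng 2 option adds: bi-grams are simply adjacent word pairs emitted alongside the single terms. The toy loop below shows the pairing; it is not Mahout's actual tokenizer (which is Lucene-based):

```shell
# Emit each adjacent pair of words from the arguments -- what "bi-gram" means.
bigrams() {
  while [ "$#" -ge 2 ]; do
    echo "$1 $2"
    shift
  done
}
bigrams the quick brown fox
```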
+
+5. Split the preprocessed dataset into training and testing sets.
+
+        $ mahout split \
+            -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
+            --trainingOutput ${WORK_DIR}/20news-train-vectors \
+            --testOutput ${WORK_DIR}/20news-test-vectors \
+            --randomSelectionPct 40 \
+            --overwrite --sequenceFiles -xm sequential
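A rough sketch of what --randomSelectionPct 40 does: hold out about 40% of the records for testing and keep the rest for training. A fixed modulus stands in for random selection below so the outcome is reproducible:

```shell
# Route ids 1..10 into train/test files; id % 5 < 2 plays the role of the
# "selected with 40% probability" coin flip.
tmp=$(mktemp -d)
seq 1 10 | while read -r id; do
  if [ $((id % 5)) -lt 2 ]; then
    echo "$id" >> "$tmp/test-set"
  else
    echo "$id" >> "$tmp/train-set"
  fi
done
wc -l "$tmp/train-set" "$tmp/test-set"
```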
+
+6. Train the classifier.
+
+        $ mahout trainnb \
+            -i ${WORK_DIR}/20news-train-vectors -el \
+            -o ${WORK_DIR}/model \
+            -li ${WORK_DIR}/labelindex \
+            -ow \
+            -c
+
+7. Test the classifier.
+
+        $ mahout testnb \
+            -i ${WORK_DIR}/20news-test-vectors \
+            -m ${WORK_DIR}/model \
+            -l ${WORK_DIR}/labelindex \
+            -ow \
+            -o ${WORK_DIR}/20news-testing \
+            -c
+
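The train and test steps above can be sketched as one small guarded script. With DRY_RUN=1 the commands are only printed, so the sequence can be sanity-checked without Hadoop or Mahout installed; the flags are copied from the steps above, but the run() helper is a hypothetical convenience, not part of Mahout:

```shell
# Print-or-execute wrapper: set DRY_RUN=1 to preview the command sequence.
WORK_DIR=${WORK_DIR:-/tmp/mahout-work-$USER}
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@" || exit 1   # stop on the first failing step
  fi
}

DRY_RUN=1
run mahout trainnb -i "$WORK_DIR/20news-train-vectors" -el \
    -o "$WORK_DIR/model" -li "$WORK_DIR/labelindex" -ow -c
run mahout testnb -i "$WORK_DIR/20news-test-vectors" -m "$WORK_DIR/model" \
    -l "$WORK_DIR/labelindex" -ow -o "$WORK_DIR/20news-testing" -c
```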
+
\ No newline at end of file