Posted to commits@mahout.apache.org by co...@apache.org on 2011/10/16 23:14:00 UTC
[CONF] Apache Mahout > Twenty Newsgroups
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Twenty Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups)
Change Comment:
---------------------------------------------------------------------
Removed the steps for running the example manually. Added the steps for running the example using scripts.
Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h2. Twenty Newsgroups Classification Example
h2. Introduction
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use the Mahout Bayes classifier to create a model that classifies a new document into one of the 20 newsgroups.
h2. Prerequisites
* Mahout has been downloaded ([instructions here|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup])
* Maven is available
* Your environment has the following variables:
| {{HADOOP_HOME}} | Environment variable that refers to where Hadoop lives |
| {{MAHOUT_HOME}} | Environment variable that refers to where Mahout lives |
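Before starting, it may help to confirm that both variables point at real directories. The following is a minimal sketch, not part of Mahout or Hadoop: {{check_dir_var}} is a hypothetical helper name, and the check assumes each variable should name an existing directory.

```shell
# check_dir_var is a hypothetical helper, not shipped with Mahout or Hadoop.
# It reports whether the named environment variable is set and names an
# existing directory, and returns non-zero otherwise.
check_dir_var() {
    name=$1
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        return 1
    fi
    echo "$name=$value"
    return 0
}

# Check the two variables listed in the table above; keep going on failure
# so that both results are reported.
check_dir_var HADOOP_HOME || true
check_dir_var MAHOUT_HOME || true
```

If either line reports a problem, export the variable in your shell profile before continuing.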
h2. Instructions for running the example
# Start the Hadoop daemons by executing the following commands:
{noformat}
$ cd $HADOOP_HOME/bin
$ ./start-all.sh
{noformat}
# In the Mahout trunk directory, compile everything and create the Mahout job:
{noformat}
$ cd $MAHOUT_HOME
$ mvn install
{noformat}
# Run the 20 newsgroups example by executing the following script:
{noformat}
$ ./examples/bin/build-20news-bayes.sh
{noformat}
The script performs the following steps:
## Downloads {{20news-bydate.tar.gz}} from the [20newsgroups dataset|http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz]
## Extracts the dataset
## Generates the input dataset for training the classifier
## Generates the input dataset for testing the classifier
## Trains the classifier
## Tests the classifier
Output might look like:
{noformat}
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t u <--Classified as
381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 0 | 398 a = rec.motorcycles
1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 1 6 0 4 9 0 | 395 b = comp.windows.x
2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 0 | 376 c = talk.politics.mideast
4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 0 | 364 d = talk.politics.guns
7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 0 | 251 e = talk.religion.misc
10 0 0 0 0 359 2 2 0 1 3 0 1 6 0 1 0 0 11 0 0 | 396 f = rec.autos
0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 0 3 0 0 | 397 g = rec.sport.baseball
1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 0 | 399 h = rec.sport.hockey
2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 0 | 385 i = comp.sys.mac.hardware
0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 0 | 394 j = sci.space
0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18 0 | 392 k = comp.sys.ibm.pc.hardware
8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 0 | 310 l = talk.politics.misc
0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 0 | 389 m = comp.graphics
6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 0 | 393 n = sci.electronics
2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 0 | 398 o = soc.religion.christian
4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 0 | 396 p = sci.med
0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 0 | 396 q = sci.crypt
10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 0 | 319 r = alt.atheism
4 0 0 0 0 9 1 1 8 1 12 0 3 6 0 2 0 0 341 2 0 | 390 s = misc.forsale
8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256 0 | 394 t = comp.os.ms-windows.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 u = unknown
{noformat}
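Assuming each row of the matrix is an actual newsgroup and each column is the newsgroup a document was classified as, the diagonal entries are the correct classifications. In the first row above, for example, 381 of the 398 {{rec.motorcycles}} documents were classified correctly. The sketch below shows one way to compute per-class and overall accuracy from rows in that layout. {{summarize_confusion}} is a hypothetical helper, not a Mahout command, and the two-class sample data is made up for illustration.

```shell
# summarize_confusion is a hypothetical helper, not part of Mahout.
# It reads confusion-matrix rows of the form
#   <counts...> | <row total> <letter> = <label>
# on standard input and prints the share of correct classifications.
summarize_confusion() {
    awk '
        {
            # Locate the "|" separator; lines without one (such as the
            # column header) are skipped.
            bar = 0
            for (i = 1; i <= NF; i++) if ($i == "|") { bar = i; break }
            if (bar == 0) next

            row++                    # the n-th data row is the n-th class
            correct = $row           # diagonal entry of this row
            total = $(bar + 1)       # row total printed after the separator
            label = $(bar + 4)       # class name after "<letter> ="

            all_correct += correct
            all_total += total
            if (total > 0)
                printf "%s %d/%d %.1f%%\n", label, correct, total, 100 * correct / total
        }
        END {
            if (all_total > 0)
                printf "overall %d/%d %.1f%%\n", all_correct, all_total, 100 * all_correct / all_total
        }
    '
}

# Usage with a made-up two-class matrix:
printf '%s\n' \
    'a b <--Classified as' \
    '3 1 | 4 a = spam' \
    '0 4 | 4 b = ham' | summarize_confusion
# prints:
#   spam 3/4 75.0%
#   ham 4/4 100.0%
#   overall 7/8 87.5%
```

To apply it to a real run, save the matrix portion of the classifier output to a file and pipe that file into the function.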
h2. Complementary Naive Bayes
To train a CBayes classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout trainclassifier \
-i 20news-input \
-o newsmodel \
-type cbayes \
-ng 2 \
-source hdfs
{noformat}
To test a CBayes classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout testclassifier \
-m newsmodel \
-d 20news-input \
-type cbayes \
-ng 2 \
-source hdfs \
-method mapreduce
{noformat}