Posted to commits@mahout.apache.org by co...@apache.org on 2011/10/16 23:14:00 UTC

[CONF] Apache Mahout > Twenty Newsgroups

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Twenty Newsgroups (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups)

Change Comment:
---------------------------------------------------------------------
removed the steps for running the example manually. Added information on the steps to run the example using scripts

Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h2. Twenty Newsgroups Classification Example

h2. Introduction

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. We will use the Mahout Bayes classifier to build a model that assigns a new document to one of the 20 newsgroups.

h2. Prerequisites

* Mahout has been downloaded ([instructions here|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup])
* Maven is available
* Your environment has the following variables:
| {{HADOOP_HOME}} | Environment variable that points to the Hadoop installation directory |
| {{MAHOUT_HOME}} | Environment variable that points to the Mahout installation directory |
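
The prerequisites above can be checked before starting. The following is a minimal sketch, not part of Mahout: {{check_prereqs}} is a hypothetical helper name, and the check only verifies that both variables point to existing directories and that {{mvn}} is on the {{PATH}}.

```shell
#!/bin/sh
# Sketch: verify the prerequisites listed above.
# check_prereqs is a hypothetical helper name, not something shipped with Mahout.
check_prereqs() {
  status=0
  for var in HADOOP_HOME MAHOUT_HOME; do
    # Look up the variable whose name is stored in $var (POSIX sh compatible).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var is not set"
      status=1
    elif [ ! -d "$value" ]; then
      echo "missing: $var=$value is not a directory"
      status=1
    fi
  done
  # Maven is needed for the 'mvn install' step.
  if ! command -v mvn >/dev/null 2>&1; then
    echo "missing: mvn is not on the PATH"
    status=1
  fi
  return $status
}

if check_prereqs; then
  echo "prerequisites look fine"
fi
```

Each problem found is reported on its own line, so a failed run shows what still needs to be set up.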

h2. Instructions for running the example

# Start the Hadoop daemons by executing the following commands:
{noformat}
$ cd $HADOOP_HOME/bin
$ ./start-all.sh
{noformat}
# In the trunk directory of mahout, compile everything and create the mahout job:
{noformat}
$ cd $MAHOUT_HOME
$ mvn install
{noformat}
# Run the 20 Newsgroups example by executing the script:
{noformat}
$ ./examples/bin/build-20news-bayes.sh
{noformat}
The script performs the following steps:
## Downloads {{20news-bydate.tar.gz}} from the [20newsgroups dataset|http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz]
## Extracts dataset
## Generates input dataset for training classifier
## Generates input dataset for testing classifier
## Trains the classifier
## Tests the classifier
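
For orientation, the first two steps (download and extract) can be sketched by hand as follows. This is not the content of {{build-20news-bayes.sh}}: {{fetch_and_extract}} and {{WORK_DIR}} are hypothetical names chosen for illustration, and the sketch assumes {{curl}} and {{tar}} are installed. The dataset-generation, training, and testing steps are left to the script.

```shell
#!/bin/sh
# Sketch of the download and extract steps only.
# Not the actual build-20news-bayes.sh; names below are hypothetical.
DATASET_URL="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz"
WORK_DIR="${WORK_DIR:-/tmp/20news-work}"   # assumed scratch directory

fetch_and_extract() {
  url="$1"
  dir="$2"
  archive="$dir/$(basename "$url")"
  mkdir -p "$dir"
  # Skip the download when the archive is already present.
  if [ ! -f "$archive" ]; then
    curl -fL "$url" -o "$archive"
  fi
  # Unpack into the working directory.
  tar -xzf "$archive" -C "$dir"
}
```

Calling {{fetch_and_extract "$DATASET_URL" "$WORK_DIR"}} downloads the archive once and unpacks it; later calls reuse the local copy.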

Output might look like:
{noformat}
=======================================================
Confusion Matrix
-------------------------------------------------------
a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   <--Classified as
381 0   0   0   0   9   1   0   0   0   1   0   0   2   0   1   0   0   3   0   0    |  398  a     = rec.motorcycles
1   284 0   0   0   0   1   0   6   3   11  0   66  3   0   1   6   0   4   9   0    |  395  b     = comp.windows.x
2   0   339 2   0   3   5   1   0   0   0   0   1   1   12  1   7   0   2   0   0    |  376  c     = talk.politics.mideast
4   0   1   327 0   2   2   0   0   2   1   1   0   5   1   4   12  0   2   0   0    |  364  d     = talk.politics.guns
7   0   4   32  27  7   7   2   0   12  0   0   6   0   100 9   7   31  0   0   0    |  251  e     = talk.religion.misc
10  0   0   0   0   359 2   2   0   1   3   0   1   6   0   1   0   0   11  0   0    |  396  f     = rec.autos
0   0   0   0   0   1   383 9   1   0   0   0   0   0   0   0   0   0   3   0   0    |  397  g     = rec.sport.baseball
1   0   0   0   0   0   9   382 0   0   0   0   1   1   1   0   2   0   2   0   0    |  399  h     = rec.sport.hockey
2   0   0   0   0   4   3   0   330 4   4   0   5   12  0   0   2   0   12  7   0    |  385  i     = comp.sys.mac.hardware
0   3   0   0   0   0   1   0   0   368 0   0   10  4   1   3   2   0   2   0   0    |  394  j     = sci.space
0   0   0   0   0   3   1   0   27  2   291 0   11  25  0   0   1   0   13  18  0    |  392  k     = comp.sys.ibm.pc.hardware
8   0   1   109 0   6   11  4   1   18  0   98  1   3   11  10  27  1   1   0   0    |  310  l     = talk.politics.misc
0   11  0   0   0   3   6   0   10  6   11  0   299 13  0   2   13  0   7   8   0    |  389  m     = comp.graphics
6   0   1   0   0   4   2   0   5   2   12  0   8   321 0   4   14  0   8   6   0    |  393  n     = sci.electronics
2   0   0   0   0   0   4   1   0   3   1   0   3   1   372 6   0   2   1   2   0    |  398  o     = soc.religion.christian
4   0   0   1   0   2   3   3   0   4   2   0   7   12  6   342 1   0   9   0   0    |  396  p     = sci.med
0   1   0   1   0   1   4   0   3   0   1   0   8   4   0   2   369 0   1   1   0    |  396  q     = sci.crypt
10  0   4   10  1   5   6   2   2   6   2   0   2   1   86  15  14  152 0   1   0    |  319  r     = alt.atheism
4   0   0   0   0   9   1   1   8   1   12  0   3   6   0   2   0   0   341 2   0    |  390  s     = misc.forsale
8   5   0   0   0   1   6   0   8   5   50  0   40  2   1   0   9   0   3   256 0    |  394  t     = comp.os.ms-windows.misc
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0    |  0    u     = unknown
{noformat}
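
In this matrix each row is a true newsgroup, each column is the label the classifier assigned, and the number after the {{|}} is the row total, so the diagonal cell divided by the row total gives the fraction of that newsgroup classified correctly. A minimal sketch of that calculation with {{awk}} follows; {{confusion_accuracy}} is a hypothetical helper, and it assumes the rows appear in the same order as the column letters, as in the output above.

```shell
#!/bin/sh
# Sketch: per-class and overall accuracy from a confusion matrix printed
# in the layout shown above. confusion_accuracy is a hypothetical helper.
confusion_accuracy() {
  awk '
    {
      # Find the "|" that separates the counts from the row total.
      bar = 0
      for (i = 1; i <= NF; i++) if ($i == "|") { bar = i; break }
      if (bar == 0) next            # header, title, or separator line
      row++
      total = $(bar + 1)            # row total printed after the bar
      correct = $row                # diagonal cell: row k, column k
      sum_correct += correct
      sum_total += total
      if (total > 0)
        printf "%-28s %d/%d\n", $NF, correct, total
    }
    END {
      if (sum_total > 0)
        printf "overall %d/%d = %.2f%%\n", sum_correct, sum_total, 100 * sum_correct / sum_total
    }
  '
}
```

With the test output saved to a file (for example a hypothetical {{matrix.txt}}), run {{confusion_accuracy < matrix.txt}}. Rows with a total of 0, such as {{unknown}}, are skipped in the per-class listing.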

h2. Complementary Naive Bayes

To train a Complementary Naive Bayes (CBayes) classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input \
  -o newsmodel \
  -type cbayes \
  -ng 2 \
  -source hdfs
{noformat}

To test a CBayes classifier using bi-grams:
{noformat}
$ $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type cbayes \
  -ng 2 \
  -source hdfs \
  -method mapreduce
{noformat}
