Posted to user@mahout.apache.org by Jakub Stransky <st...@gmail.com> on 2014/12/01 14:09:25 UTC

20 news groups example

Hello experienced mahout users,

I am new to Mahout and am trying to run the naive Bayes classification
example with the 20 newsgroups categories. There is one thing I do not
understand and have been unable to spot. To train a classifier I need
labeled data, but I don't see how the label of a particular document is
passed to model training.
I think I understand TF, IDF, etc., but I simply don't see how the label is
passed.

Could someone provide some insight into this?

Thx
Jakub

Re: 20 news groups example

Posted by 万代豊 <20...@gmail.com>.

Hi Jakub,
To label the training data for Bayesian classification in Mahout, simply
place your training text files into folders named after the desired labels.
For example, for the 20 newsgroups data, the text goes into the following
folders:

[hadoop@localhost 20news-all]$ ls
alt.atheism              comp.sys.ibm.pc.hardware  misc.forsale     rec.sport.baseball  sci.electronics  soc.religion.christian  talk.politics.misc
comp.graphics            comp.sys.mac.hardware     rec.autos        rec.sport.hockey    sci.med          talk.politics.guns      talk.religion.misc
comp.os.ms-windows.misc  comp.windows.x            rec.motorcycles  sci.crypt           sci.space        talk.politics.mideast
[hadoop@localhost 20news-all]$

Mahout takes each folder (directory) name as the training label and
assigns it to every document under that folder.
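
The folder-to-label mapping can be illustrated with a small local sketch
that does not need Hadoop or Mahout. It assumes that seqdirectory writes
each document under a key of the form /<directory>/<file>, and that trainnb
with -el takes the label from the first path component of that key; treat
both as assumptions to check against your Mahout version. The file names
doc1, doc2, doc3 and the helper functions list_keys and label_of_key are
made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: how a one-directory-per-label layout maps to (key, label) pairs.

# Build a tiny stand-in corpus (file names are hypothetical).
corpus="$(mktemp -d)"
mkdir -p "$corpus/alt.atheism" "$corpus/sci.space"
echo "first message"  > "$corpus/alt.atheism/doc1"
echo "second message" > "$corpus/sci.space/doc2"
echo "third message"  > "$corpus/sci.space/doc3"

# Assumed key format: /<directory>/<file>, relative to the input root.
list_keys() {
    (cd "$1" && find . -type f | sed 's|^\.||' | sort)
}

# The label is the first path component of the key.
label_of_key() {
    printf '%s\n' "$1" | cut -d/ -f2
}

# Print one "key<TAB>label" line per document.
list_keys "$corpus" | while read -r key; do
    printf '%s\t%s\n' "$key" "$(label_of_key "$key")"
done
```

Each printed line pairs a document key with the label derived from its
folder, so moving a file to another folder is all it takes to relabel it.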

Put all of these into HDFS and convert them into a SequenceFile.
[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -put * 20News-All

[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seqdirectory -i 20News-All -o 20News-Seq

Generate term vectors.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seq2sparse -i 20News-Seq -o 20News-Vectors -lnorm -nv -wt tfidf

Split the labeled data into training data (70%) and test data (30%).
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout split -i 20News-Vectors/tfidf-vectors \
    --trainingOutput 20News-Train-Vectors --testOutput 20News-Test-Vectors \
    --randomSelectionPct 30 --overwrite --sequenceFiles --method sequential
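
To get a feel for what a 30% random holdout does, here is a minimal local
sketch using awk. It is not Mahout's implementation, only the general idea:
each record independently goes to the test set with the given probability,
so the test set is roughly, not exactly, 30% of the data. The function name
random_split, the seed 42, and the 1000 stand-in document ids are
assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a percentage-based random holdout (not Mahout's own code).

workdir="$(mktemp -d)"
seq 1 1000 > "$workdir/all.txt"    # stand-in for 1000 document ids

# random_split INPUT TEST_PCT SEED TRAIN_OUT TEST_OUT
random_split() {
    : > "$4"; : > "$5"    # make sure both outputs exist, even if empty
    awk -v pct="$2" -v seed="$3" -v train_out="$4" -v test_out="$5" '
        BEGIN { srand(seed) }
        {
            # Send each line to the test set with probability pct/100.
            if (rand() * 100 < pct) print > test_out
            else                    print > train_out
        }
    ' "$1"
}

random_split "$workdir/all.txt" 30 42 "$workdir/train.txt" "$workdir/test.txt"
wc -l "$workdir/train.txt" "$workdir/test.txt"
```

Every record lands in exactly one of the two files, which is the property
that matters for evaluation: the model is never tested on data it was
trained on.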

You will now have these on your HDFS.

[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -ls
Found 11 items
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:29 /user/hadoop/20News-All
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:31 /user/hadoop/20News-Seq
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Test-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Train-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:46 /user/hadoop/20News-Vectors

Train your model on the remaining 70% of the data.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout trainnb -i 20News-Train-Vectors -el -o 20News-NBModel -li 20News-LabelIndex -ow
[hadoop@localhost 20news-all]$

Test your model and check the confusion matrix.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout testnb -i 20News-Test-Vectors -m 20News-NBModel -l 20News-LabelIndex -ow -o 20News-NB-Testing
[hadoop@localhost 20news-all]$

You will see output like this:

13/10/18 05:23:33 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       5172   91.4912%
Incorrectly Classified Instances        :        481    8.5088%
Total Classified Instances              :       5653

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
234  0    0    1    0    0    0    0    0    0    0    1    0    1    1    0    0    0    9    1    |  248   a     = alt.atheism
0    256  6    12   10   8    3    0    0    0    0    1    2    1    1    0    0    0    1    0    |  301   b     = comp.graphics
1    15   236  30   5    10   3    0    0    0    0    2    0    0    1    0    0    0    0    1    |  304   c     = comp.os.ms-windows.misc
0    3    8    263  8    3    6    0    0    0    0    0    6    0    0    0    0    0    0    0    |  297   d     = comp.sys.ibm.pc.hardware
1    5    3    8    251  2    3    1    1    0    0    0    2    0    0    0    0    0    0    0    |  277   e     = comp.sys.mac.hardware
0    13   1    2    4    277  2    0    0    0    0    0    1    0    2    0    0    0    0    0    |  302   f     = comp.windows.x
0    2    3    15   3    1    233  6    2    2    0    1    9    1    2    0    0    1    0    1    |  282   g     = misc.forsale
0    2    1    1    3    0    8    255  3    0    0    0    4    1    0    0    0    0    0    0    |  278   h     = rec.autos
0    0    0    0    0    0    0    6    270  0    0    0    0    0    0    0    0    0    0    0    |  276   i     = rec.motorcycles
0    0    0    2    1    0    1    1    1    269  2    0    1    0    1    0    0    0    0    0    |  279   j     = rec.sport.baseball
0    1    0    0    2    0    1    0    1    3    276  0    0    0    0    1    0    0    0    0    |  285   k     = rec.sport.hockey
0    1    1    0    0    2    0    0    0    0    0    323  1    2    0    0    0    3    1    0    |  334   l     = sci.crypt
0    3    0    9    7    2    3    4    0    0    1    3    260  1    2    0    0    0    0    0    |  295   m     = sci.electronics
0    1    0    0    0    0    2    1    0    0    0    0    5    299  1    2    0    1    0    1    |  313   n     = sci.med
0    0    0    0    2    1    0    0    0    0    0    0    1    1    291  0    0    0    0    2    |  298   o     = sci.space
1    2    0    0    1    1    1    0    0    0    0    0    0    4    0    281  3    0    4    1    |  299   p     = soc.religion.christian
0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    295  1    0    2    |  298   q     = talk.politics.mideast
0    0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    0    253  1    11   |  267   r     = talk.politics.guns
16   1    0    0    0    1    0    1    0    0    0    0    0    0    2    12   2    6    142  4    |  187   s     = talk.religion.misc
1    1    0    1    0    0    0    0    0    0    0    1    0    3    3    0    0    13   2    208  |  233   t     = talk.politics.misc


13/10/18 05:23:33 INFO driver.MahoutDriver: Program took 35037 ms (Minutes: 0.584)
[hadoop@localhost 20news-all]$
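
The summary percentages can be re-derived from the raw counts in the
output above, which is a quick sanity check when reading a confusion
matrix. The sketch below uses only figures copied from that output; the
helper name pct is made up for illustration. Per-class recall is the
diagonal entry of a row divided by that row's total, shown here for the
weakest class, talk.religion.misc (142 of 187).

```shell
#!/usr/bin/env bash
# Sketch: recompute accuracy and one per-class recall from the testnb output.

# Figures copied from the summary above.
correct=5172
incorrect=481

# pct PART TOTAL -> percentage with four decimals
pct() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 100 * p / t }'
}

total=$((correct + incorrect))
echo "total classified: $total"
echo "accuracy:         $(pct "$correct" "$total")%"

# Row s of the confusion matrix: 142 correct out of 187 documents.
echo "recall for talk.religion.misc: $(pct 142 187)%"
```

The accuracy line reproduces the 91.4912% reported in the log, while the
recall for talk.religion.misc is noticeably lower, mostly because of
confusion with alt.atheism and soc.religion.christian.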

I believe I did this on Mahout 0.7 or 0.8 (I have not tried it on 0.9 yet).
Regards,
Y.Mandai
