Posted to user@mahout.apache.org by Jakub Stransky <st...@gmail.com> on 2014/12/01 14:09:25 UTC
20 news groups example
Hello experienced mahout users,
I am new to Mahout and I am trying to run the naive Bayes classification
example with the 20 newsgroups categories. There is one thing I do not
understand and have been unable to spot. To train a classifier I need labeled
data, but I don't see how the label of a particular document is passed to
model training.
I think I understand TF, IDF, etc., but I simply don't see how the label is
passed.
Could someone provide some insight into this?
Thx
Jakub
Re: 20 news groups example
Posted by 万代豊 <20...@gmail.com>.
Hi Jakub,
To label the training data for naive Bayes classification in Mahout, you
simply place your training text files into folders whose names are the
desired labels.
For example, for the 20 newsgroups data, you place your text files into the
following folders:
[hadoop@localhost 20news-all]$ ls
alt.atheism               comp.windows.x      rec.sport.hockey  soc.religion.christian
comp.graphics             misc.forsale        sci.crypt         talk.politics.guns
comp.os.ms-windows.misc   rec.autos           sci.electronics   talk.politics.mideast
comp.sys.ibm.pc.hardware  rec.motorcycles     sci.med           talk.politics.misc
comp.sys.mac.hardware     rec.sport.baseball  sci.space         talk.religion.misc
[hadoop@localhost 20news-all]$
Mahout takes the folder (directory) names as the training labels and
assigns each label to the documents under that folder.
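To make the folder-as-label convention concrete, here is a minimal sketch that runs on a local filesystem only, without Mahout. The helper name label_of and the file names doc1 and doc2 are made up for illustration; the point is just that a document's label is the name of the directory it sits in.

```shell
#!/bin/sh
# Hypothetical helper: the label of a document is the name of its parent folder.
label_of() {
    basename "$(dirname "$1")"
}

# Build a tiny stand-in corpus (made-up file names) in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/sci.space" "$demo/rec.autos"
echo "orbit text"  > "$demo/sci.space/doc1"
echo "engine text" > "$demo/rec.autos/doc2"

# List every document together with the label implied by its folder.
for f in "$demo"/*/*; do
    printf '%s\t%s\n' "$(label_of "$f")" "$(basename "$f")"
done

rm -rf "$demo"
```

Moving a file to a different folder is therefore all it takes to relabel it before the data is uploaded.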
Send all of these into HDFS and convert them into SequenceFiles.
[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -put * 20News-All
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seqdirectory -i 20News-All -o 20News-Seq
Generate the term vectors.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout seq2sparse -i 20News-Seq -o 20News-Vectors -lnorm -nv -wt tfidf
Split the original labeled data into training data and test data (30% for testing).
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout split -i 20News-Vectors/tfidf-vectors \
    --trainingOutput 20News-Train-Vectors --testOutput 20News-Test-Vectors \
    --randomSelectionPct 30 --overwrite --sequenceFiles --method sequential
You will now have these directories on HDFS.
[hadoop@localhost 20news-all]$ $HADOOP_HOME/bin/hadoop dfs -ls
Found 11 items
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:29 /user/hadoop/20News-All
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:31 /user/hadoop/20News-Seq
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Test-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 05:03 /user/hadoop/20News-Train-Vectors
drwxr-xr-x   - hadoop supergroup          0 2013-10-18 04:46 /user/hadoop/20News-Vectors
Train your model with the remaining 70% of the data.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout trainnb -i 20News-Train-Vectors -el -o 20News-NBModel -li 20News-LabelIndex -ow
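As far as I understand it, this is the step where the label actually reaches the trainer: seqdirectory writes each document under a key of the form /<folder>/<file>, and the -el (extract labels) option of trainnb reads the label back out of that key. The sketch below only imitates that string handling in plain shell. The function name label_from_key and the key value are made up for illustration, and the key format is my assumption about how Mahout behaves, not something shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: take the first path component of a key such as
# "/alt.atheism/53068" (assumed key format: /<label>/<file>).
label_from_key() {
    # Field 1 is the empty string before the leading slash, field 2 is the label.
    printf '%s\n' "$1" | cut -d/ -f2
}

label_from_key "/alt.atheism/53068"
```

If the assumption holds, the distinct labels found this way are what ends up in the label index written to the -li path.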
Test your model and check the confusion matrix.
[hadoop@localhost 20news-all]$ $MAHOUT_HOME/bin/mahout testnb -i 20News-Test-Vectors -m 20News-NBModel -l 20News-LabelIndex -ow -o 20News-NB-Testing
You will see output like this:
13/10/18 05:23:33 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 5172 91.4912%
Incorrectly Classified Instances : 481 8.5088%
Total Classified Instances : 5653
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
 234   0   0   1   0   0   0   0   0   0   0   1   0   1   1   0   0   0   9   1 |  248  a = alt.atheism
   0 256   6  12  10   8   3   0   0   0   0   1   2   1   1   0   0   0   1   0 |  301  b = comp.graphics
   1  15 236  30   5  10   3   0   0   0   0   2   0   0   1   0   0   0   0   1 |  304  c = comp.os.ms-windows.misc
   0   3   8 263   8   3   6   0   0   0   0   0   6   0   0   0   0   0   0   0 |  297  d = comp.sys.ibm.pc.hardware
   1   5   3   8 251   2   3   1   1   0   0   0   2   0   0   0   0   0   0   0 |  277  e = comp.sys.mac.hardware
   0  13   1   2   4 277   2   0   0   0   0   0   1   0   2   0   0   0   0   0 |  302  f = comp.windows.x
   0   2   3  15   3   1 233   6   2   2   0   1   9   1   2   0   0   1   0   1 |  282  g = misc.forsale
   0   2   1   1   3   0   8 255   3   0   0   0   4   1   0   0   0   0   0   0 |  278  h = rec.autos
   0   0   0   0   0   0   0   6 270   0   0   0   0   0   0   0   0   0   0   0 |  276  i = rec.motorcycles
   0   0   0   2   1   0   1   1   1 269   2   0   1   0   1   0   0   0   0   0 |  279  j = rec.sport.baseball
   0   1   0   0   2   0   1   0   1   3 276   0   0   0   0   1   0   0   0   0 |  285  k = rec.sport.hockey
   0   1   1   0   0   2   0   0   0   0   0 323   1   2   0   0   0   3   1   0 |  334  l = sci.crypt
   0   3   0   9   7   2   3   4   0   0   1   3 260   1   2   0   0   0   0   0 |  295  m = sci.electronics
   0   1   0   0   0   0   2   1   0   0   0   0   5 299   1   2   0   1   0   1 |  313  n = sci.med
   0   0   0   0   2   1   0   0   0   0   0   0   1   1 291   0   0   0   0   2 |  298  o = sci.space
   1   2   0   0   1   1   1   0   0   0   0   0   0   4   0 281   3   0   4   1 |  299  p = soc.religion.christian
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 295   1   0   2 |  298  q = talk.politics.mideast
   0   0   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0 253   1  11 |  267  r = talk.politics.guns
  16   1   0   0   0   1   0   1   0   0   0   0   0   0   2  12   2   6 142   4 |  187  s = talk.religion.misc
   1   1   0   1   0   0   0   0   0   0   0   1   0   3   3   0   0  13   2 208 |  233  t = talk.politics.misc
13/10/18 05:23:33 INFO driver.MahoutDriver: Program took 35037 ms (Minutes: 0.584)
[hadoop@localhost 20news-all]$
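The summary percentage can be reproduced from the counts in the output above, which is a handy sanity check when reading the confusion matrix. This is only a small arithmetic sketch using awk; the variable names are mine. The first figure is overall accuracy (correct over total), the second is the per-class recall for the weakest row, talk.religion.misc (142 correct out of 187).

```shell
#!/bin/sh
# Counts copied from the testnb summary above.
correct=5172
total=5653

# Overall accuracy as a percentage with four decimals.
accuracy=$(awk -v c="$correct" -v t="$total" 'BEGIN { printf "%.4f", 100 * c / t }')
echo "accuracy: ${accuracy}%"

# Recall for row s (talk.religion.misc): diagonal entry over row total.
recall=$(awk -v d=142 -v r=187 'BEGIN { printf "%.1f", 100 * d / r }')
echo "recall for talk.religion.misc: ${recall}%"
```

Most of the misses in that row (16 documents) land in alt.atheism, which is where the two categories overlap in subject matter.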
I believe I ran this on Mahout 0.7 or 0.8. (I have not tried it on 0.9 yet.)
Regards,
Y.Mandai