Posted to user@mahout.apache.org by Divya <di...@k2associates.com.sg> on 2010/11/19 07:52:02 UTC

classification example doubts

Hi,

I have a few questions regarding classification in Mahout. They may look
silly, as I am a newbie to Mahout and still trying to understand the logic.

I am following https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html for
classification.

1) I want to know what should go in "bayes-test-input".

When I extract 20news-bydate.tar.gz I get only 20news-bydate-test and
20news-bydate-train.

As per the steps, we generate the input dataset from 20news-bydate-train and
use that output as input to train the classifier.

2) Take the Wikipedia example:
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html

To train the classifier we used Wikipediainput to generate the model.

To test the classifier we again used wikipediamodel as input and Wikipedia
input as the test documents directory.

I don't understand why we are doing so.

3) Finally, when we run testclassifier from the command line we can see
the output.

How can we make use of this output?

Thanks in advance.

Regards,

Divya

Re: classification example doubts

Posted by Lance Norskog <go...@gmail.com>.
Do you have to give complete pathnames in the last pass?

In the cygwin shell, is /cygdrive/d/mahout-0.4 your current directory?

If you do a complete directory walk on D:
find /cygdrive/d -name 'bayes*'

you may find that the program parked the output somewhere else. It
might even write it on the C: drive. I've never tried to do
cross-drive Java path stuff.
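
The same directory walk can be sketched in Python for anyone without a working find. This is only an illustration: the function name find_dirs is made up for the example, and /cygdrive/d is simply the drive root mentioned above.

```python
import fnmatch
import os


def find_dirs(root, pattern="bayes*"):
    """Return every directory under root whose name matches pattern."""
    matches = []
    for current, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(current, name))
    return sorted(matches)


if __name__ == "__main__":
    # "/cygdrive/d" is taken from the message above; adjust it as needed.
    for path in find_dirs("/cygdrive/d"):
        print(path)
```

If the model or input directories show up somewhere unexpected, that would suggest the relative paths were resolved against a different working directory.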

-- 
Lance Norskog
goksron@gmail.com

Re: classification example doubts

Posted by Robin Anil <ro...@gmail.com>.
I am guessing this is a bad interaction on Windows. I have never tested this
code on Windows and, weird as it sounds, I really don't have access to a
Windows machine. There could be Unix-like path generation code somewhere.
I see the model is loading correctly. We need to verify whether the same is
true of reading the input files. Can you paste the test classifier code,
specifically the input text reading part, and see if it's loading the text
into the string correctly?

Robin

Re: classification example doubts

Posted by JAGANADH G <ja...@gmail.com>.
On Tue, Nov 23, 2010 at 11:09 AM, Divya <di...@k2associates.com.sg> wrote:

> I am following same steps
> But no success...
>
>

Are you using Cygwin or GNU/Linux?
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

RE: classification example doubts

Posted by Divya <di...@k2associates.com.sg>.
I am following the same steps,
but no success...

Re: classification example doubts

Posted by Sreejith S <sr...@gmail.com>.
Step 1: You can provide your own sample data set using the prepare20news
example; just provide your input dir. This performs some normalization on
each file and is required.

Step 2: Train the classifier with the normalized list of files.
You get a model dir which contains the trained data set in HDFS.

Step 3: Test the classifier.
Using the trained model and sample input, you can test the classifier.

Regards
Sreejith
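
Step 3 reports a summary and a confusion matrix, like the output quoted below. As a rough sketch of how those numbers relate, the following Python computes the summary line from a matrix. It is not Mahout code; it assumes each row is the true category and each column is the category the classifier assigned, and the function name summarize is made up for the example.

```python
def summarize(matrix):
    """Return (correct, incorrect, total, accuracy) for a confusion matrix.

    Assumption: matrix[i][j] is the number of documents whose true class is i
    and which were labelled as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    # With no documents there is no accuracy to report (the "?%" case).
    accuracy = correct / total if total else None
    return correct, total - correct, total, accuracy


if __name__ == "__main__":
    print(summarize([[8, 2], [1, 9]]))
    print(summarize([[0, 0], [0, 0]]))
```

An all-zero matrix gives a total of 0 and no accuracy, which matches the "0 ... ?%" summary in the quoted run. That pattern suggests no test documents were read at all, so it is worth checking the input directory before suspecting the model.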


On Fri, Nov 19, 2010 at 1:15 PM, Divya <di...@k2associates.com.sg> wrote:

> For my first question, you say we can put our own input documents in the
> directory, and that those documents should also be in a format similar to
> bayes-train-input.
> If so: I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but didn't get the expected results.
> As far as I can see, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Results below:
>
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
> 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
> 10/11/19 10:45:12 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0             ?%
> Incorrectly Classified Instances        :          0             ?%
> Total Classified Instances              :          0
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  <--Classified as
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  a = rec.sport.baseball
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  b = sci.crypt
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  c = rec.sport.hockey
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  d = talk.politics.guns
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  e = soc.religion.christian
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  f = sci.electronics
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  g = comp.os.ms-windows.misc
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  h = misc.forsale
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  i = talk.religion.misc
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  j = alt.atheism
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  k = comp.windows.x
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  l = talk.politics.mideast
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  m = comp.sys.ibm.pc.hardware
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  n = comp.sys.mac.hardware
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  o = sci.space
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  p = rec.motorcycles
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  q = rec.autos
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  r = comp.graphics
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  s = talk.politics.misc
> 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  t = sci.med
> Default Category: unknown: 20
>
>
> 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
>
> Am I missing anything?
>
>
> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> My understanding of classification is that we have a set of documents which
> will act as the model for classifying new documents in the system.
> Am I right?
> Doesn't Mahout work the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>
>
> @Jaganadh - Thanks for clearing my doubts
>
> Regards,
> Divya
>
>
> -----Original Message-----
> From: JAGANADH G [mailto:jaganadhg@gmail.com]
> Sent: Friday, November 19, 2010 3:09 PM
> To: user@mahout.apache.org
> Subject: Re: classification example doubts
>
> >
> > 1)      I want to  know what should go in "bayes-test-input".
> >
> >
> After preparing the 20 newsgroups data for training, you can set aside some
> documents for testing your classifier.
> These documents should go in "bayes-test-input".
>
> Or you can even put a new set of documents in the directory.
>
>
> > 2)      If we take Wikipedia example
> > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> >
> >
> >
> > To  trainclassifier We have used Wikipediainput to generate model .
> >
> > To test classifier again we used wikipediamodel as input and Wikipedia
> > input
> > as test documents directory.
> >
> > I didn't understand why are we doing so ?
> >
> >
>
> We are testing the classifier against the development set we used.
>
>
>
> > 3)      Last thing I want to know that when we use run testclassifier
> > using command line we can see the output.
> >
> > How can we make use of this output?
> >
>
>
> Are you looking for Mahout API usage for classification?
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
>
>

RE: classification example doubts

Posted by Divya <di...@k2associates.com.sg>.
Hi,

Yes, I understand the logic behind it.
First we provide a set of documents and train the classifier, which builds a
model out of them.
Then, when testing the classifier, we provide input data that has been
prepared in the same dataset form,
and it classifies that data according to the built model.

I am doing the same thing.

I am using the test input that ships with the 20news-bydate.tar.gz data set.
When we extract 20news-bydate.tar.gz we get two directories,
20news-bydate-train and 20news-bydate-test; I am using the first to train
the classifier and the second to test it.


Steps I am following:
1. Extract the dataset
  tar zxf 20news-bydate.tar.gz

2. Generate the input dataset to train the classifier
$ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3. Generate the input dataset to test the classifier
$ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/20news-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

4. Train the classifier
bin/mahout trainclassifier \
  -i examples/bin/work/20news-bydate/bayes-train-input \
  -o examples/bin/work/20news-bydate/bayes-model \
  -type bayes -ng 1 -source hdfs

5. Test the classifier
$ bin/mahout testclassifier \
  -m D:/mahout-0.4/examples/bin/work/20news-bydate/bayes-model \
  -d D:/mahout-0.4/examples/bin/work/20news-test-input \
  -type bayes -ng 1 -method sequential

I am not getting the expected output. You can view my result at
http://pastebin.com/CicVMpST.

I am still trying to figure out what's missing in my steps.

Can anyone help me?

Regards,
Divya 
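
One detail in the steps above may be worth checking: step 3 writes the prepared test data to examples/bin/work/20news-bydate/20news-test-input, while the -d path in step 5 has no 20news-bydate component. Whether that is the cause is not confirmed in the thread. A small Python sketch for checking that a directory exists and holds files before passing it to -d (the function name count_files is made up, and the paths are assumed to be relative to the Mahout install directory):

```python
import os


def count_files(directory):
    """Return the number of regular files under directory, or None if it is missing."""
    if not os.path.isdir(directory):
        return None
    return sum(len(filenames) for _root, _dirs, filenames in os.walk(directory))


if __name__ == "__main__":
    # Paths copied from steps 3 and 5 above, without the D:/mahout-0.4 prefix;
    # run this from the Mahout install directory.
    for path in (
        "examples/bin/work/20news-bydate/20news-test-input",  # written by step 3
        "examples/bin/work/20news-test-input",                # read by step 5
    ):
        print(path, "->", count_files(path))
```

A result of None or 0 for the directory given to -d would be consistent with a run that classifies zero instances.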



Re: classification example doubts

Posted by Ted Dunning <te...@gmail.com>.
There is a pretty easy command line interface for doing this.  Instantiating
a Naive Bayes classifier from code is kind of complicated, but can be doped
out by reading and hacking the code.

The SGD classifiers are designed to be easier to integrate on an API basis
and Robin has been working on pulling something together for the Naive Bayes
code.


Re: classification example doubts

Posted by ivek gimmick <gi...@gmail.com>.
Hi guys,

   I was able to test this example. But how do I use the actual classifier?
Once I train on the data and have the model, I want to use the model to
categorize a new set of data which is not classified.

   Is there any straightforward way to do this with Mahout, or should I be
tweaking the code?

Regards,
~Vivek


Re: classification example doubts

Posted by JAGANADH G <ja...@gmail.com>.
On Fri, Nov 19, 2010 at 1:15 PM, Divya <di...@k2associates.com.sg> wrote:

> For my first question, you say we can put our own input documents in the
> directory, and that those documents should also be in a format similar to
> bayes-train-input.
> If so: I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but didn't get the expected results.
> As far as I can see, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Am I missing anything?
>
>
I think something went wrong with your training.
I trained on the 20 newsgroups data and tested it. My result is available at
http://pastebin.com/kGY4LmW7 . Check it.

The commands I used:
1) To prepare the data:
bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ \
  -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
2) To train:
bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2
3) To test:
bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 \
  -method sequential

The result is available at http://pastebin.com/kGY4LmW7


>
> Coming to my second question: that means we are testing the classifier
> against our own inputs.
> I still don't understand.
> My understanding of classification is that we have a set of documents which
> will act as the model for classifying new documents in the system.
> Am I right?
>


The documents do not act as the model. Mahout's TrainClassifier creates a
model from the documents provided for training.
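
To illustrate the difference between the training documents and the model,
here is a toy multinomial naive Bayes in Python. It is only a sketch and not
Mahout's implementation (Mahout's bayes and cbayes trainers compute their own
weights). The function names train and classify, the sample documents, and
the use of alpha as simple additive smoothing are all assumptions made for
this example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs, alpha=1.0):
    """Build a model (per-label token counts) from (label, tokens) pairs."""
    token_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for label, tokens in labeled_docs:
        token_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return {
        "token_counts": dict(token_counts),
        "doc_counts": doc_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def classify(model, tokens, default="unknown"):
    """Return the best-scoring label, or `default` if no token is known."""
    known = [t for t in tokens if t in model["vocabulary"]]
    if not known:
        return default
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    alpha = model["alpha"]
    best_label, best_score = default, -math.inf
    for label, counts in model["token_counts"].items():
        # log prior + sum of smoothed log likelihoods
        score = math.log(model["doc_counts"][label] / total_docs)
        denom = sum(counts.values()) + alpha * vocab_size
        for token in known:
            score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training documents, already tokenized.
training = [
    ("rec.autos", "engine car wheel".split()),
    ("rec.autos", "car brake engine".split()),
    ("sci.space", "orbit rocket launch".split()),
    ("sci.space", "rocket moon orbit".split()),
]
model = train(training)
print(classify(model, "new rocket engine orbit".split()))
print(classify(model, "zzz".split()))
```

Once train() has run, the original documents are no longer needed: classify()
only looks at the counts stored in the model. A document with no known tokens
falls back to the default category.
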
The testclassifier command takes the following arguments:
1) A directory containing the model (specified after -m).
2) A directory containing the documents for testing the classifier
(specified after -d). Documents in this directory should be formatted the
same way we prepared the documents for training.
3) Type of the classifier algorithm (specified after -type). Here I used bayes.
4) Default category name (specified after -default). You can set it to
"unknown".
5) Value of Alpha_i used in training (specified after -a). The default is
1.0.
6) Source of the model directory (specified after -source). You can set it
to hdfs.
7) Ngram size (specified after -ng). It should be the same as the one you
used in training.

A sample command with all these parameters is shown below:
bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
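
Since the classifier type, alpha, ngram size and model directory have to
agree between training and testing, one way to avoid a mismatch is to build
both command lines from a single set of values. The small Python helper below
is a sketch of that idea. It only assembles the strings, using the flags that
appear in the commands in this thread; the function name build_commands and
the keys of the settings dictionary are made up for illustration.

```python
import shlex


def build_commands(settings):
    """Return (train_command, test_command) built from one settings dict."""
    # Values that must be identical in both steps.
    shared = [
        "-type", settings["type"],
        "-a", str(settings["alpha"]),
        "-ng", str(settings["ngram"]),
    ]
    train = [
        "bin/mahout", "trainclassifier",
        "-i", settings["train_dir"],
        "-o", settings["model_dir"],
    ] + shared
    test = [
        "bin/mahout", "testclassifier",
        "-d", settings["test_dir"],
        "-m", settings["model_dir"],
        "-default", settings["default"],
        "-method", "sequential",
        "-source", "hdfs",
        "-e", "UTF-8",
    ] + shared
    return shlex.join(train), shlex.join(test)


# Hypothetical values; adjust the directories to your own layout.
settings = {
    "type": "cbayes",
    "alpha": 1.0,
    "ngram": 2,
    "train_dir": "20news/",
    "test_dir": "20news-test/",
    "model_dir": "20news-model/",
    "default": "unknown",
}
train_cmd, test_cmd = build_commands(settings)
print(train_cmd)
print(test_cmd)
```

The two printed lines can be pasted into the shell. Because -type, -a, -ng and
the model directory come from the same dictionary, the test step cannot
silently point at a different model or classifier type than the training step.
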


> Doesn't Mahout work the same way?
>
> Third question, yeah I am looking for Mahout's API for classification.
>

A sample program is given below

http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java

To use it in a real-time system you will have to do some more work. Find it :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

RE: classification example doubts

Posted by Divya <di...@k2associates.com.sg>.
For my first question, you say we can put our own input documents in the
directory. Should those documents also be in a format similar to
bayes-train-input?
If yes: I generated my input data using PrepareTwentyNewsgroups
and used that as my input for testclassifier,
but didn't get the expected results.
As far as I can see, it didn't read the files in my input directory.
I tried replacing one of the files in the input directory with one of the
files from the train-input directory.
Still the same result.
Why is it not reading my files?

Results below:

10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
10/11/19 10:45:12 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0             ?%
Incorrectly Classified Instances        :          0             ?%
Total Classified Instances              :          0

=======================================================
Confusion Matrix
-------------------------------------------------------
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  <--Classified as
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  a = rec.sport.baseball
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  b = sci.crypt
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  c = rec.sport.hockey
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  d = talk.politics.guns
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  e = soc.religion.christian
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  f = sci.electronics
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  g = comp.os.ms-windows.misc
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  h = misc.forsale
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  i = talk.religion.misc
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  j = alt.atheism
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  k = comp.windows.x
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  l = talk.politics.mideast
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  m = comp.sys.ibm.pc.hardware
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  n = comp.sys.mac.hardware
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  o = sci.space
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  p = rec.motorcycles
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  q = rec.autos
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  r = comp.graphics
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  s = talk.politics.misc
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  |  0  t = sci.med
Default Category: unknown: 20


10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms

Am I missing anything?


Coming to my second question: that means we are testing the classifier
against our own inputs.
I still don't understand.
My understanding of classification is that we have a set of documents which
will act as the model for classifying new documents in the system.
Am I right?
Doesn't Mahout work the same way?

Third question: yes, I am looking for Mahout's API for classification.


@ Jaganadh - Thanks for clearing my doubts  

Regards,
Divya 

 
-----Original Message-----
From: JAGANADH G [mailto:jaganadhg@gmail.com] 
Sent: Friday, November 19, 2010 3:09 PM
To: user@mahout.apache.org
Subject: Re: classification example doubts

>
> 1)      I want to  know what should go in "bayes-test-input".
>
>
After preparing the 20 newsgroups data for training, you can set aside some
documents for testing your classifier.
These documents should go in "bayes-test-input".

Or you can even put a new set of documents in the directory.


> 2)      If we take Wikipedia example
> https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
>
>
>
> For trainclassifier we used Wikipediainput to generate the model.
>
> To test the classifier we again used wikipediamodel as input and Wikipedia
> input as the test documents directory.
>
> I didn't understand why we are doing so.
>
>

We are testing the classifier against the development set we used.



> 3)      Last, when we run testclassifier from the command line we can see
> the output.
>
> How can we make use of this output?
>


Are you looking for Mahout API usage for classification?

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog


Re: classification example doubts

Posted by JAGANADH G <ja...@gmail.com>.
>
> 1)      I want to  know what should go in "bayes-test-input".
>
>
After preparing the 20 newsgroups data for training, you can set aside some
documents for testing your classifier.
These documents should go in "bayes-test-input".

Or you can even put a new set of documents in the directory.
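
Setting documents aside could be done with a small script like the Python
sketch below. It assumes the prepared directory holds one text file per
category with one document per line, which you should check against your own
prepare20newsgroups output before relying on it. The function names
split_lines and split_directory, the 20% test fraction and the seed are all
choices made for this example.

```python
import random
from pathlib import Path


def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train_lines, test_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_directory(prepared_dir, train_dir, test_dir,
                    test_fraction=0.2, seed=42):
    """Split every file in prepared_dir into a train part and a test part."""
    for src in sorted(Path(prepared_dir).iterdir()):
        if not src.is_file():
            continue
        lines = src.read_text(encoding="utf-8").splitlines()
        train, test = split_lines(lines, test_fraction, seed)
        for out_dir, part in ((train_dir, train), (test_dir, test)):
            out = Path(out_dir)
            out.mkdir(parents=True, exist_ok=True)
            # Keep the category file name so both directories look alike.
            (out / src.name).write_text(
                "".join(line + "\n" for line in part), encoding="utf-8")


# In-memory demonstration with made-up documents.
docs = [f"rec.autos\tdocument number {i}" for i in range(10)]
train_part, test_part = split_lines(docs)
print(len(train_part), len(test_part))
```

Calling split_directory("20news", "bayes-train-input", "bayes-test-input")
would then give a training directory and a test directory in the same format,
with no document appearing in both. The 20news-bydate archive also contains a
20news-bydate-test directory, which can presumably be prepared the same way
as the training data instead of splitting by hand.
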


> 2)      If we take Wikipedia example
> https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
>
>
>
> For trainclassifier we used Wikipediainput to generate the model.
>
> To test the classifier we again used wikipediamodel as input and Wikipedia
> input as the test documents directory.
>
> I didn't understand why we are doing so.
>
>

We are testing the classifier against the development set we used.



> 3)      Last, when we run testclassifier from the command line we can see
> the output.
>
> How can we make use of this output?
>


Are you looking for Mahout API usage for classification?

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog