
Posted to user@mahout.apache.org by JAGANADH G <ja...@gmail.com> on 2010/10/18 17:07:37 UTC

Querry regarding use of classifier in Mahout

Dear All,
I am trying to use the classifier algorithms from Mahout in a sample
project.

I tried both NaiveBayesClassifer and CBayesClassifer, but I am getting the
wrong output. I trained the classifier with the data from
http://code.google.com/p/nltk/source/browse/trunk/nltk_data/packages/corpora/movie_reviews.zip

When I run the prediction module, it says that all the reviews are positive.

Any thoughts?

I think this is the third time I am posting the same question here :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

Correctly Classified Instances          :       1702      85.1%
Incorrectly Classified Instances        :        298      14.9%
Total Classified Instances              :       2000

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     <--Classified as
704   296   |  1000   a     = pos
2     998   |  1000   b     = neg

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On 10/19/10, Ted Dunning <te...@gmail.com> wrote:
> Remember it is on the training data!
>
> Naive Bayes classifiers have the property that they overfit massively but
> still give good results on held out data.  Thus,
> when tested on the same data that they trained with, they demonstrate
> results that are unrealistically good.
>
> This is still an important thing to look at.  It just isn't really 200 times
> lower error rate than any other result ever reported on this dataset.
>

One of my students did a rough test on the same data with MALLET. It
gives a somewhat similar result; that is what he told me. I will
recheck and compare the results.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

That is what I would expect.

On Wed, Oct 20, 2010 at 8:33 PM, JAGANADH G <ja...@gmail.com> wrote:

> On Thu, Oct 21, 2010 at 8:10 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Depends on what you mean by accurately.
> >
> > It will not generate the n-grams that would have resulted from the
> original
> > text.  It still generates useful ngrams.
> >
> >
> OK. I think I have to give the classifier input that has already been
> preprocessed (as prepare20newsgroups does). Only then can I get a
> reasonable result out of the trained model.

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Thu, Oct 21, 2010 at 8:10 AM, Ted Dunning <te...@gmail.com> wrote:

> Depends on what you mean by accurately.
>
> It will not generate the n-grams that would have resulted from the original
> text.  It still generates useful ngrams.
>
>
OK. I think I have to give the classifier input that has already been
preprocessed (as prepare20newsgroups does). Only then can I get a
reasonable result out of the trained model.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

Depends on what you mean by accurately.

It will not generate the n-grams that would have resulted from the original
text.  It still generates useful ngrams.
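
To make the point above concrete, here is a minimal sketch of word n-gram generation over a token list. It is not Mahout's generateNGramsWithoutLabel(); the class NGramSketch and its method are hypothetical names used only for illustration. It shows that once stop words are gone, the bigrams are built from words that were not adjacent in the original text, yet they can still be useful features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Mahout.
final class NGramSketch {

    // Returns all contiguous word n-grams of size n, joined by a single space.
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Original token sequence.
        List<String> original = Arrays.asList("the", "plot", "of", "the", "movie");
        // Same text after stop words ("the", "of") were removed beforehand.
        List<String> filtered = Arrays.asList("plot", "movie");

        System.out.println(ngrams(original, 2)); // bigrams of the original text
        System.out.println(ngrams(filtered, 2)); // "plot movie" never occurs in the original
    }
}
```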

On Wed, Oct 20, 2010 at 6:35 AM, JAGANADH G <ja...@gmail.com> wrote:

> Now I have a question.
> 1) The output of prepare20newsgroups is text from which all the
> stop words have been removed, so the text is just a simple collection of
> words. When we apply generateNGramsWithoutLabel() to it, will it generate
> the n-grams accurately?
>

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Thu, Oct 21, 2010 at 8:11 AM, Ted Dunning <te...@gmail.com> wrote:

> This is not good, then.
>
> Did you remove stop words?  This is often important with Naive Bayes.
>
>
I didn't remove the stop words. Now I understand how to use Mahout for
text classification effectively. I was using a Python-based Naive Bayes
classifier that I wrote myself, but for many reasons it was too slow :-)
Mahout makes life brighter.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

This is not good, then.

Did you remove stop words?  This is often important with Naive Bayes.
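
As a rough illustration of the stop-word step asked about above, the sketch below filters tokens against a tiny stop list. The class StopWordFilter and its word list are assumptions made for this example; a real run would more likely rely on an analyzer such as the one passed to prepare20newsgroups elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of Mahout.
final class StopWordFilter {

    // A deliberately tiny, made-up stop list; real lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "it", "of", "and", "to", "in", "this", "was"));

    // Lowercases the text, splits on whitespace, and drops stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("This movie was a waste of time"));
    }
}
```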

On Wed, Oct 20, 2010 at 7:28 PM, JAGANADH G <ja...@gmail.com> wrote:

>
>
> On Thu, Oct 21, 2010 at 4:20 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> If this is testing on held-out data, then this is a pretty respectable
>> result for an untuned system.
>>
>
> This result was obtained on the training set.
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
>
>

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Thu, Oct 21, 2010 at 4:20 AM, Ted Dunning <te...@gmail.com> wrote:

> If this is testing on held-out data, then this is a pretty respectable
> result for an untuned system.
>

This result was obtained on the training set.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

If this is testing on held-out data, then this is a pretty respectable
result for an untuned system.

Are these results on held-out data?

On Wed, Oct 20, 2010 at 6:35 AM, JAGANADH G <ja...@gmail.com> wrote:

> @robin and @ted
>
> I tested it in a different way.
> I created a program to convert input text to the Mahout training format. The
> program removes all punctuation and junk characters from the text, and also
> removes any numbers, such as years and dates. Then it converts the text
> to lowercase. After that, the text is written in the Mahout training
> format (label"\t" text"\n").
>
> After training with the CBayes classifier, I tested it.
> The results are:
> 1) with ng=1 -a=1.0
> Correctly classified instances = 52.5%
> Incorrect = 47.5%
> 2) with ng=2 -a=1.0
> Correctly classified instances = 74.5%
> Incorrect = 25.5%
>
> Now I have a question.
> 1) The output of prepare20newsgroups is text from which all the
> stop words have been removed, so the text is just a simple collection of
> words. When we apply generateNGramsWithoutLabel() to it, will it generate
> the n-grams accurately?
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
>

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

@robin and @ted

I tested it in a different way.
I created a program to convert input text to the Mahout training format. The
program removes all punctuation and junk characters from the text, and also
removes any numbers, such as years and dates. Then it converts the text
to lowercase. After that, the text is written in the Mahout training
format (label"\t" text"\n").
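
The preprocessing described above might look roughly like the following sketch. The poster's actual program is not shown in the thread, so the class TrainingLineFormatter, its method names, and the ASCII-only letter filter are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical illustration class; the poster's real program is not shown in the thread.
final class TrainingLineFormatter {

    // Lowercases the text, then replaces everything except ASCII letters and
    // whitespace (punctuation, digits, other symbols) with spaces.
    // Assumption: English-only input, so non-ASCII letters are dropped as well.
    static String normalize(String text) {
        return text
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z\\s]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Builds one training line in the label<TAB>text<NEWLINE> layout.
    static String toTrainingLine(String label, String text) {
        return label + "\t" + normalize(text) + "\n";
    }

    public static void main(String[] args) {
        System.out.print(toTrainingLine("pos", "Great movie!! Released in 1999."));
        System.out.print(toTrainingLine("neg", "Dull, slow & far too long..."));
    }
}
```

Collapsing each review onto one line matters here because the newline is the record separator in this layout.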

After training with the CBayes classifier, I tested it.
The results are:
1) with ng=1 -a=1.0
Correctly classified instances = 52.5%
Incorrect = 47.5%
2) with ng=2 -a=1.0
Correctly classified instances = 74.5%
Incorrect = 25.5%

Now I have a question.
1) The output of prepare20newsgroups is text from which all the
stop words have been removed, so the text is just a simple collection of
words. When we apply generateNGramsWithoutLabel() to it, will it generate
the n-grams accurately?
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

Remember it is on the training data!

Naive Bayes classifiers have the property that they overfit massively but
still give good results on held out data.  Thus,
when tested on the same data that they trained with, they demonstrate
results that are unrealistically good.

This is still an important thing to look at.  It just isn't really 200 times
lower error rate than any other result ever reported on this dataset.
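
One way to get a number that is not inflated by testing on the training data is to set aside a held-out portion before training. The sketch below is a generic shuffle-and-split helper, not a Mahout utility; the class HoldOutSplit, the 20% test fraction, and the fixed seed are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not part of Mahout.
final class HoldOutSplit {

    // Shuffles a copy of the examples with a fixed seed and returns
    // [trainingSet, heldOutSet]; the held-out set gets testFraction of the data.
    static <T> List<List<T>> split(List<T> examples, double testFraction, long seed) {
        if (testFraction < 0.0 || testFraction > 1.0) {
            throw new IllegalArgumentException("testFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));

        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<T> heldOut = new ArrayList<>(shuffled.subList(0, testSize));
        List<T> training = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));

        List<List<T>> parts = new ArrayList<>();
        parts.add(training);
        parts.add(heldOut);
        return parts;
    }

    public static void main(String[] args) {
        List<String> reviews = Arrays.asList("r1", "r2", "r3", "r4", "r5",
                                             "r6", "r7", "r8", "r9", "r10");
        List<List<String>> parts = split(reviews, 0.2, 42L);
        System.out.println("train on:    " + parts.get(0));
        System.out.println("evaluate on: " + parts.get(1));
    }
}
```

The classifier would then be trained only on the first list and scored only on the second.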

On Mon, Oct 18, 2010 at 11:26 AM, JAGANADH G <ja...@gmail.com> wrote:

> >> > Correctly Classified Instances          :       1995     99.75%
> >> > Incorrectly Classified Instances        :          5      0.25%
> >> > Total Classified Instances              :       2000
> >> >
> >> > =======================================================
> >> > Confusion Matrix
> >> > -------------------------------------------------------
> >> > a     b     <--Classified as
> >> > 995   5     |  1000   a     = pos
> >> > 0     1000  |  1000   b     = neg
> >> > Default Category: unknown: 2
> >> >
> >> >
> >> > With some pruning, you will have a decent enough classifier for
> >> sentiments
>
>
> Wow this is an amazing result :-)

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

>> > Just pushed a bug fix for ngrams. Update your copy. Here is the result
>> with
>> > ngram = 2
>> >
>> > Correctly Classified Instances          :       1995     99.75%
>> > Incorrectly Classified Instances        :          5      0.25%
>> > Total Classified Instances              :       2000
>> >
>> > =======================================================
>> > Confusion Matrix
>> > -------------------------------------------------------
>> > a     b     <--Classified as
>> > 995   5     |  1000   a     = pos
>> > 0     1000  |  1000   b     = neg
>> > Default Category: unknown: 2
>> >
>> >
>> > With some pruning, you will have a decent enough classifier for
>> sentiments


Wow this is an amazing result :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

No, this is just on the training data. It's just a sanity check that the
classifier works.

With ms =  5 and mdf = 5

INFO: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       1816      90.8%
Incorrectly Classified Instances        :        184       9.2%
Total Classified Instances              :       2000

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     <--Classified as
818   182   |  1000   a     = pos
2     998   |  1000   b     = neg
Default Category: unknown: 2
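
For readers checking the summary against the confusion matrix, the sketch below recomputes accuracy and per-class recall from the four counts reported above. Rows are taken to be the actual class and columns the predicted class, which matches the row totals of 1000 in the output. The class ConfusionMatrixStats is a hypothetical helper, not a Mahout class.

```java
import java.util.Locale;

// Hypothetical illustration class, not part of Mahout.
final class ConfusionMatrixStats {

    // matrix[actual][predicted]; accuracy is the diagonal sum over the total.
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Fraction of instances of one actual class that were classified correctly.
    static double recall(int[][] matrix, int cls) {
        long rowTotal = 0;
        for (int count : matrix[cls]) {
            rowTotal += count;
        }
        return rowTotal == 0 ? 0.0 : (double) matrix[cls][cls] / rowTotal;
    }

    public static void main(String[] args) {
        // Counts from the ms = 5, mdf = 5 run: row 0 = pos, row 1 = neg.
        int[][] matrix = { {818, 182}, {2, 998} };
        System.out.println(String.format(Locale.ROOT, "accuracy   = %.1f%%", 100 * accuracy(matrix)));
        System.out.println(String.format(Locale.ROOT, "recall pos = %.1f%%", 100 * recall(matrix, 0)));
        System.out.println(String.format(Locale.ROOT, "recall neg = %.1f%%", 100 * recall(matrix, 1)));
    }
}
```

The errors are lopsided: almost all of them are positive reviews classified as negative.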



On Mon, Oct 18, 2010 at 10:49 PM, Ted Dunning <te...@gmail.com> wrote:

> Is this on the training data?  Or held-out test data?
>
> If on test data, this is much, much too accurate to be believed.
>
> On Mon, Oct 18, 2010 at 10:14 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > Just pushed a bug fix for ngrams. Update your copy. Here is the result
> with
> > ngram = 2
> >
> > Correctly Classified Instances          :       1995     99.75%
> > Incorrectly Classified Instances        :          5      0.25%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     b     <--Classified as
> > 995   5     |  1000   a     = pos
> > 0     1000  |  1000   b     = neg
> > Default Category: unknown: 2
> >
> >
> > With some pruning, you will have a decent enough classifier for
> sentiments
> >
>

Re: Querry regarding use of classifier in Mahout

Posted by Ted Dunning <te...@gmail.com>.

Is this on the training data?  Or held-out test data?

If on test data, this is much, much too accurate to be believed.

On Mon, Oct 18, 2010 at 10:14 AM, Robin Anil <ro...@gmail.com> wrote:

> Just pushed a bug fix for ngrams. Update your copy. Here is the result with
> ngram = 2
>
> Correctly Classified Instances          :       1995     99.75%
> Incorrectly Classified Instances        :          5      0.25%
> Total Classified Instances              :       2000
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     b     <--Classified as
> 995   5     |  1000   a     = pos
> 0     1000  |  1000   b     = neg
> Default Category: unknown: 2
>
>
> With some pruning, you will have a decent enough classifier for sentiments
>

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

Just pushed a bug fix for ngrams. Update your copy. Here is the result with
ngram = 2

Correctly Classified Instances          :       1995     99.75%
Incorrectly Classified Instances        :          5      0.25%
Total Classified Instances              :       2000

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     <--Classified as
995   5     |  1000   a     = pos
0     1000  |  1000   b     = neg
Default Category: unknown: 2


With some pruning, you will have a decent enough classifier for sentiments

Re: Querry regarding use of classifier in Mahout

Posted by Drew Farris <dr...@apache.org>.

Ugh, that's a really horrible bug. Thanks for smashing it Robin.

On Mon, Oct 18, 2010 at 12:57 PM, Robin Anil <ro...@gmail.com> wrote:
> Aargh found it
>
>   set("gramSize", Integer.toBinaryString(gramSize));
>
>
> it was setting it as 10
>

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

Aargh found it

   set("gramSize", Integer.toBinaryString(gramSize));


it was setting it as 10
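
The effect of that line can be reproduced in isolation. In the sketch below, GramSizeDemo is a hypothetical class, and the assumption that the stored string is later parsed as a decimal integer is mine; the thread only says the value ended up as 10. The decimal conversion is shown as one plausible alternative, since the committed fix itself does not appear in the thread.

```java
// Hypothetical illustration class, not Mahout code.
final class GramSizeDemo {

    // What the quoted line stores: the base-2 digits of the value.
    static String asBinaryString(int gramSize) {
        return Integer.toBinaryString(gramSize);
    }

    // A decimal conversion, shown here only as a plausible alternative.
    static String asDecimalString(int gramSize) {
        return Integer.toString(gramSize);
    }

    public static void main(String[] args) {
        int gramSize = 2;
        String binary = asBinaryString(gramSize);
        String decimal = asDecimalString(gramSize);

        // Assumption: the reader of the setting parses it as a decimal integer.
        System.out.println("toBinaryString: \"" + binary + "\" parsed as " + Integer.parseInt(binary));
        System.out.println("toString:       \"" + decimal + "\" parsed as " + Integer.parseInt(decimal));
    }
}
```

A gram size of 1 has the same binary and decimal representation, which would explain why the problem only showed up with ngram = 2.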

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

-user +dev

Drew, I just uncovered a bug in the classifier where when using ngram > 1,
to be precise ng ==2, it creates all possible ngrams using the shingle
filter. I am not sure about the fix, investigating....

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Mon, Oct 18, 2010 at 9:11 PM, JAGANADH G <ja...@gmail.com> wrote:

>
>
> On Mon, Oct 18, 2010 at 9:03 PM, Robin Anil <ro...@gmail.com> wrote:
>
>> bin/mahout prepare20newsgroups  -p
>> /Users/robinanil/Downloads/movie_reviews/
>> -o movie -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>>
>> bin/mahout trainclassifier  -i movie/ -o movie-model -type cbayes -a 1.0
>>
>> bin/mahout testclassifier -d movie -m movie-model/ -type bayes  -default
>> unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
>>
>
>
>

From the command line I am able to get the same result.
I will try it through a Java program tomorrow and report back.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Mon, Oct 18, 2010 at 9:03 PM, Robin Anil <ro...@gmail.com> wrote:

> bin/mahout prepare20newsgroups  -p
> /Users/robinanil/Downloads/movie_reviews/
> -o movie -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
>
> bin/mahout trainclassifier  -i movie/ -o movie-model -type cbayes -a 1.0
>
> bin/mahout testclassifier -d movie -m movie-model/ -type bayes  -default
> unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
>

Thanks, Robin.
I think what I did wrong was the preparation of the documents.
I will prepare the documents again in the way you described and try to
classify them with my Java code.
Once that is done, I will update you on the status.

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

bin/mahout prepare20newsgroups -p /Users/robinanil/Downloads/movie_reviews/ \
  -o movie -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer

bin/mahout trainclassifier -i movie/ -o movie-model -type cbayes -a 1.0

bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown \
  -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1

Re: Querry regarding use of classifier in Mahout

Posted by JAGANADH G <ja...@gmail.com>.

On Mon, Oct 18, 2010 at 8:49 PM, Robin Anil <ro...@gmail.com> wrote:

> Let me take a look. I will let you know. How was the preprocessing done?
> Could you enumerate the steps you followed.
>
>
I converted each text to a single line (normalization), then wrote it to a
file. The format is like
pos"\t" text
After that, pos.txt and neg.txt were placed in a directory called "training".
This directory was given to the trainer as input.
-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Re: Querry regarding use of classifier in Mahout

Posted by Robin Anil <ro...@gmail.com>.

Let me take a look. I will let you know. How was the preprocessing done?
Could you enumerate the steps you followed?

On Mon, Oct 18, 2010 at 8:37 PM, JAGANADH G <ja...@gmail.com> wrote:

> Dear All,
> I am trying to use the classifier algorithms from Mahout in a sample
> project.
>
> I tried both NaiveBayesClassifer and CBayesClassifer, but I am getting the
> wrong output. I trained the classifier with the data from
>
> http://code.google.com/p/nltk/source/browse/trunk/nltk_data/packages/corpora/movie_reviews.zip
>
> When I run the prediction module, it says that all the reviews are positive.
>
> Any thoughts?
>
> I think this is the third time I am posting the same question here :-)
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
>