Posted to user@mahout.apache.org by Zehao Jin <ze...@gmail.com> on 2012/05/04 15:08:32 UTC

A Mahout Naive Bayes classifier problem

Dear all,
I'm a Mahout beginner and I need to use the Mahout Naive Bayes classifier for text classification. To get started, I followed the Twenty Newsgroups example:
1. Start the Hadoop cluster.
2. Run the 20 newsgroups example by executing the script $ ./examples/bin/build-20news-bayes.sh and choosing the Naive Bayes method.
3. Finally I got the same confusion matrix as the one posted here: https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
But I have to classify Chinese texts, and I had no clue how, so I read the shell script examples/bin/build-20news-bayes.sh and learned how the example works. Then I followed the same steps:
1. Prepare the training data.
The script uses org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups to format the e-mail texts and produces one document per line: the label followed by the words. Chinese is different from English: words are not separated by spaces, and different character combinations have different meanings, so I used a Chinese text analyzer to split the words and match the format. Each line looks like this: Label + '\t' + word1 word2 ... + '\n' (a sketch of producing this format appears after this list).
(The example analyzer's output and the Chinese analyzer's output are not preserved in this plain text version.)

2. Put the formatted training data and the test data on HDFS. (My Hadoop platform has 1 namenode and 4 datanodes, running on Fedora 14.) The example has 20 categories; my corpus has 10. The example's category list and mine were shown side by side, but they are not preserved in this plain text version.

3. Train and test the classifier on Hadoop.
The example does this:
  ./bin/mahout trainclassifier -i /20news-bydate/bayes-train-input -o /20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
  ./bin/mahout testclassifier -m /20news-bydate/bayes-model -d /20news-bydate/bayes-test-input -type bayes -ng 1 -source hdfs -method mapreduce
My commands follow the example exactly; the only difference is the directories.
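
To make the format in step 1 concrete, here is a minimal sketch of producing one such line with a Lucene analyzer; the analyzer, label, Version constant and output file are only placeholders, not the exact code used:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PrepareChineseLine {
  public static void main(String[] args) throws Exception {
    String label = "sports";          // hypothetical category label
    String text = "...";              // one raw Chinese document goes here
    // Any Lucene analyzer can be plugged in; StandardAnalyzer is used as a placeholder.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

    // Build "label<TAB>term1 term2 term3 ..." for a single document.
    StringBuilder line = new StringBuilder(label).append('\t');
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      line.append(term.toString()).append(' ');
    }
    ts.end();
    ts.close();

    // Append the line as UTF-8; one document per line, as the prepared input expects.
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("bayes-train-input/sports.txt", true), "UTF-8"));
    out.write(line.toString());
    out.newLine();
    out.close();
  }
}

With StandardAnalyzer every Chinese character becomes its own term; switching to a dedicated Chinese analyzer such as Lucene's SmartChineseAnalyzer only changes the line that constructs the analyzer.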

Strangely, I cannot get the same result as the example. I have run the job several times, but the MapReduce job always fails with:
Task xxx failed to report status for 600 seconds. Killing.

What I want to ask is: are the Mahout trainclassifier (./bin/mahout trainclassifier xxx) and testclassifier (./bin/mahout testclassifier xxx) commands suitable for my data, or can they only be used with the 20 newsgroups example? If they cannot be used, it will be really hard for me to implement the Naive Bayes algorithm myself. Or is it a charset problem? Many of my problems seem to be caused by that. Can you give me some support? I have been scratching my head over this for a few days. Thank you very much!


Zehao Jin, SCUT, China.

Re: A Mahout Naive Bayes classifier problem

Posted by Zehao Jin <ze...@gmail.com>.
Thanks for your help, Robin Anil, Lance Norskog and Nimesh Parikh. I've successfully completed the Chinese text classification. I think the problem was the total number of terms; it was too large. It may also have been my fault that I forgot to remove some punctuation. Thanks again.


Zehao Jin, SCUT, China.


Re: A Mahout Naive Bayes classifier problem

Posted by Nimesh Parikh <da...@gmail.com>.
Well, you can try changing the "UTF-8" parameter to something else.

Thanks,
Nimesh
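
If the corpus files are not UTF-8 to begin with, a minimal sketch of re-encoding them before the preparation step; the file names and the GBK source charset are assumptions, not something reported in this thread:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class ReencodeToUtf8 {
  public static void main(String[] args) throws IOException {
    // Read the raw corpus file with its actual charset (GBK is only an assumption here)...
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("corpus-gbk.txt"), "GBK"));
    // ...and write it back out as UTF-8 so it matches the "UTF-8" setting used
    // when preparing the training data.
    BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream("corpus-utf8.txt"), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      out.write(line);
      out.newLine();
    }
    in.close();
    out.close();
  }
}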


Re: A Mahout Naive Bayes classifier problem

Posted by Lance Norskog <go...@gmail.com>.
Yes, it could be the charset problem. Also, it could be the total
number of terms you supply.

Which analyzer do you use? Is it the Lucene "CJKAnalyzer"? This
creates bigrams of all successive words, and so the number of unique
terms explodes. This will cause the Hadoop job to explode. The "Smart
Chinese Analyzer" uses a trained model to split words into 1-, 2- and
3-word clusters. The "Standard Analyzer" will split all CJK words into
single terms. Given that this is a Bayesian model, the Bayesian
assumption would be that single terms are good enough. I would go with
the StandardAnalyzer.

(I learned all of this just now in my day job in the Lucene business.)
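
For illustration, a minimal sketch that runs the same Chinese sentence through both analyzers and prints what each emits; the Version constant and the sample sentence are placeholders, so match them to the Lucene jars on your classpath:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CompareCjkAnalyzers {
  // Print every term the given analyzer emits for the given text.
  static void dump(String name, Analyzer analyzer, String text) throws Exception {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    StringBuilder terms = new StringBuilder();
    while (ts.incrementToken()) {
      terms.append(term.toString()).append(' ');
    }
    ts.end();
    ts.close();
    System.out.println(name + ": " + terms);
  }

  public static void main(String[] args) throws Exception {
    String text = "机器学习文本分类";  // "machine learning text classification"
    dump("CJKAnalyzer     ", new CJKAnalyzer(Version.LUCENE_35), text);      // overlapping bigrams
    dump("StandardAnalyzer", new StandardAnalyzer(Version.LUCENE_35), text); // one term per character
  }
}

CJKAnalyzer turns the sentence into overlapping two-character terms, while StandardAnalyzer emits one term per character, which is why the bigram vocabulary grows much faster across a large corpus.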

-- 
Lance Norskog
goksron@gmail.com

Re: A Mahout Naive Bayes classifier problem

Posted by Robin Anil <ro...@gmail.com>.
Can you provide the console output when you run train or test?