Posted to user@mahout.apache.org by Ryan Rosario <uc...@gmail.com> on 2010/10/05 04:21:17 UTC

Can't Get Bayes Classifier to Work Properly

Hi,

I have a data file that I formatted in the same manner as the
20newsgroups example I have seen. Here is a snippet of my fake data file
(format: key\tword1 word2 word3...\n):

spam    you need some viagra medication my friend
nonspam hi ryan my name is cassie and I am in your class
spam    aviator sunglasses with your name on them
nonspam hello ryan can you do me a favor
spam    free infertility medication here
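
The snippet above shows the label and the text separated by whitespace, so it may be worth confirming that the real file uses a literal tab between them. A minimal sketch, assuming a POSIX shell with awk; the function name check_format and the example path are illustrative and not part of Mahout:

```shell
# Hypothetical helper: report lines that do not have the form "label<TAB>text".
# Prints the offending line numbers and returns non-zero if any are found.
check_format() {
  awk -F'\t' '
    NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected label<TAB>text\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Example usage (the path is an assumption):
# check_format simple_spam/data.txt && echo "format OK"
```

A line that contains only spaces between the label and the words is reported, since awk then sees a single field.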

I am trying to train and test the CBayes classifier. When I test the
classifier, I get the following nonsensical output:

INFO: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0	         �%
Incorrectly Classified Instances        :          0	         �%
Total Classified Instances              :          0

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	<--Classified as
0    	0    	 |  0     	a     = spam
0    	0    	 |  0     	b     = nonspam
Default Category: unknown: 2


[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1 second
[INFO] Finished at: Mon Oct 04 18:13:51 PDT 2010
[INFO] Final Memory: 26M/360M
[INFO] ------------------------------------------------------------------------

I am using the following commands from the wiki to run the jobs:

mvn -e exec:java \
      -Dexec.mainClass=org.apache.mahout.classifier.bayes.TrainClassifier \
      -Dexec.args="-i simple_spam \
                   -o spam-model \
                   -type cbayes \
                   -ng 1 \
                   -source hdfs"

mvn -e exec:java \
      -Dexec.mainClass=org.apache.mahout.classifier.bayes.TestClassifier \
      -Dexec.args="-m spam-model \
                   -d simple_spam \
                   -type cbayes \
                   -ng 1 \
                   -source hdfs \
                   -method sequential"

What might I be doing wrong? Let me know if you need more information.

Thanks,
Ryan

-- 
RRR

Re: Can't Get Bayes Classifier to Work Properly

Posted by Drew Farris <dr...@apache.org>.
Ryan,

Sorry to hear it's still not working for you. I can try to reproduce
your problem to see if I've missed anything important. Are you using a
release version of Mahout, or are you running from trunk?

How many examples are in each of your training sets?

Drew

On Tue, Oct 5, 2010 at 2:02 PM, Ryan Rosario <uc...@gmail.com> wrote:
> Thank you for your help.
>
> I tried dividing the data into two files, spam.txt and nonspam.txt,
> within the directory "simple_spam", but I still have the same problem:
> no useful output.
>
> Ryan

Re: Can't Get Bayes Classifier to Work Properly

Posted by Ryan Rosario <uc...@gmail.com>.
Thank you for your help.

I tried dividing the data into two files, spam.txt and nonspam.txt,
within the directory "simple_spam", but I still have the same problem:
no useful output.

Ryan

On Mon, Oct 4, 2010 at 7:42 PM, Drew Farris <dr...@apache.org> wrote:
> Hi Ryan,
>
> Your format looks good. The -i argument must point to a directory of
> one or more files as input. In the example the 20newsgroups data is
> separated into a single file per class. I'm not certain this is a
> requirement because the class is in the first column after all.
>
> If you are running from trunk, you might find that './bin/mahout
> trainclassifier' and './bin/mahout testclassifier' are easier to
> remember than the somewhat arcane Maven invocation.
>
> HTH,
>
> Drew



-- 
RRR

Re: Can't Get Bayes Classifier to Work Properly

Posted by Drew Farris <dr...@apache.org>.
Hi Ryan,

Your format looks good. The -i argument must point to a directory of
one or more files as input. In the example the 20newsgroups data is
separated into a single file per class. I'm not certain this is a
requirement because the class is in the first column after all.
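
One way to produce a one-file-per-class layout like the 20newsgroups example from a single labeled file is sketched below. It assumes tab-separated "label<TAB>text" lines and a POSIX shell with awk; the function name split_by_class and the paths are illustrative, and whether Mahout actually requires this layout is not confirmed here:

```shell
# Hypothetical helper: split "label<TAB>text" lines into <outdir>/<label>.txt,
# keeping each line unchanged so the label stays in the first column.
split_by_class() {
  infile=$1
  outdir=$2
  mkdir -p "$outdir"
  awk -F'\t' -v dir="$outdir" '
    NF >= 2 { out = dir "/" $1 ".txt"; print > out }
  ' "$infile"
}

# Example usage (paths are assumptions):
# split_by_class all_data.txt simple_spam
```

The resulting directory can then be passed as the -i argument, since it contains one or more input files.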

If you are running from trunk, you might find that './bin/mahout
trainclassifier' and './bin/mahout testclassifier' are easier to
remember than the somewhat arcane Maven invocation.

HTH,

Drew
