Posted to user@mahout.apache.org by Loek Cleophas <lo...@kalooga.com> on 2010/02/18 11:12:22 UTC

20newsgroups example/TestClassifier code - bug/oddity?

Hi

While playing around some more with the 20newsgroups example code for
the Bayes classifiers, I ran into an oddity and what is presumably a bug.

Instead of using (parts of) the 20 newsgroups data set, which was
split nicely into one file per newsgroup with the 'category, tab,
tokens' line format, I generated such files from our company data
set. What I did, though, was generate one file to train with and one
to test with, so each file can contain lines with different
categories, e.g.:

cars	Ferrari red ....
animals	cow cat dog ....

In training, this works fine. In testing, it crashes TestClassifier
with a NullPointerException. I presume that is because either the
file name does not match category.txt for some category name, or
because multiple categories are used inside the single file,
but I also presume that neither should crash the thing :) It also
brings up the question: if the line format in the data files already
contains the category, then why are the file names relevant at all?
That seems redundant to me. Shouldn't TestClassifier simply take all
.txt files in the input data directory and process their contents?
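
[Editor's sketch] The behaviour asked for above can be illustrated in plain Java: read every .txt file in a directory, ignore the file name, and take the category from the text before the first tab on each line. The class and method names (LabeledDoc, LabeledLineReader, parseLine, readDirectory) are hypothetical and are not part of Mahout; this is a minimal sketch under those assumptions, not how TestClassifier is actually implemented.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One labelled example: the category before the tab, the tokens after it. (Hypothetical type.) */
final class LabeledDoc {
    final String label;
    final List<String> tokens;

    LabeledDoc(String label, List<String> tokens) {
        this.label = label;
        this.tokens = tokens;
    }
}

/** Hypothetical helper, not a Mahout class. */
final class LabeledLineReader {

    /** Parses "category<TAB>token token ..."; returns null if there is no tab or no category. */
    static LabeledDoc parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return null;
        }
        String label = line.substring(0, tab).trim();
        if (label.isEmpty()) {
            return null;
        }
        String rest = line.substring(tab + 1).trim();
        List<String> tokens = rest.isEmpty()
            ? new ArrayList<String>()
            : Arrays.asList(rest.split("\\s+"));
        return new LabeledDoc(label, tokens);
    }

    /** Reads every *.txt file in dir; the file name plays no role in labelling. */
    static List<LabeledDoc> readDirectory(Path dir) throws IOException {
        List<LabeledDoc> docs = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    LabeledDoc doc = parseLine(line);
                    if (doc != null) {
                        docs.add(doc);
                    }
                }
            }
        }
        return docs;
    }
}
```

With this shape, a single test file that mixes "cars" and "animals" lines yields one LabeledDoc per line, each carrying its own category, so the name of the file never has to match a category.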

Regards,
Loek

Re: 20newsgroups example/TestClassifier code - bug/oddity?

Posted by Robin Anil <ro...@gmail.com>.

Fixed! Done. Svn up.

https://issues.apache.org/jira/browse/MAHOUT-296

robin

On Thu, Feb 18, 2010 at 4:19 PM, Robin Anil <ro...@gmail.com> wrote:

> Yeah. It definitely shouldn't be. I will post a fix soon (I am at work right
> now). Meanwhile, you can look at the TestClassifier code and run the
> classifier programmatically.
> It's as easy as setting the params, instantiating a classifier context,
> and sending it files one by one.
>
> Robin

Re: 20newsgroups example/TestClassifier code - bug/oddity?

Posted by Robin Anil <ro...@gmail.com>.

Yeah. It definitely shouldn't be. I will post a fix soon (I am at work right
now). Meanwhile, you can look at the TestClassifier code and run the
classifier programmatically.
It's as easy as setting the params, instantiating a classifier context,
and sending it files one by one.

Robin


On Thu, Feb 18, 2010 at 4:15 PM, Loek Cleophas <lo...@kalooga.com> wrote:

> Thank you Robin. The stack trace I got:
>
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:100)
>        at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
>        at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:122)
>        at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:88)
>        at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:63)
>        at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:289)
>        at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:204)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Command line was: bin/hadoop jar
> ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job
> org.apache.mahout.classifier.bayes.TestClassifier -m
> docs-klg-n3-wordLevel-complementary -d
> ~/Code/klg/indextrainingvalidation/docs-klg-mahout-validate -ng 3 -type
> cbayes -source hdfs -method sequential
>
> It did read the model in correctly - and when I substitute a non-existing
> input directory for the one with the non-category-named .txt file, it indeed
> runs normally (classifying 0 instances).
>
> I presume it should be easy to reproduce - if not, let me know and I can
> see whether I can give you our small test data set or some small subset of
> it that I can reproduce it with.
>
> Regards,
> Loek

Re: 20newsgroups example/TestClassifier code - bug/oddity?

Posted by Loek Cleophas <lo...@kalooga.com>.

Thank you Robin. The stack trace I got:

Exception in thread "main" java.lang.NullPointerException
	at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:100)
	at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:117)
	at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:122)
	at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:88)
	at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:63)
	at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:289)
	at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:204)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Command line was:

bin/hadoop jar ~/Downloads/mahout-0.2/examples/target/mahout-examples-0.2.job \
    org.apache.mahout.classifier.bayes.TestClassifier \
    -m docs-klg-n3-wordLevel-complementary \
    -d ~/Code/klg/indextrainingvalidation/docs-klg-mahout-validate \
    -ng 3 -type cbayes -source hdfs -method sequential

It did read the model in correctly - and when I substitute a
non-existing input directory for the one with the non-category-named
.txt file, it indeed runs normally (classifying 0 instances).

I presume it should be easy to reproduce - if not, let me know and I  
can see whether I can give you our small test data set or some small  
subset of it that I can reproduce it with.
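
[Editor's sketch] The trace above ends in ConfusionMatrix.getCount, reached via addInstance. One plausible cause, consistent with the guess in the first message, is that the matrix is asked about a label it was never built with (for example, one derived from a file name that is not a category), so a lookup returns null and the null is then used as a number. That is an assumption, not a reading of the Mahout source. The class below (SimpleConfusionMatrix) is a hypothetical stand-in, not Mahout's ConfusionMatrix; it shows a defensive variant that routes unseen labels to an extra "unknown" row and column instead of throwing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a confusion matrix; NOT Mahout's ConfusionMatrix. */
final class SimpleConfusionMatrix {
    private final Map<String, Integer> labelIndex = new HashMap<>();
    private final int[][] counts;
    private final int unknownIndex;

    SimpleConfusionMatrix(List<String> labels) {
        for (String label : labels) {
            // putIfAbsent keeps indices dense even if a label is listed twice.
            labelIndex.putIfAbsent(label, labelIndex.size());
        }
        // One extra row/column collects every label the matrix was not built with.
        unknownIndex = labelIndex.size();
        counts = new int[unknownIndex + 1][unknownIndex + 1];
    }

    /** Map.get returns null for an unseen label; check it instead of unboxing it. */
    private int indexOf(String label) {
        Integer idx = labelIndex.get(label);
        return idx == null ? unknownIndex : idx;
    }

    void addInstance(String correctLabel, String classifiedLabel) {
        counts[indexOf(correctLabel)][indexOf(classifiedLabel)]++;
    }

    int getCount(String correctLabel, String classifiedLabel) {
        return counts[indexOf(correctLabel)][indexOf(classifiedLabel)];
    }
}
```

Without the null check in indexOf, assigning the result of Map.get to an int would throw a NullPointerException on the first unseen label, which is the kind of failure the trace shows. Whether to count such instances in an "unknown" bucket or to reject them with a clear error message is a design choice; either is friendlier than a bare NPE.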

Regards,
Loek

On Feb 18, 2010, at 11:25, Robin Anil wrote:

> I will look into this.

Re: 20newsgroups example/TestClassifier code - bug/oddity?

Posted by Robin Anil <ro...@gmail.com>.

I will look into this.

On Thu, Feb 18, 2010 at 3:42 PM, Loek Cleophas <lo...@kalooga.com> wrote:

> Hi
>
> While playing around some more with the 20newsgroups example code for the
> Bayes classifiers, I ran into an oddity and a presumable bug:
>
> instead of using (parts of) the 20 newsgroups data set, which was split
> nicely into one file per newsgroup, with the 'category, tab, tokens' line
> format, I generated such a file out of our company data set. What I did
> though was generate 1 file to train, and 1 to test with - so both files
> could have different lines having different categories, e.g.
>
> cars    Ferrari red ....
> animals cow cat dog ....
>
> In training, this works fine.  In testing, it crashes TestClassifier with a
> null pointer exception. I presume that is because either the file name does
> not match category.txt for some category name, or because there's multiple
> categories being used inside the single file - but I also presume that
> neither should crash the thing :) It also brings up the question: if the
> line format in the data files has the category in there, then why are the
> file names relevant at all? Seems like redundancy to me. Shouldn't
> TestClassifier merely take all .txt files in the input data directory and
> process their contents?
>
> Regards,
> Loek
>

Re: 20newsgroups example/TestClassifier code - bug/oddity?

Posted by Sean Owen <sr...@gmail.com>.

An NPE is undoubtedly a bug. Do you have a stack trace?
Beyond that, Robin et al. would have to comment.

On Thu, Feb 18, 2010 at 10:12 AM, Loek Cleophas
<lo...@kalooga.com> wrote:
> In training, this works fine.  In testing, it crashes TestClassifier with a
> null pointer exception. I presume that is because either the file name does
> not match category.txt for some category name, or because there's multiple
> categories being used inside the single file - but I also presume that
> neither should crash the thing :) It also brings up the question: if the
> line format in the data files has the category in there, then why are the
> file names relevant at all? Seems like redundancy to me. Shouldn't
> TestClassifier merely take all .txt files in the input data directory and
> process their contents?