Posted to user@mahout.apache.org by Dimitri Goldin <di...@neofonie.de> on 2012/03/27 18:26:42 UTC
Mahout 0.6 Naive Bayes Accuracy
Hi,
We were evaluating Mahout 0.6's Naive Bayes implementation using a
training set of 70000 documents (we know that with this number of
documents distributed training does not yet make much sense).
During the tests we noticed that the accuracy is around 80% with the
20newsgroups data, which is quite balanced (in the sense that
there are approximately the same number of documents per class). Most
documents tended to be classified as the class with the largest number
of training documents.
Using our own data we only achieved an accuracy between ~35% and ~55%
depending on the classes' sizes within the test-sets.
We also tried replacing the tokenization, which is currently performed
on tabs and spaces using Guava's Splitter class, with Lucene's
GermanAnalyzer. This gave us around 10% more accuracy with balanced
training data, resulting in ~89% accuracy.
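The effect of the tokenizer swap can be illustrated with a small sketch. This is plain java.util.regex, not Guava's Splitter or Lucene's GermanAnalyzer (which additionally stems and removes stopwords); it only shows why splitting on whitespace keeps punctuation glued to tokens, fragmenting the vocabulary:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeDemo {

    // Whitespace splitting, roughly what splitting on tabs and spaces does
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[ \t]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Letter-run tokenization plus lowercasing: punctuation is dropped,
    // so "Welt!" and "Welt" map to the same feature
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}+").matcher(text.toLowerCase());
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        String s = "Hallo, Welt! (Beispiel-Text.)";
        System.out.println(whitespaceTokens(s)); // [Hallo,, Welt!, (Beispiel-Text.)]
        System.out.println(letterTokens(s));     // [hallo, welt, beispiel, text]
    }
}
```

With whitespace splitting, "Welt!" and "Welt" count as distinct features, which thins out the per-class term statistics the classifier learns from.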
Having tried Mallet's Naive Bayes implementation, we achieved ~95%
accuracy without having to balance the training data. Does anybody know
which implementation detail might cause this, or why balance seems to
influence Mahout's implementation much more?
I also found the following thread from fall 2011, which seems to
describe a similar problem:
http://search-lucene.com/m/DLzRcMLnWM
Unfortunately there was no follow-up to this, but maybe someone already
solved it.
Thanks in advance,
Dimitry
Re: Mahout 0.6 Naive Bayes Accuracy
Posted by Dimitri Goldin <di...@neofonie.de>.
Hi Isabel,
First of all, thanks for your reply.
On 03/28/2012 09:10 AM, Isabel Drost wrote:
> On 27.03.2012 Dimitri Goldin wrote:
>> Having tried Mallet's Naive Bayes implementation, we achieved ~95%
>> accuracy without having to balance the training data. Does anybody know
>> which implementation detail might cause this, or why balance seems to
>> influence Mahout's implementation much more?
>
> Without knowing the Mallet implementation: You describe that you tried using two
> tokenizations for your Mahout runs - what are you using when running Mallet?
No "special" tokenization and/or stemming was set for Mallet. The
default tokenizer matches tokens using a regular expression (see
--token-regex at http://mallet.cs.umass.edu/import.php).
We used the following one: "\p{L}+".
> Which Naive Bayes implementation in Mahout did you use?
So far we used the regular Naive Bayes.
> Did you also try running with the complementary naive bayes implementation or
> the logistic regression instead?
I ran Complementary Naive Bayes on the same training sets (unbalanced
and more balanced, the same as in the previous tests) and achieved
roughly the same results as with the regular Naive Bayes, the worst of
which was also around ~30% with pretty unbalanced data (listing below).
For completeness' sake, here are the class sizes in the "worst",
unbalanced training set:
1431 a
4117 b
5348 c
15967 d
2940 e
9095 f
15925 g
10736 h
4441 i
The assigned class still seems to gravitate around the largest class
from the training set, which would be 'd' from the list.
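One hedged way to see why predictions gravitate toward the largest class: in multinomial Naive Bayes the class prior log(N_c/N) is added to every class score, so whenever the per-term likelihoods are weak or near-identical, the prior breaks the tie. A toy computation over the class sizes listed above (an illustration only, not Mahout's actual scoring code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PriorDemo {

    // Class sizes from the unbalanced training set listed above (70000 total)
    static Map<String, Integer> sizes() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("a", 1431); m.put("b", 4117); m.put("c", 5348);
        m.put("d", 15967); m.put("e", 2940); m.put("f", 9095);
        m.put("g", 15925); m.put("h", 10736); m.put("i", 4441);
        return m;
    }

    // log P(c) = log(N_c / N), the prior term added to every class score
    static double logPrior(Map<String, Integer> sizes, String c) {
        int total = sizes.values().stream().mapToInt(Integer::intValue).sum();
        return Math.log((double) sizes.get(c) / total);
    }

    public static void main(String[] args) {
        Map<String, Integer> m = sizes();
        // 'd' gets the least negative (largest) log prior, so near-ties in
        // the likelihood term are broken in its favour.
        System.out.printf("log P(d) = %.3f%n", logPrior(m, "d")); // about -1.48
        System.out.printf("log P(a) = %.3f%n", logPrior(m, "a")); // about -3.89
    }
}
```

The gap of roughly 2.4 nats between 'd' and 'a' means a document needs correspondingly stronger term evidence for 'a' before it can outscore 'd'.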
Yes, we evaluated the logistic regression (not the adaptive variant),
encoding the features using LuceneTextValueEncoder, the GermanAnalyzer
and a list of stopwords. The accuracy was ~82%, though
we did not compare it to any other implementations.
Thanks,
Dimitry
Re: Mahout 0.6 Naive Bayes Accuracy
Posted by Isabel Drost <is...@apache.org>.
On 27.03.2012 Dimitri Goldin wrote:
> Having tried Mallet's Naive Bayes implementation, we achieved ~95%
> accuracy without having to balance the training data. Does anybody know
> which implementation detail might cause this, or why balance seems to
> influence Mahout's implementation much more?
Without knowing the Mallet implementation: You describe that you tried using two
tokenizations for your Mahout runs - what are you using when running Mallet?
Which Naive Bayes implementation in Mahout did you use?
Did you also try running with the complementary naive bayes implementation or
the logistic regression instead?
Isabel