Posted to user@mahout.apache.org by Dimitri Goldin <di...@neofonie.de> on 2012/03/27 18:26:42 UTC

Mahout 0.6 Naive Bayes Accuracy

Hi,

We were evaluating Mahout 0.6's Naive Bayes implementation using a
training set of 70000 documents (we know that with this number of
documents, distributed training does not make much sense yet).

During the tests we noticed that accuracy is around 80% on the
20newsgroups data, which is quite balanced (in the sense that
there are approximately the same number of documents per class). Most
documents tended to be classified as the class with the largest number
of training documents.

Using our own data we only achieved an accuracy between ~35% and ~55%,
depending on the class sizes within the test sets.
We also tried replacing the tokenization, which is currently performed
on tabs and spaces using Guava's Splitter class, with Lucene's
GermanAnalyzer. This gave us around 10% more accuracy with
balanced training data, resulting in ~89% accuracy.
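To make the tokenization difference concrete, here is a minimal stdlib-only sketch (the actual Guava Splitter and Lucene GermanAnalyzer calls are not reproduced). Splitting on tabs and spaces alone leaves punctuation glued to tokens, so e.g. "Welt!" and "Welt" become distinct features and the vocabulary fragments:

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceSplitDemo {
    // Approximates Mahout 0.6's default tokenization: split on tabs and spaces only.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.split("[ \\t]+"));
    }

    public static void main(String[] args) {
        // "Welt!" and "Welt" stay distinct features here; an analyzer that
        // strips punctuation (and stems) would merge them into one feature.
        System.out.println(splitOnWhitespace("Hallo Welt! Die Welt dreht sich."));
    }
}
```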

Having tried Mallet's naive bayes implementation, we achieved ~95%
accuracy without having to balance the training data. Does anybody know
which implementation detail might cause this, or why balance seems to
influence Mahout's implementation so much more?

I also found the following thread from fall 2011, which seems to
describe a similar problem:
http://search-lucene.com/m/DLzRcMLnWM

Unfortunately there was no follow-up to this, but maybe someone already
solved it.

Thanks in advance,
     Dimitry

Re: Mahout 0.6 Naive Bayes Accuracy

Posted by Dimitri Goldin <di...@neofonie.de>.
Hi Isabel,

First of all, thanks for your reply.

On 03/28/2012 09:10 AM, Isabel Drost wrote:
> On 27.03.2012 Dimitri Goldin wrote:
>> Having tried Mallet's naive bayes implementation, we achieved ~95%
>> accuracy without having to balance the training data. Does anybody know
>> which implementation detail might cause this, or why balance seems to
>> influence Mahout's implementation so much more?
>
> Without knowing the Mallet implementation: You describe that you tried using two
> tokenizations for your Mahout runs - what are you using when running Mallet?

No "special" tokenization and/or stemming was set for Mallet. The
default tokenizer matches tokens using a regular expression (see
--token-regex at http://mallet.cs.umass.edu/import.php).
We used the following one: "\p{L}+".
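For illustration, the same pattern behaves identically with Java's stdlib regex engine (a sketch, not Mallet's code): "\p{L}+" matches runs of Unicode letters, so punctuation and digits are dropped while umlauted words stay intact:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // Mirrors the token regex we passed to Mallet: runs of Unicode letters.
    static final Pattern TOKEN = Pattern.compile("\\p{L}+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation and the digits "42" are dropped; "schöne" is kept whole.
        System.out.println(tokenize("Hallo, schöne Welt! 42 Mal."));
    }
}
```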

> Which Naive Bayes implementation in Mahout did you use?

So far we used the regular Naive Bayes.

> Did you also try running with the complementary naive bayes implementation or
> the logistic regression instead?

I ran Complementary Naive Bayes on the same training sets (unbalanced
and more balanced, same as in the previous tests) and achieved roughly
the same results as with the regular Naive Bayes, the worst of which was
also around ~30% with quite unbalanced data (listing below).

For completeness' sake, here are the class sizes in the "worst",
unbalanced training set:

1431 a
4117 b
5348 c
15967 d
2940 e
9095 f
15925 g
10736 h
4441 i

The assigned class still seems to gravitate toward the largest class
from the training set, which would be 'd' in the list above.
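The pull toward 'd' is what one would expect from the class priors alone. A quick back-of-the-envelope check (illustrative arithmetic only, not Mahout's code) on the counts listed above: with P(c) = n_c / N, class 'd' gets the largest prior (~0.23 of the 70000 documents), so whenever the per-token likelihoods are weak or washed out, the argmax falls back to 'd':

```java
public class ClassPriorDemo {
    // Class counts from the unbalanced training set listed above (classes a..i).
    static final long[] COUNTS = {1431, 4117, 5348, 15967, 2940, 9095, 15925, 10736, 4441};

    // Index of the class with the largest maximum-likelihood prior P(c) = n_c / N.
    static int argmaxPrior(long[] counts) {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        long total = 0;
        for (long c : COUNTS) total += c;
        int best = argmaxPrior(COUNTS);
        // Prints the dominant class label and its prior probability.
        System.out.printf("largest class: %c, prior %.3f%n",
                (char) ('a' + best), (double) COUNTS[best] / total);
    }
}
```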

Yes, we evaluated the logistic regression (not the adaptive variant),
encoding the features using LuceneTextValueEncoder, the GermanAnalyzer,
and a stopword list. The accuracy was ~82%, though
we did not compare it with any other implementations.

Thanks,
	Dimitry




Re: Mahout 0.6 Naive Bayes Accuracy

Posted by Isabel Drost <is...@apache.org>.
On 27.03.2012 Dimitri Goldin wrote:
> Having tried Mallet's naive bayes implementation, we achieved ~95%
> accuracy without having to balance the training data. Does anybody know
> which implementation detail might cause this, or why balance seems to
> influence Mahout's implementation so much more?

Without knowing the Mallet implementation: You describe that you tried using two 
tokenizations for your Mahout runs - what are you using when running Mallet?

Which Naive Bayes implementation in Mahout did you use?

Did you also try running with the complementary naive bayes implementation or 
the logistic regression instead?


Isabel