Posted to user@mahout.apache.org by Loek Cleophas <lo...@kalooga.com> on 2010/01/26 12:05:13 UTC

Naive Bayes implementation

Hi

I was looking at the naive Bayes classifier's implementation, due to  
my surprise at the n-gram parameter being used.

My understanding of 'traditional' naive Bayes is that it only  
considers probabilities related to single words/tokens, independent of  
context. Is that not what the Mahout implementation does? Are the N- 
grams used to also model N-sequences of tokens as "words" to be dealt  
with in the algorithm? Or are they used as input in some other way?

It seems it uses "N-grams" of N tokens, not N characters, from what I  
gather from NGrams.java. Or are they not related to token sequences  
but to character sequences somehow?

Any help or pointers to materials the implementation is based on would  
be appreciated. (I know that the Complementary Naive Bayes  
implementation is quite different and based on a paper introducing  
that method - but I'm wondering about the 'normal' Naive Bayes  
implementation.)

Regards,
Loek

Re: Naive Bayes implementation

Posted by Robin Anil <ro...@gmail.com>.
Hi Loek, The n-gram generation treats each sequence of n words as a single token in
the traditional naive Bayes sense, so those word sequences become features and get a
boost alongside the individual words.
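
For illustration, a minimal sketch of word-level n-gram extraction under the
assumption of whitespace tokenization, where every n-gram up to maxN is joined into a
single "token"; the class and method names here are illustrative and not Mahout's
actual NGrams.java API:

import java.util.ArrayList;
import java.util.List;

public class WordNGramSketch {

    // Returns all n-grams of length 1..maxN, each joined into a single "token".
    public static List<String> extract(String text, int maxN) {
        String[] words = text.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder gram = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) {
                    gram.append(' ').append(words[i + j]);
                }
                grams.add(gram.toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // "naive bayes" and "bayes classifier" become features alongside the unigrams.
        System.out.println(extract("naive bayes classifier", 2));
        // prints: [naive, bayes, classifier, naive bayes, bayes classifier]
    }
}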

The current implementation is based on Jason Rennie's paper on complementary
naive Bayes. The optimisations he describes (other than the complementary class)
are also used in the NaiveBayes implementation.
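
For reference, the feature weighting described in Rennie et al.'s paper combines a
log term-frequency transform, IDF, and document-length normalization. A hedged
sketch of that transform follows; the names are illustrative and not tied to
Mahout's classes:

import java.util.HashMap;
import java.util.Map;

public class WeightingSketch {

    // termCounts: raw counts in one document; docFreq: number of documents
    // containing each term; numDocs: total documents in the corpus.
    static Map<String, Double> weight(Map<String, Integer> termCounts,
                                      Map<String, Integer> docFreq,
                                      int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            double tf = Math.log(e.getValue() + 1.0);  // log term-frequency transform
            double idf = Math.log((double) numDocs / docFreq.getOrDefault(e.getKey(), 1));
            double w = tf * idf;
            weights.put(e.getKey(), w);
            sumSquares += w * w;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > 0) {
            for (Map.Entry<String, Double> e : weights.entrySet()) {
                e.setValue(e.getValue() / norm);       // document-length normalization
            }
        }
        return weights;
    }
}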


Robin
On Tue, Jan 26, 2010 at 4:35 PM, Loek Cleophas <lo...@kalooga.com> wrote:

> Hi
>
> I was looking at the naive Bayes classifier's implementation, due to my
> surprise at the n-gram parameter being used.
>
> My understanding of 'traditional' naive Bayes is that it only considers
> probabilities related to single words/tokens, independent of context. Is
> that not what the Mahout implementation does? Are the N-grams used to also
> model N-sequences of tokens as "words" to be dealt with in the algorithm? Or
> are they used as input in some other way?
>
> It seems it uses "N-grams" of N tokens, not N characters, from what I
> gather from NGrams.java. Or are they not related to token sequences but to
> character sequences somehow?
>
> Any help or pointers to materials the implementation is based on would be
> appreciated. (I know that the Complementary Naive Bayes implementation is
> quite different and based on a paper introducing that method - but I'm
> wondering about the 'normal' Naive Bayes implementation.)
>
> Regards,
> Loek
>

Re: Naive Bayes implementation

Posted by Ted Dunning <te...@gmail.com>.
Naive Bayes operates on features, and those features can be anything.  It is, as you
say, common for them to be single words, but there is no reason not to use
additional features, and there is some promise of better performance.  Overtraining
may be worse with more features, but with naive Bayes you are in a state of
sin on that count from the start.
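
To make the point concrete, here is a minimal sketch of multinomial naive Bayes
scoring over arbitrary string features (single words, token n-grams, or anything
else), with add-one smoothing; all names are illustrative and not tied to Mahout:

import java.util.List;
import java.util.Map;

public class NaiveBayesScoringSketch {

    // log P(label) + sum over features of log P(feature | label),
    // with add-one (Laplace) smoothing over the feature vocabulary.
    static double logScore(List<String> features,
                           Map<String, Integer> featureCountsForLabel,
                           int totalFeatureCountForLabel,
                           int vocabularySize,
                           double logPrior) {
        double score = logPrior;
        for (String f : features) {
            int count = featureCountsForLabel.getOrDefault(f, 0);
            score += Math.log((count + 1.0) / (totalFeatureCountForLabel + vocabularySize));
        }
        return score;
    }
}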

On Tue, Jan 26, 2010 at 3:05 AM, Loek Cleophas <lo...@kalooga.com> wrote:

> My understanding of 'traditional' naive Bayes is that it only considers
> probabilities related to single words/tokens, independent of context.




-- 
Ted Dunning, CTO
DeepDyve