Posted to user@mahout.apache.org by Daniel Korzekwa <da...@gmail.com> on 2012/01/18 20:22:06 UTC

Bayes classification - strange results

Hello,

I'm training a Bayes classifier on the following data (6 records):

target, words
T A A A
T A A A
T A A A
T A A B
T A A B
F A A B

using the command:
 ./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model
-type bayes -ng 1 -source hdfs

I then test the classifier against the same data with:
./mahout testclassifier -d /mnt/hgfs/C/daniel/my_fav_data/test -m model
-type bayes -ng 1 -source hdfs -method sequential -v

The resulting classifications are ones I cannot understand: all records are
classified as F. Why is that? Shouldn't they all be classified as T?
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 0 Line(30): T A A
A Expected Label: T Classified Label: F Correct: false
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 1 Line(30): T A A
A Expected Label: T Classified Label: F Correct: false
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 2 Line(30): T A A
A Expected Label: T Classified Label: F Correct: false
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 3 Line(30): T A A
B Expected Label: T Classified Label: F Correct: false
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 4 Line(30): T A A
B Expected Label: T Classified Label: F Correct: false
12/01/18 11:07:55 INFO bayes.TestClassifier: Line Number: 5 Line(30): F A A
B Expected Label: F Classified Label: F Correct: true

My reasoning (no smoothing applied):

Priors:
P(T) = 5/6
P(F) = 1/6

Likelihoods (word frequencies within each class):
P(A|T) = 13/15
P(A|F) = 2/3

P(B|T) = 2/15
P(B|F) = 1/3

Then I calculate the posterior probability, e.g. P(T|A,A,B) = 0.7717, so the
record is classified as T.
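
The hand calculation above can be checked with a short script. This is only a
minimal sketch of the textbook multinomial Naive Bayes model without
smoothing, not Mahout's implementation; the helper names `train` and
`posterior` are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Training data from the message: (label, words)
records = [
    ("T", "A A A"), ("T", "A A A"), ("T", "A A A"),
    ("T", "A A B"), ("T", "A A B"),
    ("F", "A A B"),
]

def train(records):
    """Return class priors and per-class word likelihoods, with no smoothing."""
    label_counts = Counter(label for label, _ in records)
    word_counts = {label: Counter() for label in label_counts}
    for label, text in records:
        word_counts[label].update(text.split())
    priors = {c: Fraction(n, len(records)) for c, n in label_counts.items()}
    likelihoods = {
        c: {w: Fraction(n, sum(counts.values())) for w, n in counts.items()}
        for c, counts in word_counts.items()
    }
    return priors, likelihoods

def posterior(text, priors, likelihoods):
    """Normalised P(class | words) under the naive independence assumption."""
    joint = {}
    for c, prior in priors.items():
        score = prior
        for w in text.split():
            # A word never seen in a class gets probability 0 (no smoothing).
            score *= likelihoods[c].get(w, Fraction(0))
        joint[c] = score
    total = sum(joint.values())
    return {c: s / total for c, s in joint.items()}

priors, likelihoods = train(records)
post_aab = posterior("A A B", priors, likelihoods)
post_aaa = posterior("A A A", priors, likelihoods)
print(float(post_aab["T"]))  # 169/219, about 0.7717
print(float(post_aaa["T"]))  # 2197/2397, about 0.9166
```

Exact fractions are used so that the result can be compared with the
hand-computed 0.7717 without rounding noise.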

What is the reasoning behind classifying all records above as F?

Any help is much appreciated.

PS. I was using Mahout trunk from 16.01.2012.

Regards.
Daniel

-- 
Daniel Korzekwa
Software Engineer
priv: http://danmachine.com
blog: http://blog.danmachine.com

Re: Bayes classification - strange results

Posted by Daniel Korzekwa <da...@gmail.com>.
After analyzing the Mahout Bayes code, I found that priors are not taken into
account; Mahout implements a different variant of Naive Bayes. Today I
evaluated MALLET, a Java machine learning library from
http://mallet.cs.umass.edu . For the trivial test data from my original
message, it gives the results I was expecting to see: all records are
classified as T.

csvline:1 T 0.8709677419354839 F 0.12903225806451615
csvline:2 T 0.8709677419354839 F 0.12903225806451615
csvline:3 T 0.8709677419354839 F 0.12903225806451615
csvline:4 T 0.6923076923076923 F 0.30769230769230765
csvline:5 T 0.6923076923076923 F 0.30769230769230765
csvline:6 T 0.6923076923076923 F 0.30769230769230765
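
To illustrate how much the prior matters on this data set, the sketch below
scores records with and without it, using the unsmoothed estimates from the
original message. It is the textbook multinomial model only, and it is an
assumption that this is a useful stand-in: it reproduces neither Mahout's nor
MALLET's actual scoring. The names `classify` and `use_prior` are
illustrative.

```python
from fractions import Fraction

# Unsmoothed estimates taken from the original message.
priors = {"T": Fraction(5, 6), "F": Fraction(1, 6)}
likelihoods = {
    "T": {"A": Fraction(13, 15), "B": Fraction(2, 15)},
    "F": {"A": Fraction(2, 3), "B": Fraction(1, 3)},
}

def classify(text, use_prior):
    """Pick the class with the highest score, optionally ignoring the prior."""
    scores = {}
    for c in priors:
        score = priors[c] if use_prior else Fraction(1)
        for w in text.split():
            score *= likelihoods[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

for text in ("A A A", "A A B"):
    print(text, classify(text, use_prior=True), classify(text, use_prior=False))
# prints:
# A A A T T
# A A B T F
```

In this simplified model, dropping the prior flips "A A B" to F but leaves
"A A A" as T, so a missing prior on its own would not account for every line
of the Mahout log.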

