Posted to user@mahout.apache.org by Stuart Smith <st...@yahoo.com> on 2012/01/27 21:06:05 UTC

Diagnosing naive bayes results

Hello,

Does naive bayes always classify a document into a category?
Or will it refuse to classify something it cannot?


For example:


I'm working through the naive bayes tutorial in Taming Text - with my own data.
I built a lucene index, ran extract training data, split 90/10, etc.

After looking at the seq dumper on the trained model - I noticed I made a mistake when building the index:
The good/bad documents had a unique id field (in the terms) that didn't get filtered out because of a typo/error in my little java program to build the index.


I went ahead and ran the test just to see what would happen, and the confusion matrix I got was all zeros.
No document was classified correctly or incorrectly.

No document was classified at all.

I suspect this was because it overfit to the unique id field in the training data - which the test vectors would not have.

While this sounds rational, it only explains the results if naive bayes can refuse to classify a document into any category whatsoever.

So I'm just wondering if this is true, or I should be looking for more mistakes.

I'm re-running it right now, but building the index takes a while, so I thought I'd ping the list in the meantime..

Thanks!

Take care,
  -stu

Re: Diagnosing naive bayes results

Posted by Salim <sa...@toralab.org>.

Stuart Smith <stu24mail <at> yahoo.com> writes:

> 
> Hello Salim,
>   The code for the book is up on github:
> 
> https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext
> 
> (I believe the URL is listed in the book?)
> 
> Take care,
>   -stu
> 

Hello Mr. Smith,

the tutorial I found was just a small part of the book :) and the link wasn't
mentioned in it :/.

Thank you for your kind assistance.

Have a nice day,
Salim

Re: Diagnosing naive bayes results

Posted by Stuart Smith <st...@yahoo.com>.

Hello Salim,
  The code for the book is up on github:

https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext

(I believe the URL is listed in the book?)

Take care,
  -stu

________________________________
 From: Salim <sa...@toralab.org>
To: user@mahout.apache.org 
Sent: Friday, February 3, 2012 3:11 AM
Subject: Re: Diagnosing naive bayes results

Stuart Smith <stu24mail <at> yahoo.com> writes:

> 
> Hello,
> 
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
> 
> For example:
> 
> I'm working through the naive bayes tutorial in Taming Text - with my own data.
> I built a lucene index, ran extract training data, split 90/10, etc.

Hello Mr. Smith,

I am following the same tutorial as you, but there is a little problem: I can't
find the extractTrainingData mentioned in Taming Text.

Can you please give me details about the location of this function?

Thank you,

Re: Diagnosing naive bayes results

Posted by Salim <sa...@toralab.org>.

Stuart Smith <stu24mail <at> yahoo.com> writes:

> 
> Hello,
> 
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
> 
> For example:
> 
> I'm working through the naive bayes tutorial in Taming Text - with my own data.
> I built a lucene index, ran extract training data, split 90/10, etc.

Hello Mr. Smith,

I am following the same tutorial as you, but there is a little problem: I can't
find the extractTrainingData mentioned in Taming Text.

Can you please give me details about the location of this function?

Thank you,

Re: Diagnosing naive bayes results - now I'm really stumped

Posted by Ted Dunning <te...@gmail.com>.

These are some pretty strange looking terms popping up here.

Can you share some of your data?

On Sun, Jan 29, 2012 at 11:43 PM, Stuart Smith <st...@yahoo.com> wrote:

> Hello,
>
>    So I eliminated the feature that was basically a document id, and I'm
> still getting the same results.
>
> Based on what's been said on this thread, this should not happen (because
> we should always be classifying into some category):
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from:
> {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes,
> alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
> confusionMatrix=null, encoding=UTF-8, defaultCat=unknown,
> testDirPath=/user/stu/machine_learning/bayes/category-test-data}
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature
> weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature
> weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature
> weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature
> weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good
> -1537123.539470884 1845854.5550999944 -0.8327435849286697
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad
> -1845854.5550999944 1845854.5550999944 -1.0
> 12/01/29 15:30:30 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0             ï¿½%
>
> Yet, this is what I get (from a 90/10 split of the data using the
> splitBayesInput class from Taming Text).
>
>
> So I'm stumped.
>
> I don't even really know where to begin debugging this..
>
>
> And just to rule out the most obvious bonehead mistake:
>
> hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
> Found 2 items
> -rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
> -rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good
>
> Here's a couple snippets from my seqdump:
>
> Key class: class org.apache.mahout.common.StringTuple Value Class: class
> org.apache.hadoop.io.DoubleWritable
> Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
> Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
> Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
> Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
> Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
> Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
> Key: [__WT, good, 0array]: Value: 10.481077499671203
> Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
> Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
> Key: [__WT, good, 0x1]: Value: 1342.2191134942075
> Key: [__WT, good, 0x10000]: Value: 243.74351518918098
>
> Ted,
> If you're interested, I can send over the whole seqdump file just to you,
> but I'm a little wary of posting it to the whole list at this point...
> Once I understand the problem more, I might realize that giving away the
> information won't hurt anything...
>
>
> Thoughts?
>
> Take care,
>   -stu
>
>
>
>
> ________________________________
>  From: Ted Dunning <te...@gmail.com>
> To: user@mahout.apache.org; Stuart Smith <st...@yahoo.com>
> Sent: Saturday, January 28, 2012 12:36 PM
> Subject: Re: Diagnosing naive bayes results
>
> It always tells you the most likely category, but you can redefine the
> output to only trigger if the most likely category really dominates the
> results.
>
> With two categories, this is reasonable.  For a dozen it is much more
> debatable.
>
> This works with the SGD classifiers as well and I have seen this used in a
> multi-level classifier.
>
> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:
>
> > Hello,
> >
> > Does naive bayes always classify a document into a category?
> > Or will it refuse to classify something it cannot?
> >

Re: Diagnosing naive bayes results - now I'm really stumped

Posted by Stuart Smith <st...@yahoo.com>.

Hello,

   So I eliminated the feature that was basically a document id, and I'm still getting the same results.

Based on what's been said on this thread, this should not happen (because we should always be classifying into some category):
12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from: {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes, alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false, confusionMatrix=null, encoding=UTF-8, defaultCat=unknown, testDirPath=/user/stu/machine_learning/bayes/category-test-data}
12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
12/01/29 15:30:30 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0	         ï¿½%

Yet, this is what I get (from a 90/10 split of the data using the splitBayesInput class from Taming Text). 

So I'm stumped. 

I don't even really know where to begin debugging this..

And just to rule out the most obvious bonehead mistake:

hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
Found 2 items
-rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
-rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good

Here's a couple snippets from my seqdump:

Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
Key: [__WT, good, 0x10000]: Value: 243.74351518918098

Ted,
If you're interested, I can send over the whole seqdump file just to you, but I'm a little wary of posting it to the whole list at this point...
Once I understand the problem more, I might realize that giving away the information won't hurt anything...

Thoughts?

Take care,
  -stu

________________________________
 From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org; Stuart Smith <st...@yahoo.com> 
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results

It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.

With two categories, this is reasonable.  For a dozen it is much more
debatable.

This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>

Re: Diagnosing naive bayes results

Posted by Ted Dunning <te...@gmail.com>.

Naive Bayesian scores tend to be overconfident, so it can be difficult to calibrate what exactly they mean in terms of probability.
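
To illustrate why the scores come out overconfident, here is a minimal sketch that is not taken from Mahout. It assumes a two-class problem with equal priors and one piece of evidence that favors a class 2:1; the function names and the `copies` parameter are hypothetical. Naive Bayes adds per-feature log-likelihood ratios as if the features were independent, so the same evidence appearing in ten correlated features gets counted ten times.

```python
import math

def posterior_from_log_odds(log_odds):
    """Convert log-odds for a class into a probability (logistic function)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def naive_posterior(log_likelihood_ratio, copies, prior_log_odds=0.0):
    """Posterior when the same evidence is counted `copies` times.

    Naive Bayes sums log-likelihood ratios per feature, so perfectly
    correlated features each contribute the full ratio again.
    """
    return posterior_from_log_odds(prior_log_odds + copies * log_likelihood_ratio)

# One feature with a 2:1 likelihood ratio: a modest posterior of 2/3.
single = naive_posterior(math.log(2.0), copies=1)

# The same evidence duplicated across ten features: odds become 2**10 to 1.
repeated = naive_posterior(math.log(2.0), copies=10)

print(single, repeated)
```

The evidence is no stronger in the second case, yet the reported score moves very close to 1, which is why the raw scores are hard to read as probabilities.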

Sent from my iPhone

On Jan 28, 2012, at 14:19, Lance Norskog <go...@gmail.com> wrote:

> What algebra can be done on the classification scores? For example:
> 
> Classification of A : 60%
> Classification of B: 80%
> A and B are correct: ?
> A or B are correct: ?
> 
> Of course these exist for probabilities but I have not found handy
> formulae. Do these formulae even exist with log-likelihood?
> 
> On Sat, Jan 28, 2012 at 12:36 PM, Ted Dunning <te...@gmail.com> wrote:
>> It always tells you the most likely category, but you can redefine the
>> output to only trigger if the most likely category really dominates the
>> results.
>> 
>> With two categories, this is reasonable.  For a dozen it is much more
>> debatable.
>> 
>> This works with the SGD classifiers as well and I have seen this used in a
>> multi-level classifier.
>> 
>> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:
>> 
>>> Hello,
>>> 
>>> Does naive bayes always classify a document into a category?
>>> Or will it refuse to classify something it cannot?
>>> 
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com

Re: Diagnosing naive bayes results

Posted by Lance Norskog <go...@gmail.com>.

What algebra can be done on the classification scores? For example:

Classification of A : 60%
Classification of B: 80%
A and B are correct: ?
A or B are correct: ?

Of course these exist for probabilities but I have not found handy
formulae. Do these formulae even exist with log-likelihood?
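
For reference, a minimal sketch of the "and"/"or" algebra in log space. It assumes that the two scores are genuine, calibrated probabilities strictly between 0 and 1 and that A and B are independent, which classifier scores may not satisfy. The function names are illustrative only.

```python
import math

def log_and(log_p_a, log_p_b):
    """log P(A and B) for independent A and B: the logs simply add."""
    return log_p_a + log_p_b

def log_or(log_p_a, log_p_b):
    """log P(A or B) for independent A and B, using 1 - (1 - pA)(1 - pB)."""
    # log1p(-exp(x)) turns log p into log(1 - p)
    log_not_a = math.log1p(-math.exp(log_p_a))
    log_not_b = math.log1p(-math.exp(log_p_b))
    return math.log1p(-math.exp(log_not_a + log_not_b))

# The 60% / 80% example from the question.
p_and = math.exp(log_and(math.log(0.6), math.log(0.8)))
p_or = math.exp(log_or(math.log(0.6), math.log(0.8)))

print(p_and, p_or)
```

Under these assumptions the example works out to 0.6 * 0.8 = 0.48 for "A and B" and 0.6 + 0.8 - 0.48 = 0.92 for "A or B".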

On Sat, Jan 28, 2012 at 12:36 PM, Ted Dunning <te...@gmail.com> wrote:
> It always tells you the most likely category, but you can redefine the
> output to only trigger if the most likely category really dominates the
> results.
>
> With two categories, this is reasonable.  For a dozen it is much more
> debatable.
>
> This works with the SGD classifiers as well and I have seen this used in a
> multi-level classifier.
>
> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:
>
>> Hello,
>>
>> Does naive bayes always classify a document into a category?
>> Or will it refuse to classify something it cannot?
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Diagnosing naive bayes results

Posted by Robin Anil <ro...@gmail.com>.

If the score is 0, then its category is assumed as default. If there is a
score, then naive bayes takes the largest scored, and cnb takes the lowest
scored category.
------
Robin Anil
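
The selection rule described above could be sketched roughly as follows. This is an illustration, not Mahout's code: the function name, the `complementary` flag, and the reading of "score is 0" as "every category scored 0" are assumptions, and `unknown` is borrowed from the `defaultCat=unknown` setting shown in the test log elsewhere in this thread.

```python
def pick_category(scores, complementary=False, default="unknown"):
    """Pick a label from a dict of label -> score.

    complementary=False: plain naive bayes, largest score wins.
    complementary=True: complementary naive bayes (cnb), lowest score wins.
    """
    # Assumed interpretation: no scores, or all scores 0, means fall back to default.
    if not scores or all(score == 0 for score in scores.values()):
        return default
    chooser = min if complementary else max
    return chooser(scores, key=scores.get)

print(pick_category({"good": 0.0, "bad": 0.0}))
print(pick_category({"good": 2.0, "bad": 1.0}))
print(pick_category({"good": 2.0, "bad": 1.0}, complementary=True))
```

If documents end up under the default label, which is neither `good` nor `bad`, they would not count as correct for either category.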


On Sat, Jan 28, 2012 at 5:32 PM, Stuart Smith <st...@yahoo.com> wrote:

>
>
> Any idea if there is a default setting in the mahout implementation of
> naive bayes that has a threshold below which it doesn't trigger? And would
> that explain it not classifying anything?
>
> ..yes, I'll dig around in the code if I need to - but if you know off the
> top of your head... :)
>
>
> Take care,
>  -stu
>
>
>
> ________________________________
>  From: Ted Dunning <te...@gmail.com>
> To: user@mahout.apache.org; Stuart Smith <st...@yahoo.com>
> Sent: Saturday, January 28, 2012 12:36 PM
> Subject: Re: Diagnosing naive bayes results
>
>
> It always tells you the most likely category, but you can redefine the
> output to only trigger if the most likely category really dominates the
> results.
>
> With two categories, this is reasonable.  For a dozen it is much more
> debatable.
>
> This works with the SGD classifiers as well and I have seen this used in a
> multi-level classifier.
>
>
> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:
>
> > Hello,
> >
> > Does naive bayes always classify a document into a category?
> > Or will it refuse to classify something it cannot?
> >
>

Re: Diagnosing naive bayes results

Posted by Stuart Smith <st...@yahoo.com>.

Any idea if there is a default setting in the mahout implementation of naive bayes that has a threshold below which it doesn't trigger? And would that explain it not classifying anything?

..yes, I'll dig around in the code if I need to - but if you know off the top of your head... :)

Take care,
 -stu

________________________________
 From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org; Stuart Smith <st...@yahoo.com> 
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results

It always tells you the most likely category, but you can redefine the output to only trigger if the most likely category really dominates the results.

With two categories, this is reasonable.  For a dozen it is much more debatable.

This works with the SGD classifiers as well and I have seen this used in a multi-level classifier.

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>

Re: Diagnosing naive bayes results

Posted by Ted Dunning <te...@gmail.com>.

It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.

With two categories, this is reasonable.  For a dozen it is much more
debatable.

This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.
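
A minimal sketch of that kind of output rule, assuming the classifier exposes one log-score per category (higher is better). The function name, the `min_margin` parameter, and the `unknown` fallback label are hypothetical and not part of Mahout's API; a sensible margin would have to be tuned on held-out data.

```python
def classify_with_margin(log_scores, min_margin, default="unknown"):
    """Return the best label only if it clearly dominates the runner-up.

    log_scores: dict mapping label -> log-score (higher is better).
    min_margin: required gap between the best and second-best log-score.
    """
    if not log_scores:
        return default
    ranked = sorted(log_scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    if len(ranked) == 1:
        return best_label
    runner_up_score = ranked[1][1]
    # Only "trigger" when the winner beats the runner-up by the margin.
    if best_score - runner_up_score >= min_margin:
        return best_label
    return default

print(classify_with_margin({"good": -10.0, "bad": -20.0}, min_margin=5.0))
print(classify_with_margin({"good": -10.0, "bad": -11.0}, min_margin=5.0))
```

With two categories the gap between the two scores is the whole story; with a dozen, comparing only the top two ignores how the remaining scores are spread.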

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>