Posted to user@mahout.apache.org by Stuart Smith <st...@yahoo.com> on 2012/01/30 00:43:32 UTC

Re: Diagnosing naive bayes results - now I'm really stumped

Hello,

   So I eliminated the feature that was basically a document id, and I'm still getting the same results.

Based on what's been said on this thread, this should not happen (because we should always be classifying into some category):
12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from: {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes, alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false, confusionMatrix=null, encoding=UTF-8, defaultCat=unknown, testDirPath=/user/stu/machine_learning/bayes/category-test-data}
12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
12/01/29 15:30:30 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0	         ï¿½%

Yet this is what I get (from a 90/10 split of the data, made with the splitBayesInput class from Taming Text).

So I'm stumped. 

I don't even really know where to begin debugging this.

And just to rule out the most obvious bonehead mistake:

hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
Found 2 items
-rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
-rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good

Here are a couple of snippets from my seqdump:

Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
Key: [__WT, good, 0x10000]: Value: 243.74351518918098

Ted,
If you're interested, I can send over the whole seqdump file just to you, but I'm a little wary of posting it to the whole list at this point...
Once I understand the problem more, I might realize that giving away the information won't hurt anything...

Thoughts?

Take care,
  -stu

________________________________
 From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org; Stuart Smith <st...@yahoo.com> 
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results

It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.

With two categories, this is reasonable.  For a dozen it is much more
debatable.

This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.
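
The idea of only reporting a category when it really dominates can be sketched roughly as below. This is a minimal illustration in Python, not Mahout's API: the function name `classify_with_threshold`, the `min_margin` parameter, and the assumption that scores are on a log scale where higher means more likely are all hypothetical. The fallback label "unknown" mirrors the defaultCat value shown in the log.

```python
def classify_with_threshold(scores, min_margin, default="unknown"):
    """Return the top-scoring label only if it beats the runner-up by min_margin.

    scores: dict mapping label -> score (assumed: higher means more likely).
    min_margin: hypothetical tuning knob; the required gap between the best
                and second-best score.
    """
    if not scores:
        return default
    # Rank labels from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    if len(ranked) == 1:
        return best_label
    runner_up_score = ranked[1][1]
    # Only "trigger" when the winner clearly dominates.
    if best_score - runner_up_score >= min_margin:
        return best_label
    return default
```

With two categories the margin is simply the gap between them; with a dozen, the gap to the runner-up says less about how the remaining categories compare, which is one way to read the caveat above.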

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <st...@yahoo.com> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>

Re: Diagnosing naive bayes results - now I'm really stumped

Posted by Ted Dunning <te...@gmail.com>.

These are some pretty strange-looking terms popping up here.

Can you share some of your data?
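
For a quick sense of how many terms look like the hex-style tokens in the seqdump snippet, something like the following sketch could be run over the dump text. It assumes each weight line has the form `Key: [__WT, <label>, <term>]: Value: <number>` as in the snippet; the function names and the "hex-like" rule (every comma-separated piece matches `0x...`) are illustrative assumptions, not anything Mahout provides.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the seqdump snippet in this thread.
LINE_RE = re.compile(r"^Key: \[([^,\s]+), ([^,\s]+), (.*)\]: Value: (\S+)$")
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")

def parse_line(line):
    """Return (kind, label, term, value), or None for non-weight lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    kind, label, term, value = m.groups()
    return kind, label, term, float(value)

def hex_like_fraction(lines):
    """Per label, the fraction of terms made up only of 0x... pieces."""
    totals = defaultdict(int)
    hex_like = defaultdict(int)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, label, term, _ = parsed
        totals[label] += 1
        if all(HEX_RE.fullmatch(piece) for piece in term.split(",")):
            hex_like[label] += 1
    return {label: hex_like[label] / totals[label] for label in totals}

SAMPLE = """\
Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
""".splitlines()

print(hex_like_fraction(SAMPLE))
```

The sample here is only a handful of lines copied from the snippet, so the fractions it prints say nothing about the full model.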
