Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2011/02/02 05:42:28 UTC

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989510#comment-12989510 ] 

Lance Norskog commented on MAHOUT-604:
--------------------------------------

Scenario: I was trying out the Wikipedia Bayes classifier example from the wiki. After downloading the full Wikipedia set and splitting it into 1.1m chunks, I used chunk-0001.xml as training data and chunk-0002.xml as test data. I preprocessed both the test and training data with the countries.txt file.

When I test against the same data I trained on (chunk-0001.xml), I get this:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          5       83.3333%
Incorrectly Classified Instances        :          1       16.6667%
Total Classified Instances              :          6

=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      c         <--Classified as
0      1      0          |  1           a     = algeria
0      5      0          |  5           b     = united_states
0      0      0          |  0           c     = unknown
Default Category: unknown: 2


When I test the same model on a different input set (chunk-0002.xml), I get this:

Running on hadoop, using HADOOP_HOME=/lucid/lance/open/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /lucid/lance/open/hadoop-0.20.2/conf
11/01/31 23:49:16 INFO bayes.TestClassifier: Loading model from:
{basePath=../datasets/wikipedia_train1/, classifierType=bayes,
alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
encoding=UTF-8, defaultCat=unknown,
testDirPath=../datasets/wikipedia_input2}
11/01/31 23:49:16 INFO bayes.TestClassifier: Testing Bayes Classifier
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_k/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_kSigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader: 24.58041375116976
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-thetaNormalizer/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-tfIdf/trainer-tfIdf/part-00000
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: algeria
-30159.939567094563 78146.01904096449 -0.38594339081156
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: united_states
-78146.01904096449 78146.01904096449 -1.0
Exception in thread "main" java.lang.NullPointerException
 at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:99)
 at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:115)
 at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:119)
 at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:87)
 at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:69)
 at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:266)
 at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:186)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
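
The trace ends in ConfusionMatrix.getCount, reached from addInstance. One possible explanation, which this report does not confirm, is that the test data contains a label the matrix was never told about, so a map lookup returns null and unboxing it to an int throws. The sketch below reproduces that failure mode in isolation. The class LabelCountSketch, its methods, and the label "france" are hypothetical and are not taken from the Mahout source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Mahout's ConfusionMatrix.
public class LabelCountSketch {
    private final Map<String, Integer> labelIndex = new HashMap<>();
    private final int[][] counts;

    public LabelCountSketch(String... labels) {
        // Assign each known label a row/column index in insertion order.
        for (String label : labels) {
            labelIndex.put(label, labelIndex.size());
        }
        counts = new int[labels.length][labels.length];
    }

    // Unguarded lookup: Map.get returns null for an unseen label, and
    // auto-unboxing that null into an int throws NullPointerException.
    public int unsafeCount(String correctLabel, String classifiedLabel) {
        int row = labelIndex.get(correctLabel);
        int col = labelIndex.get(classifiedLabel);
        return counts[row][col];
    }

    // Guarded lookup: report the unknown label instead of failing on unboxing.
    public int safeCount(String correctLabel, String classifiedLabel) {
        Integer row = labelIndex.get(correctLabel);
        Integer col = labelIndex.get(classifiedLabel);
        if (row == null || col == null) {
            throw new IllegalArgumentException(
                "Label not known to the matrix: " + correctLabel + " / " + classifiedLabel);
        }
        return counts[row][col];
    }

    public static void main(String[] args) {
        // Labels taken from the confusion matrix printed for chunk-0001.xml.
        LabelCountSketch matrix =
            new LabelCountSketch("algeria", "united_states", "unknown");
        try {
            // "france" stands in for any label absent from the training data.
            matrix.unsafeCount("france", "unknown");
        } catch (NullPointerException e) {
            System.out.println("Unguarded lookup failed with NullPointerException");
        }
        try {
            matrix.safeCount("france", "unknown");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is what is happening, comparing the set of labels in chunk-0002.xml against the labels in the trained model should show whether an unseen label is involved.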

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Lance Norskog
>
> The Bayes Classifier throws an exception when tested with different data than the training data.
