Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2011/02/02 05:42:28 UTC
[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data
[ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989510#comment-12989510 ]
Lance Norskog commented on MAHOUT-604:
--------------------------------------
Scenario: I was trying out the Wikipedia Bayes classifier example from the wiki. After downloading the full Wikipedia set and splitting it into 1.1m chunks, I used chunk-0001.xml as training data and chunk-0002.xml as test data. I preprocessed both the training and the test data with the countries.txt file.
When I test against the same data I trained on (chunk-0001.xml), I get this:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 5 83.3333%
Incorrectly Classified Instances : 1 16.6667%
Total Classified Instances : 6
=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    <--Classified as
0    1    0    |  1    a = algeria
0    5    0    |  5    b = united_states
0    0    0    |  0    c = unknown
Default Category: unknown: 2
When I test the same trained model on a different input set (chunk-0002.xml), I get this:
Running on hadoop, using HADOOP_HOME=/lucid/lance/open/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /lucid/lance/open/hadoop-0.20.2/conf
11/01/31 23:49:16 INFO bayes.TestClassifier: Loading model from:
{basePath=../datasets/wikipedia_train1/, classifierType=bayes,
alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
encoding=UTF-8, defaultCat=unknown,
testDirPath=../datasets/wikipedia_input2}
11/01/31 23:49:16 INFO bayes.TestClassifier: Testing Bayes Classifier
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_k/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_kSigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader: 24.58041375116976
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-thetaNormalizer/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-tfIdf/trainer-tfIdf/part-00000
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: algeria
-30159.939567094563 78146.01904096449 -0.38594339081156
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: united_states
-78146.01904096449 78146.01904096449 -1.0
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:99)
at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:115)
at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:119)
at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:87)
at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:69)
at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:266)
at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
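The trace ends in ConfusionMatrix.getCount, reached through addInstance. One possible cause, not confirmed against the Mahout source, is that the test data contains a label that the matrix was never told about, so a label-to-index lookup returns null and is then unboxed. The sketch below only illustrates that general Java failure mode. The class LabelLookupSketch and its methods are hypothetical names and are not Mahout's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a confusion matrix keyed by label name.
// This is NOT the Mahout ConfusionMatrix; it only shows the failure mode.
class LabelLookupSketch {
    private final Map<String, Integer> labelIndex = new HashMap<>();
    private final int[][] counts;

    LabelLookupSketch(List<String> labels) {
        for (String label : labels) {
            // Assign each known label the next free row/column index.
            labelIndex.put(label, labelIndex.size());
        }
        counts = new int[labels.size()][labels.size()];
    }

    // Unguarded lookup: Map.get returns null for an unknown label, and
    // auto-unboxing that null into an int throws NullPointerException.
    int getCountUnguarded(String correct, String classified) {
        int row = labelIndex.get(correct);
        int col = labelIndex.get(classified);
        return counts[row][col];
    }

    // Guarded lookup: fail with a message that names the unknown label.
    int getCountGuarded(String correct, String classified) {
        Integer row = labelIndex.get(correct);
        Integer col = labelIndex.get(classified);
        if (row == null || col == null) {
            throw new IllegalArgumentException(
                "Label not present in matrix: "
                    + (row == null ? correct : classified));
        }
        return counts[row][col];
    }

    public static void main(String[] args) {
        LabelLookupSketch matrix =
            new LabelLookupSketch(Arrays.asList("algeria", "united_states"));
        try {
            // "france" is an assumed example of a label seen only at test time.
            matrix.getCountUnguarded("france", "algeria");
        } catch (NullPointerException e) {
            System.out.println("Unguarded lookup threw NullPointerException");
        }
        try {
            matrix.getCountGuarded("france", "algeria");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is what happens here, comparing the set of labels in chunk-0002.xml against the labels the model was trained on would confirm or rule it out.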
> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
> Key: MAHOUT-604
> URL: https://issues.apache.org/jira/browse/MAHOUT-604
> Project: Mahout
> Issue Type: Bug
> Reporter: Lance Norskog
>
> The Bayes Classifier throws an exception when tested with different data than the training data.