Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2011/02/02 05:40:29 UTC

[jira] Created: (MAHOUT-604) Bayes Classifier fails on data other than training data

Bayes Classifier fails on data other than training data
-------------------------------------------------------

                 Key: MAHOUT-604
                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
             Project: Mahout
          Issue Type: Bug
            Reporter: Lance Norskog


The Bayes Classifier throws an exception when tested with different data than the training data.







-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-604:
-----------------------------

    Attachment: MAHOUT-604.patch

I think this corrects and clarifies the Preconditions check in a way that avoids this NPE. Lance, can you try it out to confirm?

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>         Attachments: MAHOUT-604.patch
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994725#comment-12994725 ] 

Hudson commented on MAHOUT-604:
-------------------------------

Integrated in Mahout-Quality #629 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/629/])
    MAHOUT-604 avoid an NPE by updating Preconditions check


> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>             Fix For: 0.5
>
>         Attachments: MAHOUT-604.patch
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Updated: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-604:
-----------------------------

    Due Date: 25/Feb/11  (was: 18/Feb/11)
    Assignee: Robin Anil

Lance, can we get any more info on this? The exception in question, to start.

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Resolved: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-604.
------------------------------

    Resolution: Fixed
      Assignee: Sean Owen  (was: Robin Anil)

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Sean Owen
>             Fix For: 0.5
>
>         Attachments: MAHOUT-604.patch
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989510#comment-12989510 ] 

Lance Norskog commented on MAHOUT-604:
--------------------------------------

Scenario: I was trying out the Wikipedia Bayes classifier example from the wiki. After downloading the full Wikipedia set and splitting it into 1.1m chunks, I used chunk-0001.xml as training data and chunk-0002.xml as test data. I preprocessed both test and training data with the countries.txt file.

When I test against the same data I trained on (chunk-0001.xml), I get this:
 =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          5       83.3333%
Incorrectly Classified Instances        :          1       16.6667%
Total Classified Instances              :          6

=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      c         <--Classified as
0      1      0          |  1           a     = algeria
0      5      0          |  5           b     = united_states
0      0      0          |  0           c     = unknown
Default Category: unknown: 2


When I test on a different input set (chunk-0002.xml), I get this:

Running on hadoop, using HADOOP_HOME=/lucid/lance/open/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /lucid/lance/open/hadoop-0.20.2/conf
11/01/31 23:49:16 INFO bayes.TestClassifier: Loading model from:
{basePath=../datasets/wikipedia_train1/, classifierType=bayes,
alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
encoding=UTF-8, defaultCat=unknown,
testDirPath=../datasets/wikipedia_input2}
11/01/31 23:49:16 INFO bayes.TestClassifier: Testing Bayes Classifier
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_k/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-weights/Sigma_kSigma_j/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader: 24.58041375116976
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-thetaNormalizer/part-00000
11/01/31 23:49:16 INFO io.SequenceFileModelReader:
file:/lucid/lance/open/datasets/wikipedia_train1/trainer-tfIdf/trainer-tfIdf/part-00000
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: algeria
-30159.939567094563 78146.01904096449 -0.38594339081156
11/01/31 23:49:17 INFO datastore.InMemoryBayesDatastore: united_states
-78146.01904096449 78146.01904096449 -1.0
Exception in thread "main" java.lang.NullPointerException
 at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:99)
 at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:115)
 at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:119)
 at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:87)
 at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:69)
 at org.apache.mahout.classifier.bayes.TestClassifier.classifySequential(TestClassifier.java:266)
 at org.apache.mahout.classifier.bayes.TestClassifier.main(TestClassifier.java:186)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Lance Norskog
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Updated: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-604:
-----------------------------

          Component/s: Classification
             Due Date: 18/Feb/11
    Affects Version/s: 0.4
        Fix Version/s: 0.5

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992947#comment-12992947 ] 

Lance Norskog commented on MAHOUT-604:
--------------------------------------

> Lance can we get any more info on this? The exception in question, to start.

The first comment has all the details, and a stack trace. I'll rerun it on my setup; maybe my test data got corrupted somehow.

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992952#comment-12992952 ] 

Sean Owen commented on MAHOUT-604:
----------------------------------

Oops. Whatever link I clicked last time must have sent me straight to some sub-section of the issue in the new JIRA layout. It appeared to have no comments. In any event, yes, I bet you can spot the small bug and suggest a way to handle the null here. Patches are great.

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993482#comment-12993482 ] 

Sean Owen commented on MAHOUT-604:
----------------------------------

This sounds like MAHOUT-569 but is not quite the same. I think it's another issue that's been uncovered.

This is the problem in ConfusionMatrix.getCount(). labelMap.get(classifiedLabel) is null. 

The Preconditions call looks like it's attempting to check for that, but the check seems wrong:
!labelMap.containsKey(correctLabel) || labelMap.containsKey(classifiedLabel) || defaultLabel.equals(classifiedLabel)

It seems like it would want to verify that the map contained both keys, but that's not what it says. The change for MAHOUT-569 altered this check but not that logic.

This still doesn't explain why it would be called with invalid input, but fixing the check is perhaps a first step. Am I on the right track?
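
To make the point about the check concrete, here is a minimal Java sketch of a count lookup that requires both labels to be present in the map before using them. It is an illustration only, not the contents of MAHOUT-604.patch and not Mahout's actual ConfusionMatrix: the class name ConfusionMatrixSketch, the checkKnown helper, and the choice to register the default label in the constructor are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a confusion matrix; not Mahout's class. */
class ConfusionMatrixSketch {
  private final Map<String, Integer> labelMap = new HashMap<>();
  private final int[][] counts;

  ConfusionMatrixSketch(List<String> labels, String defaultLabel) {
    for (String label : labels) {
      labelMap.putIfAbsent(label, labelMap.size());
    }
    // Assumption: give the default label its own row and column, so a
    // "classified as default" result can always be counted.
    labelMap.putIfAbsent(defaultLabel, labelMap.size());
    counts = new int[labelMap.size()][labelMap.size()];
  }

  int getCount(String correctLabel, String classifiedLabel) {
    // Require BOTH keys, so labelMap.get(...) below can never return null.
    checkKnown(correctLabel);
    checkKnown(classifiedLabel);
    return counts[labelMap.get(correctLabel)][labelMap.get(classifiedLabel)];
  }

  void incrementCount(String correctLabel, String classifiedLabel) {
    checkKnown(correctLabel);
    checkKnown(classifiedLabel);
    counts[labelMap.get(correctLabel)][labelMap.get(classifiedLabel)]++;
  }

  private void checkKnown(String label) {
    if (!labelMap.containsKey(label)) {
      // Fail with a descriptive message instead of a NullPointerException.
      throw new IllegalArgumentException("Label not found: " + label);
    }
  }

  public static void main(String[] args) {
    ConfusionMatrixSketch matrix =
        new ConfusionMatrixSketch(Arrays.asList("algeria", "united_states"), "unknown");
    matrix.incrementCount("algeria", "unknown");
    System.out.println(matrix.getCount("algeria", "unknown"));
  }
}
```

With a check of this shape, an unknown label fails fast with an IllegalArgumentException naming the label, which would also make it easier to see why the method is being called with unexpected input.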

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994576#comment-12994576 ] 

Lance Norskog commented on MAHOUT-604:
--------------------------------------

Yes, it works now. It classifies against both the training and test data.

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>         Attachments: MAHOUT-604.patch
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.


        

[jira] Commented: (MAHOUT-604) Bayes Classifier fails on data other than training data

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992961#comment-12992961 ] 

Robin Anil commented on MAHOUT-604:
-----------------------------------

Could you list the steps you followed? I am not able to reproduce this issue.

Robin

> Bayes Classifier fails on data other than training data
> -------------------------------------------------------
>
>                 Key: MAHOUT-604
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-604
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Lance Norskog
>            Assignee: Robin Anil
>             Fix For: 0.5
>
>
> The Bayes Classifier throws an exception when tested with different data than the training data.
