Posted to dev@mahout.apache.org by "Robin Anil (JIRA)" <ji...@apache.org> on 2008/06/01 00:50:45 UTC

[jira] Updated: (MAHOUT-60) Complementary Naive Bayes

     [ https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-60:
-----------------------------

    Attachment: MAHOUT-60.patch

Before using this patch, apply the MAHOUT-9 (Implement MapReduce BayesianClassifier) patch and follow the instructions given there.

The 20Newsgroups trainer requires the collapsed version of the dataset, as described in MAHOUT-9.

Steps to get it running:
{quote}
ant extract-20news-18828
ant examples-job

bin/start-all.sh    # start Hadoop
bin/hadoop dfs -put <MAHOUT_HOME>/work/20news-18828-collapse 20newsInput
# Train a classifier; the model is written to a file named "model" in the folder 20newsoutput
bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar org.apache.mahout.examples.classifiers.cbayes.TwentyNewsgroups -t -i 20newsInput/20news-18828-collapse -o 20newsoutput
{quote}
Copy the model file from the DFS to the local filesystem:
{quote}
bin/hadoop dfs -get 20newsoutput 20newsoutput
{quote}
Test on the 20newsgroups data to check how well the classifier handles the training set. Accuracy is around 98.4% on the training set, but the only way to check that the implementation is correct is cross-validation, which is yet to be done.
{quote}
java -Xmx1024M org.apache.mahout.examples.classifiers.cbayes.Test20Newsgroups -p 20newsoutput/model -t work/20news-18828/
{quote}
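
As a rough illustration of the cross-validation mentioned above (not part of the patch), here is a minimal Python sketch of k-fold evaluation. The names k_fold_indices, cross_validate, train_fn and classify_fn are hypothetical and stand in for whatever trainer and classifier are being checked; they are not Mahout APIs.

```python
import random


def k_fold_indices(n_docs, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_docs:
        raise ValueError("k must be between 2 and n_docs")
    order = list(range(n_docs))
    random.Random(seed).shuffle(order)
    # Slice the shuffled order into k nearly equal-sized folds.
    folds = [order[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(docs, labels, k, train_fn, classify_fn, seed=0):
    """Return the mean held-out accuracy over k folds.

    train_fn(docs, labels) -> model and classify_fn(model, doc) -> label
    are hypothetical callbacks supplied by the caller.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(docs), k, seed):
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(classify_fn(model, docs[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Because every document is classified only by a model that never saw it during training, the resulting figure is a fairer check than accuracy on the training set.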

TODO: Add an option to split the 20newsgroups dataset into a training set and a test set. Meanwhile, if you already have separate training and test sets for the 20newsgroups data, you can build the model on one and test on the other.
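
Until such an option exists, a split can be made by hand. The following Python sketch shows one possible approach, a per-newsgroup (stratified) random split; it assumes the documents have already been grouped into one list per newsgroup label, and the function name stratified_split and its parameters are hypothetical, not something the patch provides.

```python
import random


def stratified_split(docs_by_label, test_fraction=0.2, seed=0):
    """Split each label's documents into train and test parts.

    docs_by_label maps a newsgroup name to a list of documents (or file
    names). The same fraction is held out from every label so that the
    class proportions stay roughly equal in both sets.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = {}, {}
    for label in sorted(docs_by_label):
        docs = list(docs_by_label[label])
        rng.shuffle(docs)
        n_test = int(round(len(docs) * test_fraction))
        test[label] = docs[:n_test]
        train[label] = docs[n_test:]
    return train, test
```

The train part can then be written out and used to build the model, with the test part reserved for evaluation.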




> Complementary Naive Bayes
> -------------------------
>
>                 Key: MAHOUT-60
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
>             Project: Mahout
>          Issue Type: Sub-task
>          Components: Classification
>            Reporter: Robin Anil
>         Attachments: MAHOUT-60.patch
>
>
> The focus is to implement an improved text classifier based on this paper http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.
