Posted to commits@mahout.apache.org by co...@apache.org on 2009/03/17 23:35:00 UTC
[CONF] Apache Lucene Mahout: TwentyNewsgroups (page edited)
TwentyNewsgroups (MAHOUT) edited by Grant Ingersoll
Page: http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=99739&originalVersion=9&revisedVersion=10
Content:
---------------------------------------------------------------------
h1. Twenty Newsgroups Classification
[Get Mahout|http://cwiki.apache.org/confluence/display/MAHOUT/index#index-Installation%2FSetup]
In the steps below, MAHOUT_HOME refers to the location where you checked out or installed Mahout.
After downloading the distribution, unzip/untar it into the directory of your choice, then do the following:
h2. Setup:
# In trunk, run mvn install // This compiles everything and creates the Hadoop job file.
# cd examples
# ant -f build-deprecated.xml get-files // Note: the build is in the process of being updated to Maven.
# ant -f build-deprecated.xml extract-20news-18828
Then, from the Hadoop installation directory:
# emacs conf/hadoop-site.xml // Add your local settings per the [quickstart|http://hadoop.apache.org/core/docs/current/quickstart.html]
# bin/hadoop namenode -format // Formats HDFS
# bin/start-all.sh // Starts Hadoop
# bin/hadoop dfs -put <MAHOUT_HOME>/work/20news-18828-collapse 20newsInput // Copies the extracted text to HDFS
h2. Bayes
Then, to train the Bayes Classifier using tri-grams:
{code}hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i 20newsInput -o newsmodel -ng 3 -type bayes{code}
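For readers unfamiliar with the term, the following is a minimal Python sketch of what word tri-grams are. It is only an illustration: the function name word_ngrams and the whitespace tokenization are assumptions made for this example, and Mahout's own tokenizer and its exact handling of the -ng option may differ.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` (illustrative only).

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace; this is not necessarily how Mahout tokenizes documents.
    """
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    for gram in word_ngrams("The shuttle reached orbit today", 3):
        print(gram)
```

A text with fewer than n tokens yields an empty list.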
To test the classifier:
{code}hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p newsmodel -t work/newstest -ng 3 -type bayes{code}
Output might look like:
{code}
08/11/07 16:52:39 INFO bayes.TestClassifier: Done loading model: # labels: 20
08/11/07 16:52:39 INFO bayes.TestClassifier: Done generating Model
08/11/07 16:52:57 INFO bayes.TestClassifier: alt.atheism 96.9962453066333 775/799.0
08/11/07 16:53:15 INFO bayes.TestClassifier: comp.graphics 99.28057553956835 966/973.0
08/11/07 16:53:45 INFO bayes.TestClassifier: comp.os.ms-windows.misc 96.95431472081218 955/985.0
08/11/07 16:53:59 INFO bayes.TestClassifier: comp.sys.ibm.pc.hardware 99.59266802443992 978/982.0
08/11/07 16:54:10 INFO bayes.TestClassifier: comp.sys.mac.hardware 99.47970863683663 956/961.0
08/11/07 16:54:28 INFO bayes.TestClassifier: comp.windows.x 99.59183673469387 976/980.0
08/11/07 16:54:38 INFO bayes.TestClassifier: misc.forsale 98.45679012345678 957/972.0
08/11/07 16:54:50 INFO bayes.TestClassifier: rec.autos 99.4949494949495 985/990.0
08/11/07 16:55:04 INFO bayes.TestClassifier: rec.motorcycles 100.0 994/994.0
08/11/07 16:55:16 INFO bayes.TestClassifier: rec.sport.baseball 99.89939637826961 993/994.0
08/11/07 16:55:36 INFO bayes.TestClassifier: rec.sport.hockey 99.89989989989989 998/999.0
08/11/07 16:55:54 INFO bayes.TestClassifier: sci.crypt 99.39455095862765 985/991.0
08/11/07 16:56:05 INFO bayes.TestClassifier: sci.electronics 98.98063200815494 971/981.0
08/11/07 16:56:27 INFO bayes.TestClassifier: sci.med 99.79797979797979 988/990.0
08/11/07 16:56:44 INFO bayes.TestClassifier: sci.space 99.3920972644377 981/987.0
08/11/07 16:57:06 INFO bayes.TestClassifier: soc.religion.christian 99.49849548645938 992/997.0
08/11/07 16:57:24 INFO bayes.TestClassifier: talk.politics.guns 99.45054945054945 905/910.0
08/11/07 16:57:51 INFO bayes.TestClassifier: talk.politics.mideast 98.82978723404256 929/940.0
08/11/07 16:58:13 INFO bayes.TestClassifier: talk.politics.misc 89.93548387096774 697/775.0
08/11/07 16:58:25 INFO bayes.TestClassifier: talk.religion.misc 61.78343949044586 388/628.0
08/11/07 16:58:25 INFO bayes.TestClassifier: =======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 18369 97.5621%
Incorrectly Classified Instances : 459 2.4379%
Total Classified Instances : 18828
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
994 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 994 a = rec.motorcycles
0 976 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 1 | 980 b = comp.windows.x
7 0 929 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 | 940 c = talk.politics.mideast
0 0 0 905 0 0 1 0 0 0 0 0 0 0 0 0 3 0 1 0 | 910 d = talk.politics.guns
4 1 4 27 388 1 0 1 0 5 1 1 2 2 149 7 2 33 0 0 | 628 e = talk.religion.misc
3 0 0 0 0 985 0 1 0 0 0 0 0 1 0 0 0 0 0 0 | 990 f = rec.autos
0 0 0 0 0 0 993 1 0 0 0 0 0 0 0 0 0 0 0 0 | 994 g = rec.sport.baseball
0 0 0 0 0 0 1 998 0 0 0 0 0 0 0 0 0 0 0 0 | 999 h = rec.sport.hockey
0 0 0 0 0 0 0 0 956 0 2 0 0 0 0 0 0 0 2 1 | 961 i = comp.sys.mac.hardware
0 0 0 0 0 0 0 0 0 981 0 0 5 0 0 1 0 0 0 0 | 987 j = sci.space
0 0 0 0 0 0 0 0 0 0 978 0 1 0 0 0 0 0 2 1 | 982 k = comp.sys.ibm.pc.hardware
1 0 3 36 0 1 2 1 0 5 0 697 4 0 3 3 19 0 0 0 | 775 l = talk.politics.misc
0 2 0 0 0 0 0 0 0 0 2 0 966 0 0 0 0 0 2 1 | 973 m = comp.graphics
1 0 0 0 0 0 0 0 0 0 6 0 0 971 0 0 0 0 3 0 | 981 n = sci.electronics
1 0 0 0 0 0 0 0 1 0 0 0 0 0 992 1 0 1 0 1 | 997 o = soc.religion.christian
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 988 0 0 0 1 | 990 p = sci.med
0 0 0 2 0 0 0 0 0 0 0 0 2 1 0 0 985 0 1 0 | 991 q = sci.crypt
0 0 0 1 1 0 0 0 0 1 0 0 1 0 19 0 1 775 0 0 | 799 r = alt.atheism
1 0 0 0 0 3 1 2 0 0 3 0 0 5 0 0 0 0 957 0 | 972 s = misc.forsale
0 0 0 8 0 0 0 0 0 0 6 0 6 0 0 0 0 0 10 955 | 985 t = comp.os.ms-windows.misc
{code}
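The summary figures can be re-derived from the confusion matrix. The Python sketch below parses the matrix rows copied from the sample output and computes overall accuracy plus per-class precision and recall. It assumes each row is the actual newsgroup and each column the assigned one; the row totals match the denominators in the per-category log lines (for example 799 for alt.atheism), which supports that reading. The function names are hypothetical helpers, not part of Mahout.

```python
# Confusion-matrix rows copied from the sample output above.
MATRIX_TEXT = """\
994 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 994 a = rec.motorcycles
0 976 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 1 | 980 b = comp.windows.x
7 0 929 1 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 | 940 c = talk.politics.mideast
0 0 0 905 0 0 1 0 0 0 0 0 0 0 0 0 3 0 1 0 | 910 d = talk.politics.guns
4 1 4 27 388 1 0 1 0 5 1 1 2 2 149 7 2 33 0 0 | 628 e = talk.religion.misc
3 0 0 0 0 985 0 1 0 0 0 0 0 1 0 0 0 0 0 0 | 990 f = rec.autos
0 0 0 0 0 0 993 1 0 0 0 0 0 0 0 0 0 0 0 0 | 994 g = rec.sport.baseball
0 0 0 0 0 0 1 998 0 0 0 0 0 0 0 0 0 0 0 0 | 999 h = rec.sport.hockey
0 0 0 0 0 0 0 0 956 0 2 0 0 0 0 0 0 0 2 1 | 961 i = comp.sys.mac.hardware
0 0 0 0 0 0 0 0 0 981 0 0 5 0 0 1 0 0 0 0 | 987 j = sci.space
0 0 0 0 0 0 0 0 0 0 978 0 1 0 0 0 0 0 2 1 | 982 k = comp.sys.ibm.pc.hardware
1 0 3 36 0 1 2 1 0 5 0 697 4 0 3 3 19 0 0 0 | 775 l = talk.politics.misc
0 2 0 0 0 0 0 0 0 0 2 0 966 0 0 0 0 0 2 1 | 973 m = comp.graphics
1 0 0 0 0 0 0 0 0 0 6 0 0 971 0 0 0 0 3 0 | 981 n = sci.electronics
1 0 0 0 0 0 0 0 1 0 0 0 0 0 992 1 0 1 0 1 | 997 o = soc.religion.christian
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 988 0 0 0 1 | 990 p = sci.med
0 0 0 2 0 0 0 0 0 0 0 0 2 1 0 0 985 0 1 0 | 991 q = sci.crypt
0 0 0 1 1 0 0 0 0 1 0 0 1 0 19 0 1 775 0 0 | 799 r = alt.atheism
1 0 0 0 0 3 1 2 0 0 3 0 0 5 0 0 0 0 957 0 | 972 s = misc.forsale
0 0 0 8 0 0 0 0 0 0 6 0 6 0 0 0 0 0 10 955 | 985 t = comp.os.ms-windows.misc
"""


def parse_confusion_matrix(text):
    """Return (labels, rows); rows[i][j] counts class-i documents assigned to class j."""
    labels, rows = [], []
    for line in text.strip().splitlines():
        counts, rest = line.split("|")
        rows.append([int(value) for value in counts.split()])
        labels.append(rest.split("=")[1].strip())
    return labels, rows


def per_class_metrics(labels, rows):
    """Return {label: (precision, recall)} computed from the matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = rows[i][i]
        actual = sum(rows[i])                    # documents that belong to the class
        assigned = sum(row[i] for row in rows)   # documents assigned to the class
        precision = true_positives / assigned if assigned else 0.0
        recall = true_positives / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


labels, rows = parse_confusion_matrix(MATRIX_TEXT)
correct = sum(rows[i][i] for i in range(len(rows)))
total = sum(sum(row) for row in rows)
metrics = per_class_metrics(labels, rows)

if __name__ == "__main__":
    print(f"accuracy: {100.0 * correct / total:.4f}%")
    for label in sorted(metrics):
        precision, recall = metrics[label]
        print(f"{label:26s} precision={precision:.4f} recall={recall:.4f}")
```

The diagonal sums to 18369 out of 18828 documents, which reproduces the 97.5621% in the summary. The recall for talk.religion.misc (388/628) matches its per-category log line, and column o of that row shows that most of its errors (149 documents) were assigned to soc.religion.christian.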
h2. Complementary Naive Bayes
To train a CBayes classifier using bi-grams:
{code}hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TrainClassifier -t -i 20newsInput -o newsmodel -ng 2 -type cbayes{code}
To test a CBayes classifier using bi-grams:
{code}hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p newsmodel -t work/newstest -ng 2 -type cbayes{code}
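As background on the cbayes type, here is a toy Python sketch of the general complement naive Bayes idea: each class is scored with term statistics gathered from all the other classes, and the class whose complement fits the document worst is chosen. This is a simplified illustration under stated assumptions (raw term counts, additive smoothing with a hypothetical alpha parameter, no weight normalization). It is not Mahout's implementation, and the function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_complement_nb(docs, alpha=1.0):
    """Train toy complement naive Bayes weights.

    docs:  list of (label, tokens) pairs.
    alpha: additive smoothing constant (an assumption for this sketch).
    Returns {label: {term: log-probability of term in the complement of label}}.
    """
    per_class = defaultdict(Counter)
    for label, tokens in docs:
        per_class[label].update(tokens)

    overall = Counter()
    for counts in per_class.values():
        overall.update(counts)
    vocabulary = set(overall)

    weights = {}
    for label, counts in per_class.items():
        # Term counts over every class except `label`.
        complement = {term: overall[term] - counts[term] for term in vocabulary}
        denominator = sum(complement.values()) + alpha * len(vocabulary)
        weights[label] = {
            term: math.log((complement[term] + alpha) / denominator)
            for term in vocabulary
        }
    return weights


def classify(weights, tokens):
    """Pick the label whose complement explains the tokens worst (lowest score)."""
    def complement_score(label):
        term_weights = weights[label]
        # Terms never seen in training are ignored.
        return sum(term_weights[t] for t in tokens if t in term_weights)

    return min(weights, key=complement_score)


if __name__ == "__main__":
    training = [
        ("sports", ["ball", "game", "team"]),
        ("space", ["orbit", "rocket", "moon"]),
        ("sports", ["game", "score"]),
    ]
    model = train_complement_nb(training)
    print(classify(model, ["rocket", "orbit"]))
    print(classify(model, ["game", "team"]))
```

Estimating from the complement means every class draws on roughly the same amount of training text, which is the usual motivation given for this variant when class sizes are uneven.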