Posted to commits@mahout.apache.org by co...@apache.org on 2009/03/17 23:39:00 UTC
[CONF] Apache Lucene Mahout: WikipediaBayesExample (page edited)
WikipediaBayesExample (MAHOUT) edited by Grant Ingersoll
Page: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=100721&originalVersion=3&revisedVersion=4
Content:
---------------------------------------------------------------------
h1. Intro
The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementation in Mahout. The example (described below) takes a Wikipedia dump, splits it into chunks, and then builds a country-labeled dataset from those chunks. From this dataset, a classifier is trained to predict which country a previously unseen article should be categorized into.
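As a quick refresher, standard (multinomial) Naive Bayes labels a document with the category that maximizes the class prior combined with the per-word likelihoods; schematically:
{code}
label(d) = argmax over categories c of:  log P(c) + sum over words w in d of count(w, d) * log P(w | c)
{code}
The Complementary variant instead scores each category using word statistics from the documents outside that category, which tends to behave better when the categories (here, countries) have very different numbers of articles.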
h1. Running the example
NOTE: Substitute in the appropriate version of Mahout as needed below (i.e. replace 0.1-dev with the appropriate value)
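In the steps below, <MAHOUT_HOME> and <HADOOP_HOME> are placeholders for your local Mahout and Hadoop installation directories. One way to avoid retyping them is to export them as shell variables first (the paths here are only placeholders, substitute your own):
{code}
export MAHOUT_HOME=/path/to/mahout
export HADOOP_HOME=/path/to/hadoop
{code}
You can then use $MAHOUT_HOME and $HADOOP_HOME in the commands instead of editing each one by hand.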
# cd <MAHOUT_HOME>/examples
# ant -f build-deprecated.xml enwiki-files
# Chunk the data into pieces: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-0.1-dev-ex.jar org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code}
# Copy the chunks to HDFS: {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump{code}
# Create the country-based split of the Wikipedia dataset (a sketch of the country.txt format follows this list): {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier (see the note on the Complementary Naive Bayes variant after this list): {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes{code}
# Fetch the input files for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput {code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput{code}
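The WikipediaDatasetCreator step reads its list of categories from the country.txt file passed with -c; the copy shipped under examples/src/test/resources is the one to use for this example. If you want to categorize by a different set of labels, point -c at your own file. A sketch of the expected format, assuming (as the shipped country.txt suggests) one country name per line:
{code}
Australia
Brazil
Canada
United States
{code}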
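The page title covers both Naive Bayes and Complementary Naive Bayes, but the training step above only shows -classifierType bayes. Presumably the Complementary Naive Bayes model is selected by passing a different classifier type (cbayes in the Mahout 0.1-era code); the invocation below simply mirrors the bayes command and is not taken from the original page, so check TrainClassifier's usage output for the exact option value:
{code}
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType cbayes
{code}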
---------------------------------------------------------------------