Posted to commits@mahout.apache.org by co...@apache.org on 2009/03/17 23:39:00 UTC

[CONF] Apache Lucene Mahout: WikipediaBayesExample (page edited)

WikipediaBayesExample (MAHOUT) edited by Grant Ingersoll
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=100721&originalVersion=3&revisedVersion=4

Content:
---------------------------------------------------------------------

h1. Intro

The Mahout examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementation in Mahout.  The example (described below) fetches a Wikipedia dump, splits it into manageable chunks, and then regroups the articles by the country they are associated with.  From these splits, a classifier is trained to predict which country an unseen article should be categorized into.
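
As background on what gets trained: a (multinomial) Naive Bayes classifier scores each candidate country by combining a class prior with per-term likelihoods estimated from the training split, and labels the article with the highest-scoring country.  In LaTeX notation, where n_{t,d} is the count of term t in article d:

{code}
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{t \in d} n_{t,d} \, \log P(t \mid c) \Big]
{code}

The Complementary variant instead estimates term statistics from all classes other than c and scores against them, which tends to be more robust when the per-country training sets are very unevenly sized.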


h1. Running the example
NOTE: Substitute the appropriate version of Mahout where needed below (i.e. replace 0.1-dev with the version you built).  A sketch that chains all of the steps into a single script follows the list.

# Change into the examples directory: {code}cd <MAHOUT_HOME>/examples{code}
# Fetch the Wikipedia dump files via the Ant target: {code}ant -f build-deprecated.xml enwiki-files{code}
# Chunk the data into pieces (the -c argument sets the chunk size, 64 MB here): {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64{code}
# Move the chunks to HDFS: {code}<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump{code}
# Create the country-based split of the Wikipedia dataset (a sample of the category file format appears after this list): {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt{code}
# Train the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes{code}
# Copy the generated input from HDFS back to the local filesystem for testing: {code}<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput{code}
# Test the classifier: {code}<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput{code}
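
The category file passed with -c in step 5 drives the country split.  Judging from how it is used, country.txt is a plain text file with one country name per line; inspect the bundled file before substituting your own, since that shape is an assumption here.  A trimmed-down file of the same form might look like:

{code}
Afghanistan
Albania
Algeria
United States
{code}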
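
For convenience, the whole sequence can be chained into one shell script.  This is a sketch only: it assumes MAHOUT_HOME and HADOOP_HOME are exported, that the 0.1-dev artifacts already exist under examples/target, and it adds no error handling beyond set -e.

{code}
#!/bin/sh
# Sketch: run the Wikipedia Bayes example end to end.
# Assumes MAHOUT_HOME and HADOOP_HOME are set in the environment.
set -e

EXAMPLES=$MAHOUT_HOME/examples
JAR=$EXAMPLES/target/apache-mahout-examples-0.1-dev.jar
JOB=$EXAMPLES/target/apache-mahout-examples-0.1-dev.job

# Steps 1-2: fetch the Wikipedia dump files
cd $EXAMPLES
ant -f build-deprecated.xml enwiki-files

# Step 3: chunk the dump into 64 MB pieces
$HADOOP_HOME/bin/hadoop jar $JAR org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
  -d $EXAMPLES/temp/enwiki-latest-pages-articles.xml \
  -o $EXAMPLES/work/wikipedia/chunks/ -c 64

# Step 4: move the chunks to HDFS
$HADOOP_HOME/bin/hadoop dfs -put $EXAMPLES/work/wikipedia/chunks/ wikipediadump

# Step 5: split the dataset by country
$HADOOP_HOME/bin/hadoop jar $JAR org.apache.mahout.classifier.bayes.WikipediaDatasetCreator \
  -i wikipediadump -o wikipediainput -c $EXAMPLES/src/test/resources/country.txt

# Step 6: train the classifier
$HADOOP_HOME/bin/hadoop jar $JOB org.apache.mahout.classifier.bayes.TrainClassifier \
  -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes

# Steps 7-8: fetch the split data locally and test
$HADOOP_HOME/bin/hadoop dfs -get wikipediainput wikipediainput
$HADOOP_HOME/bin/hadoop jar $JAR org.apache.mahout.classifier.bayes.TestClassifier \
  -p wikipediamodel -t wikipediainput
{code}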


---------------------------------------------------------------------