Posted to commits@mahout.apache.org by co...@apache.org on 2010/09/22 07:26:00 UTC

[CONF] Apache Mahout > Wikipedia Bayes Example

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Wikipedia Bayes Example (https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example)

Change Comment:
---------------------------------------------------------------------
modified for revised instructions based on 0.4 with mahout command line util

Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Intro

The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementations in Mahout.  The example (described below) takes a Wikipedia dump, splits it into chunks, and then splits those chunks by country.  From these splits, a classifier is trained to predict which country an unseen article should be categorized into.


h1. Running the example

# Download the Wikipedia data set from [here | http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2]
# Decompress the bz2 file (for example with bunzip2) to get enwiki-latest-pages-articles.xml.
# Create the directory $MAHOUT_HOME/examples/temp and copy the XML file into it.
# Chunk the data into pieces: {code}$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64{code} {quote}*We strongly suggest you back up the results somewhere else so that you don't have to repeat this step if they get accidentally erased.*{quote}
# This creates the chunks in HDFS. Verify this by executing {code}hadoop fs -ls wikipedia/chunks{code} which will list the XML chunks as chunk-0001.xml and so on.
# Create the country-based split of the Wikipedia data set (the format of the category file is sketched after this list): {code}$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt{code}
# Verify the creation of the input data set by executing {code}hadoop fs -ls wikipediainput{code} and you should see a part-r-00000 file inside the wikipediainput directory.
# Train the classifier: {code}$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel{code} The model will be written to the wikipediamodel folder in HDFS.
# Test the classifier: {code}$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput{code} (the whole sequence of commands is also collected into a single script sketch below)
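
The -c argument to wikipediaDataSetCreator points at a category file; the bundled country.txt is, as far as the example resources go, a plain-text list of country names, one per line, which the job uses to bucket articles by country. As a rough sketch under that one-name-per-line assumption, a trimmed-down custom file (the name my-countries.txt is invented here purely for illustration) could look like:
{code}
United States
United Kingdom
Germany
India
France
{code}
Passing a file like this instead of the bundled country.txt should restrict the split, and therefore the labels the classifier learns, to just those countries.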
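
For convenience, here is a minimal shell sketch that strings the steps above together. It assumes MAHOUT_HOME is set, that the hadoop command is on the PATH and configured for your cluster, and that the compressed dump has already been downloaded into the current directory; it does not go beyond the commands shown in the list.
{code}
#!/bin/sh
# Sketch of the Wikipedia Bayes example end to end (Mahout 0.4 CLI).
# Assumes $MAHOUT_HOME is set, hadoop is on the PATH, and
# enwiki-latest-pages-articles.xml.bz2 is in the current directory.

# Decompress the dump and stage it where the splitter expects it.
bunzip2 -k enwiki-latest-pages-articles.xml.bz2
mkdir -p $MAHOUT_HOME/examples/temp
cp enwiki-latest-pages-articles.xml $MAHOUT_HOME/examples/temp/

# Chunk the dump; the chunks land in HDFS under wikipedia/chunks.
$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
  -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
  -o wikipedia/chunks -c 64
hadoop fs -ls wikipedia/chunks

# Split the chunks by country using the bundled category file.
$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator \
  -i wikipedia/chunks -o wikipediainput \
  -c $MAHOUT_HOME/examples/src/test/resources/country.txt
hadoop fs -ls wikipediainput

# Train, then test, the classifier.
$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput
{code}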

