Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/05 22:09:10 UTC
svn commit: r1671422 - in /mahout/site/mahout_cms/trunk:
content/users/classification/wikipedia-classifier-example.mdtext
templates/standard.html
Author: apalumbo
Date: Sun Apr 5 20:09:09 2015
New Revision: 1671422
URL: http://svn.apache.org/r1671422
Log:
MAHOUT-1559 add documentation for the Wikipedia classification example
Added:
mahout/site/mahout_cms/trunk/content/users/classification/wikipedia-classifier-example.mdtext
Modified:
mahout/site/mahout_cms/trunk/templates/standard.html
Added: mahout/site/mahout_cms/trunk/content/users/classification/wikipedia-classifier-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/wikipedia-classifier-example.mdtext?rev=1671422&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/wikipedia-classifier-example.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/classification/wikipedia-classifier-example.mdtext Sun Apr 5 20:09:09 2015
@@ -0,0 +1,49 @@
+# Wikipedia XML parser and Naive Bayes Classifier Example
+
+## Introduction
+Mahout has an [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1] which will download a recent XML dump of the [English Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/) (the entire dump, if desired). After running the classification script, you can use the [document classification script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala) [2] from the Mahout [spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.
+
+You can run this script to build and test a Naive Bayes classifier for either option (1), 10 arbitrary countries, or option (2), 2 countries (United States and United Kingdom).
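+
+Assuming Mahout has been built and you are working from the Mahout root directory, the script can be launched directly; it presents its options when run (the exact prompt text may vary between versions):
+
+    $ ./examples/bin/classify-wikipedia.sh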
+
+## Overview
+
+By default the script is set to run on a medium-sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1]. However, this is not recommended unless you have the resources to do so. *Be sure to clean your work directory when changing datasets (option (3)).*
+
+The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4]. The only difference is that instead of running `$ mahout seqdirectory` on the unzipped 20 Newsgroups file, you'll run `$ mahout seqwiki` on the unzipped Wikipedia XML dump.
+
+ $ mahout seqwiki
+
+The above command launches `WikipediaToSequenceFile.java`, which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which matches (exactly, if the `-exactMatchOnly` option is set) a line in the category file. If no match is found and the `-all` option is set, the document will be dumped into an "unknown" category. The documents will then be written out as a `<Text,Text>` sequence file of the form (K: /category/document_title, V: document).
+
+There are 3 different example category files available in the /examples/src/test/resources directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset.
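+
+A category file is simply one category per line. For example, a file for option (2) might look like the following (a sketch; see country2.txt for the actual contents):
+
+    United States
+    United Kingdom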
+
+The CLI options for `seqwiki` are as follows:
+
+ --input (-i) input pathname String
+ --output (-o) the output pathname String
+ --categories (-c) the file containing the Wikipedia categories
+ --exactMatchOnly (-e) if set, then the Wikipedia category must match
+ exactly instead of simply containing the category string
+ --all (-all) if set select all categories
+ --removeLabels (-rl) if set, remove [[Category:labels]] from document text after extracting label.
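+
+Putting these options together, a typical invocation might look like the following (the input and output paths here are illustrative):
+
+    $ mahout seqwiki -i wikixml/enwiki-latest-pages-articles.xml \
+        -o wikipediainput \
+        -c examples/src/test/resources/country10.txt \
+        -e -rl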
+
+
+After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` as in the [step-by-step 20 Newsgroups example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). When all of the jobs have finished, a confusion matrix will be displayed.
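+
+In outline, these remaining steps mirror the commands from the 20 Newsgroups walkthrough; a sketch is shown below (directory names are illustrative, and flags may need adjustment for your setup):
+
+    $ mahout seq2sparse -i wikipediainput -o wikipediavecs -lnorm -nv -wt tfidf
+    $ mahout split -i wikipediavecs/tfidf-vectors --trainingOutput train-vectors \
+        --testOutput test-vectors --randomSelectionPct 20 --overwrite \
+        --sequenceFiles -xm sequential
+    $ mahout trainnb -i train-vectors -el -o model -li labelindex -ow
+    $ mahout testnb -i test-vectors -m model -l labelindex -ow -o testing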
+
+# Resources
+
+[1] [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
+
+[2] [Document classification script for the Mahout Spark Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+
+[3] [Example category file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
+
+[4] [Step by step instructions for building a Naive Bayes classifier for 20newsgroups from the command line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
+
+[5] [Mahout MapReduce Naive Bayes](http://mahout.apache.org/users/classification/bayesian.html)
+
+[6] [Mahout Spark Naive Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
+
+[7] [Mahout Scala Spark and H2O Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+
Modified: mahout/site/mahout_cms/trunk/templates/standard.html
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/templates/standard.html?rev=1671422&r1=1671421&r2=1671422&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/templates/standard.html (original)
+++ mahout/site/mahout_cms/trunk/templates/standard.html Sun Apr 5 20:09:09 2015
@@ -184,12 +184,13 @@
<li class="nav-header">Classification</li>
<li><a href="/users/classification/bayesian.html">Naive Bayes</a></li>
<li><a href="/users/classification/hidden-markov-models.html">Hidden Markov Models</a></li>
- <li><a href="/users/classification/logistic-regression.html">Logistic Regression</a></li>
+ <li><a href="/users/classification/logistic-regression.html">Logistic Regression (Single Machine)</a></li>
<li><a href="/users/classification/partial-implementation.html">Random Forest</a></li>
<li class="nav-header">Classification Examples</li>
<li><a href="/users/classification/breiman-example.html">Breiman example</a></li>
<li><a href="/users/classification/twenty-newsgroups.html">20 newsgroups example</a></li>
<li><a href="/users/classification/bankmarketing-example.html">SGD classifier bank marketing</a></li>
+ <li><a href="/users/classification/wikipedia-classifier-example.html">Wikipedia XML parser and classifier</a></li>
<li class="nav-header">Clustering</li>
<li><a href="/users/clustering/k-means-clustering.html">k-Means</a></li>
<li><a href="/users/clustering/canopy-clustering.html">Canopy</a></li>