Posted to commits@mahout.apache.org by bu...@apache.org on 2015/04/05 22:15:02 UTC

svn commit: r946425 - in /websites/staging/mahout/trunk/content: ./ users/classification/wikipedia-classifier-example.html

Author: buildbot
Date: Sun Apr  5 20:15:02 2015
New Revision: 946425

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sun Apr  5 20:15:02 2015
@@ -1 +1 @@
-1671422
+1671423

Modified: websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html (original)
+++ websites/staging/mahout/trunk/content/users/classification/wikipedia-classifier-example.html Sun Apr  5 20:15:02 2015
@@ -260,6 +260,7 @@
 <p>Mahout has an <a href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">example script</a> [1] which will download a recent XML dump of the <a href="http://dumps.wikimedia.org/enwiki/latest/">English Wikipedia database</a> (the entire dump, if desired). After running the classification script, you can use the <a href="https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala">document classification script</a> from the Mahout <a href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">spark-shell</a> to vectorize and classify text from outside of the training and testing corpus using a model built on the Wikipedia dataset.</p>
 <p>You can run this script to build and test a Naive Bayes classifier for either option (1), 10 arbitrary countries, or option (2), 2 countries (United States and United Kingdom).</p>
 <h2 id="oververview">Overview</h2>
+<p>To run the example, simply execute the <code>$MAHOUT_HOME/examples/bin/classify-wikipedia.sh</code> script.</p>
 <p>By default the script is set to run on a medium-sized Wikipedia XML dump.  To run on the full set (the entire English Wikipedia), change the download by commenting out line 78 and uncommenting line 80 of <a href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a> [1]. However, this is not recommended unless you have the resources to do so. <em>Be sure to clean your work directory when changing datasets (option 3).</em></p>
 <p>The step-by-step process for creating a Naive Bayes classifier for the Wikipedia XML dump is very similar to that for <a href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">creating a 20 Newsgroups classifier</a> [4].  The only difference is that instead of running <code>$mahout seqdirectory</code> on the unzipped 20 Newsgroups file, you'll run <code>$mahout seqwiki</code> on the unzipped Wikipedia XML dump.</p>
 <div class="codehilite"><pre>$ <span class="n">mahout</span> <span class="n">seqwiki</span>