Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/23 03:28:47 UTC

svn commit: r1675528 - /mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext

Author: apalumbo
Date: Thu Apr 23 01:28:47 2015
New Revision: 1675528

URL: http://svn.apache.org/r1675528
Log:
edit doc classification tutorial

Modified:
    mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675528&r1=1675527&r2=1675528&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext Thu Apr 23 01:28:47 2015
@@ -1,9 +1,12 @@
 #Classifying a Document with the Mahout Shell
 
-This tutorial assumes that you have Spark configured for the ```spark-shell``` See [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html).  As well we assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable unset) so that the output is put into HDFS.
+This tutorial will take you through the steps used to build and train a Multinomial Naive Bayes text classifier using the ```mahout spark-shell```.
 
-## Downloading and Vectorizing the wikipedia dataset
-*As of Mahout v0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text.  A* [*Spark implemenation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v0.11.* However, to download the wikipedia dataset, extract the bodies of the documentation, label each document and vectorize the text into TF-IDF vectors, we can sipmly run the [wikipedia-classifier.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example.  
+## Prerequisites
+This tutorial assumes that you have your Spark environment variables set for the ```mahout spark-shell```; see [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html).  We also assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable **unset**), as we'll be reading and writing to HDFS.
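+
+A minimal environment setup might look like the following sketch. The paths and the Spark master URL are placeholders you should replace with your own; see the page linked above for the authoritative setup instructions.
+
+    # adjust these paths for your own installation (placeholders, not defaults)
+    export MAHOUT_HOME=/path/to/mahout
+    export SPARK_HOME=/path/to/spark
+    export MASTER=spark://your-spark-master:7077
+
+    # make sure Mahout is NOT running in local mode, so we read and write HDFS
+    unset MAHOUT_LOCAL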
+
+## Downloading and Vectorizing the Wikipedia dataset
+*As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text.  A* [*Spark implementation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v. 0.11.* However, to download the Wikipedia dataset, extract the bodies of the documents, label each document, and vectorize the text into TF-IDF vectors, we can simply run the [wikipedia-classifier.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example.
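+
+Assuming ```MAHOUT_HOME``` points at your Mahout installation (an assumption for this sketch), the script is launched with no arguments and will prompt you with the menu below:
+
+    $MAHOUT_HOME/examples/bin/classify-wikipedia.sh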
 
     Please select a number to choose the corresponding task to run
     1. CBayes (may require increased heap space on yarn)
@@ -11,19 +14,19 @@ This tutorial assumes that you have Spar
     3. clean -- cleans up the work area in /tmp/mahout-work-wiki
     Enter your choice :
 
-Enter (2). This will download a large recent XML dump of the wikipedia database, into a ```/tmp/mahout-work-wiki``` directory, unzip it and  place it into HDFS.  It will run a [MapReduce job to parse the wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom]. It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors.  The script will also a build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html).  When it is completed, you should see a confusion matrix on your screen.  For this tutorial, we will ignore the MapReduce model, and build a new model using Spark based on the vectorization data created by ```seq2sparse```.
+Enter (2). This will download a large recent XML dump of the Wikipedia database into a ```/tmp/mahout-work-wiki``` directory, unzip it, and place it into HDFS.  It will run a [MapReduce job to parse the Wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom] (~11600 documents). It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors.  The script will also build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html).  When it completes, you should see a confusion matrix on your screen.  For this tutorial, we will ignore the MapReduce model and build a new model using Spark, based on the vectorized text output by ```seq2sparse```.
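+
+Once the script finishes, you can sanity-check that the vectorized output made it into HDFS before moving on (the work directory below is the one used by the script):
+
+    hadoop fs -ls /tmp/mahout-work-wiki/wikipediaVecs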
 
 ## Getting Started
 
-Launch the ```mahout-shell```.  There is an example script: ```spark-document-classifier.mscala``` (`.mscala` denotes a Mahout-Scala script which can be run similarly to an R-script).   We will be walking through this script for this tutorial but if you wanted to simply run the script, you could just issue the command: 
+Launch the ```mahout spark-shell```.  There is an example script: ```spark-document-classifier.mscala``` (```.mscala``` denotes a Mahout-Scala script, which can be run similarly to an R script).  We will walk through this script in this tutorial, but if you simply want to run it, you can issue the command: 
 
     mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala
 
-For now, lets take the script apart piece by piece.
+For now, let's take the script apart piece by piece.  You can cut and paste the following code blocks into the ```mahout spark-shell```.
 
 ## Imports
 
-Our mahout Naive Bayes Imports:
+Our Mahout Naive Bayes Imports:
 
     import org.apache.mahout.classifier.naivebayes._
     import org.apache.mahout.classifier.stats._
@@ -35,24 +38,25 @@ Hadoop Imports needed to read our dictio
     import org.apache.hadoop.io.IntWritable
     import org.apache.hadoop.io.LongWritable
 
-## read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
+## Read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
 
     val pathToData = "/tmp/mahout-work-wiki/"
     val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
 
-## extract the category of each observation and aggregate those observation by category
+## Extract the category of each observation and aggregate those observations by category
 
-    val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
+    val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(
+                                                                 fullData)
 
-## build a Muitinomial Naive Bayes model and self test on the training set
+## Build a Multinomial Naive Bayes model and self-test on the training set
 
     val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
     val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
     println(resAnalyzer)
     
-printing the result analyzer will display the confusion matrix
+Printing the ```ResultAnalyzer``` will display the confusion matrix.
 
-## read in the dictionary and document frequency count from HDFS
+## Read in the dictionary and document frequency count from HDFS
     
     val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
                                       classOf[Text],
@@ -75,9 +79,9 @@ printing the result analyzer will displa
     val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
     val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
 
-## define a function to tokeinze and vectorize new text using our current dictionary
+## Define a function to tokenize and vectorize new text using our current dictionary
 
-For this simple example, our function ```vectorizeDocument(...) will tokenize a new document into unigrams using native Java String methods and vectorize usingour dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.).  Here, however we will kwwp ot simple and split ouor text using regexs and native String methods.
+For this simple example, our function ```vectorizeDocument(...)``` will tokenize a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc. (a rough sketch of that approach follows the function below), and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.).  Here, however, we will keep it simple and split our text using regexes and native String methods.
 
     def vectorizeDocument(document: String,
                             dictionaryMap: Map[String,Int],
@@ -107,7 +111,7 @@ For this simple example, our function ``
         vec
     }
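 
 As an aside, if you want richer tokenization than the plain String splitting above, this is roughly where a [Lucene](https://lucene.apache.org/core/) analyzer would plug in. The sketch below is illustrative only and is not part of the example script: ```luceneTokenize``` is a hypothetical helper, and the ```StandardAnalyzer``` constructor may require a ```Version``` argument depending on the Lucene release on your classpath.
 
     import org.apache.lucene.analysis.standard.StandardAnalyzer
     import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
 
     // hypothetical helper: tokenize text with Lucene's StandardAnalyzer
     // in place of the regex/String splitting used in vectorizeDocument(...)
     def luceneTokenize(text: String): Seq[String] = {
         val analyzer = new StandardAnalyzer()  // may need a Version argument on older Lucene releases
         val stream = analyzer.tokenStream("body", text)
         val term = stream.addAttribute(classOf[CharTermAttribute])
         stream.reset()
         val tokens = scala.collection.mutable.ArrayBuffer[String]()
         while (stream.incrementToken()) {
             tokens += term.toString
         }
         stream.end()
         stream.close()
         tokens.toSeq
     }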
 
-## setup our classifier
+## Set up our classifier
 
     val labelMap = model.labelIndex
     val numLabels = model.numLabels
@@ -119,9 +123,9 @@ For this simple example, our function ``
         case _ => new StandardNBClassifier(model)
     }
 
-## define an argmax function 
+## Define an argmax function 
 
-The label with the higest score wins the classification for a given document
+The label with the highest score wins the classification for a given document.
     
     def argmax(v: Vector): (Int, Double) = {
         var bestIdx: Int = Integer.MIN_VALUE
@@ -135,7 +139,7 @@ The label with the higest score wins the
         (bestIdx, bestScore)
     }
 
-## define our final TF(-IDF) vector classifier
+## Define our final TF(-IDF) vector classifier
 
     def classifyDocument(clvec: Vector) : String = {
         val cvec = classifier.classifyFull(clvec)
@@ -211,7 +215,7 @@ The label with the higest score wins the
         " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" + 
         " Kelly; editing by Keith Weir)")
 
-## vectorize and classify our documents
+## Vectorize and classify our documents
 
     val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
     val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
@@ -222,7 +226,7 @@ The label with the higest score wins the
     println("Classifying the news article about Manchester United (united kingdom)")
     classifyDocument(ukVec)
 
-## tie everything together in a new method to classify new text 
+## Tie everything together in a new method to classify new text 
     
     def classifyText(txt: String): String = {
         val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
@@ -230,7 +234,7 @@ The label with the higest score wins the
 
     }
 
-## now we can simply call our classifyText method on any string
+## Now we can simply call our classifyText(...) method on any string
 
     classifyText("Hello world from Queens")
     classifyText("Hello world from London")
\ No newline at end of file