Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/23 03:28:47 UTC
svn commit: r1675528 -
/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
Author: apalumbo
Date: Thu Apr 23 01:28:47 2015
New Revision: 1675528
URL: http://svn.apache.org/r1675528
Log:
edit doc classification tutorial
Modified:
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675528&r1=1675527&r2=1675528&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext Thu Apr 23 01:28:47 2015
@@ -1,9 +1,12 @@
#Classifying a Document with the Mahout Shell
-This tutorial assumes that you have Spark configured for the ```spark-shell``` See [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html). As well we assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable unset) so that the output is put into HDFS.
+This tutorial will take you through the steps needed to build and train a Multinomial Naive Bayes text classifier using the ```mahout spark-shell```.
-## Downloading and Vectorizing the wikipedia dataset
-*As of Mahout v0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text. A* [*Spark implemenation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v0.11.* However, to download the wikipedia dataset, extract the bodies of the documentation, label each document and vectorize the text into TF-IDF vectors, we can sipmly run the [wikipedia-classifier.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example.
+## Prerequisites
+This tutorial assumes that you have your Spark environment variables set for the ```mahout spark-shell```; see [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html). We also assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable **unset**), as we'll be reading and writing to HDFS.
+
+## Downloading and Vectorizing the Wikipedia dataset
+*As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text. A* [*Spark implementation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v. 0.11.* However, to download the Wikipedia dataset, extract the bodies of the documents, label each document and vectorize the text into TF-IDF vectors, we can simply run the [wikipedia-classifier.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example.
Please select a number to choose the corresponding task to run
1. CBayes (may require increased heap space on yarn)
@@ -11,19 +14,19 @@ This tutorial assumes that you have Spar
3. clean -- cleans up the work area in /tmp/mahout-work-wiki
Enter your choice :
-Enter (2). This will download a large recent XML dump of the wikipedia database, into a ```/tmp/mahout-work-wiki``` directory, unzip it and place it into HDFS. It will run a [MapReduce job to parse the wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom]. It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors. The script will also a build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html). When it is completed, you should see a confusion matrix on your screen. For this tutorial, we will ignore the MapReduce model, and build a new model using Spark based on the vectorization data created by ```seq2sparse```.
+Enter (2). This will download a large recent XML dump of the Wikipedia database into a ```/tmp/mahout-work-wiki``` directory, unzip it and place it into HDFS. It will run a [MapReduce job to parse the wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom] (~11600 documents). It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors. The script will also build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html). When it is completed, you should see a confusion matrix on your screen. For this tutorial, we will ignore the MapReduce model, and build a new model using Spark based on the vectorized text output by ```seq2sparse```.
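Before diving into the shell session, it may help to see what a multinomial Naive Bayes classifier does at its core. The following is a minimal, self-contained Scala sketch — plain in-memory Maps and illustrative toy data, not Mahout's distributed implementation — of training with Laplace (add-one) smoothing and classifying by summed log-likelihoods:

```scala
// Toy multinomial Naive Bayes; illustrative only, not Mahout's implementation.
// Each training document is (label, tokens).
val training = Seq(
  ("us", Seq("president", "congress", "washington")),
  ("us", Seq("senate", "washington", "dollar")),
  ("uk", Seq("parliament", "london", "pound")),
  ("uk", Seq("queen", "london", "parliament")))

val vocab  = training.flatMap(_._2).distinct
val labels = training.map(_._1).distinct

// Per-label log-likelihood of each vocabulary term, with add-one smoothing.
def logLikelihoods(label: String): Map[String, Double] = {
  val tokens = training.filter(_._1 == label).flatMap(_._2)
  val counts = tokens.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }
  val total  = tokens.size.toDouble + vocab.size  // denominator includes smoothing mass
  vocab.map(w => w -> math.log((counts.getOrElse(w, 0.0) + 1.0) / total)).toMap
}

val model: Map[String, Map[String, Double]] =
  labels.map(l => l -> logLikelihoods(l)).toMap

// Classify by maximizing the sum of log-likelihoods (uniform priors here).
def classify(doc: Seq[String]): String =
  labels.maxBy(l => doc.flatMap(model(l).get).sum)

println(classify(Seq("london", "parliament")))  // uk
```

Mahout's CBayes/SBayes variants add complement weighting and label-index bookkeeping on top of this same idea.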
## Getting Started
-Launch the ```mahout-shell```. There is an example script: ```spark-document-classifier.mscala``` (`.mscala` denotes a Mahout-Scala script which can be run similarly to an R-script). We will be walking through this script for this tutorial but if you wanted to simply run the script, you could just issue the command:
+Launch the ```mahout spark-shell```. There is an example script: ```spark-document-classifier.mscala``` (```.mscala``` denotes a Mahout-Scala script which can be run similarly to an R script). We will be walking through this script for this tutorial, but if you want to simply run the script, you can just issue the command:
mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala
-For now, lets take the script apart piece by piece.
+For now, let's take the script apart piece by piece. You can copy and paste the following code blocks into the ```mahout spark-shell```.
## Imports
-Our mahout Naive Bayes Imports:
+Our Mahout Naive Bayes Imports:
import org.apache.mahout.classifier.naivebayes._
import org.apache.mahout.classifier.stats._
@@ -35,24 +38,25 @@ Hadoop Imports needed to read our dictio
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
-## read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
+## Read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
val pathToData = "/tmp/mahout-work-wiki/"
val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
-## extract the category of each observation and aggregate those observation by category
+## Extract the category of each observation and aggregate those observations by category
- val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
+ val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(
+ fullData)
-## build a Muitinomial Naive Bayes model and self test on the training set
+## Build a Multinomial Naive Bayes model and self test on the training set
val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
println(resAnalyzer)
-printing the result analyzer will display the confusion matrix
+Printing the ```ResultAnalyzer``` will display the confusion matrix.
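For intuition, a confusion matrix simply tabulates actual labels against predicted labels, with correct classifications on the diagonal. A toy Scala sketch (hypothetical labels and predictions; Mahout's ```ResultAnalyzer``` computes this for you):

```scala
// Toy confusion matrix: rows are actual labels, columns are predicted labels.
// Illustrative only; in the tutorial this comes from Mahout's ResultAnalyzer.
val labels    = Seq("united states", "united kingdom")
val actual    = Seq("united states", "united states", "united kingdom",
                    "united kingdom", "united states")
val predicted = Seq("united states", "united kingdom", "united kingdom",
                    "united kingdom", "united states")

// Count each (actual, predicted) pair.
val matrix: Map[(String, String), Int] =
  actual.zip(predicted).groupBy(identity).map { case (pair, hits) => pair -> hits.size }

// Accuracy is the diagonal mass over the total number of documents.
val accuracy = actual.zip(predicted).count { case (a, p) => a == p }.toDouble / actual.size

for (a <- labels; p <- labels)
  println(s"actual=$a predicted=$p count=${matrix.getOrElse((a, p), 0)}")
println(f"accuracy = $accuracy%.2f")  // 4 of 5 correct
```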
-## read in the dictionary and document frequency count from HDFS
+## Read in the dictionary and document frequency count from HDFS
val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
classOf[Text],
@@ -75,9 +79,9 @@ printing the result analyzer will displa
val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
-## define a function to tokeinze and vectorize new text using our current dictionary
+## Define a function to tokenize and vectorize new text using our current dictionary
-For this simple example, our function ```vectorizeDocument(...) will tokenize a new document into unigrams using native Java String methods and vectorize usingour dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.). Here, however we will kwwp ot simple and split ouor text using regexs and native String methods.
+For this simple example, our function ```vectorizeDocument(...)``` will tokenize a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.). Here, however, we will keep it simple and split our text using regexes and native String methods.
def vectorizeDocument(document: String,
dictionaryMap: Map[String,Int],
@@ -107,7 +111,7 @@ For this simple example, our function ``
vec
}
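The same idea can be sketched in self-contained Scala with plain Maps standing in for the dictionary and document-frequency counts read from HDFS. The weighting below is a simplified ```tf * idf```; it approximates, but is not exactly, the weighting ```seq2sparse``` applies, and the corpus size is a hypothetical constant:

```scala
// Simplified TF-IDF vectorization against a fixed dictionary.
// Illustrative only: Mahout's seq2sparse uses its own weighting/normalization.
val numDocs = 1000L  // hypothetical corpus size

def vectorize(document: String,
              dictionary: Map[String, Int],
              dfCounts: Map[Int, Long]): Map[Int, Double] = {
  // Tokenize into lowercase unigrams using native String methods.
  val tokens = document.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)
  val tf = tokens.groupBy(identity).map { case (w, ws) => w -> ws.size }
  // Keep only terms present in the dictionary; weight each by tf * idf.
  tf.collect { case (w, count) if dictionary.contains(w) =>
    val idx = dictionary(w)
    val idf = math.log(numDocs.toDouble / dfCounts(idx))
    idx -> count * idf
  }
}

val dict = Map("london" -> 0, "parliament" -> 1)
val df   = Map(0 -> 100L, 1 -> 10L)
val vec  = vectorize("London, London calling Parliament!", dict, df)
```

Out-of-dictionary terms ("calling" above) are simply dropped, just as the tutorial's function ignores tokens absent from the ```seq2sparse``` dictionary.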
-## setup our classifier
+## Setup our classifier
val labelMap = model.labelIndex
val numLabels = model.numLabels
@@ -119,9 +123,9 @@ For this simple example, our function ``
case _ => new StandardNBClassifier(model)
}
-## define an argmax function
+## Define an argmax function
-The label with the higest score wins the classification for a given document
+The label with the highest score wins the classification for a given document.
def argmax(v: Vector): (Int, Double) = {
var bestIdx: Int = Integer.MIN_VALUE
@@ -135,7 +139,7 @@ The label with the higest score wins the
(bestIdx, bestScore)
}
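The same argmax idea in plain Scala, over an Array of scores rather than Mahout's ```Vector``` type:

```scala
// argmax over an array of scores: returns (index, score) of the maximum.
def argmaxArray(scores: Array[Double]): (Int, Double) =
  scores.zipWithIndex.maxBy(_._1).swap

println(argmaxArray(Array(0.1, 2.5, 0.7)))  // (1, 2.5)
```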
-## define our final TF(-IDF) vector classifier
+## Define our final TF(-IDF) vector classifier
def classifyDocument(clvec: Vector) : String = {
val cvec = classifier.classifyFull(clvec)
@@ -211,7 +215,7 @@ The label with the higest score wins the
" ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" +
" Kelly; editing by Keith Weir)")
-## vectorize and classify our documents
+## Vectorize and classify our documents
val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
@@ -222,7 +226,7 @@ The label with the higest score wins the
println("Classifying the news article about Manchester United (united kingdom)")
classifyDocument(ukVec)
-## tie everything together in a new method to classify new text
+## Tie everything together in a new method to classify new text
def classifyText(txt: String): String = {
val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
@@ -230,7 +234,7 @@ The label with the higest score wins the
}
-## now we can simply call our classifyText method on any string
+## Now we can simply call our classifyText(...) method on any string
classifyText("Hello world from Queens")
classifyText("Hello world from London")
\ No newline at end of file