Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/23 02:57:50 UTC
svn commit: r1675527 -
/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
Author: apalumbo
Date: Thu Apr 23 00:57:50 2015
New Revision: 1675527
URL: http://svn.apache.org/r1675527
Log:
add a classify document from the shell tutorial
Added:
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
Added: mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675527&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext Thu Apr 23 00:57:50 2015
@@ -0,0 +1,236 @@
+# Classifying a Document with the Mahout Shell
+
+This tutorial assumes that you have Spark installed and configured for the Mahout ```spark-shell```; see [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html). We also assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable unset) so that the output is written to HDFS.
+
+## Downloading and Vectorizing the Wikipedia dataset
+*As of Mahout v0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text. A* [*Spark implementation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v0.11.* However, to download the Wikipedia dataset, extract the bodies of the documents, label each document and vectorize the text into TF-IDF vectors, we can simply run the [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example, which prompts us to choose one of the following tasks:
+
+ Please select a number to choose the corresponding task to run
+ 1. CBayes (may require increased heap space on yarn)
+ 2. BinaryCBayes
+ 3. clean -- cleans up the work area in /tmp/mahout-work-wiki
+ Enter your choice :
+
+Enter (2). This will download a large recent XML dump of the Wikipedia database into the ```/tmp/mahout-work-wiki``` directory, unzip it and place it into HDFS. It will run a [MapReduce job to parse the Wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom]. It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors. The script will also build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html). When it is complete, you should see a confusion matrix on your screen. For this tutorial, we will ignore the MapReduce model and build a new model using Spark, based on the vectorized text created by ```seq2sparse```.
+
+## Getting Started
+
+Launch the Mahout ```spark-shell```. There is an example script, ```spark-document-classifier.mscala``` (the ```.mscala``` extension denotes a Mahout-Scala script, which can be run much like an R script). We will walk through this script in this tutorial, but if you simply want to run it, you can issue the command:
+
+ mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala
+
+For now, let's take the script apart piece by piece.
+
+## Imports
+
+Our Mahout Naive Bayes imports:
+
+ import org.apache.mahout.classifier.naivebayes._
+ import org.apache.mahout.classifier.stats._
+ import org.apache.mahout.nlp.tfidf._
+
+The Hadoop imports needed to read our dictionary:
+
+ import org.apache.hadoop.io.Text
+ import org.apache.hadoop.io.IntWritable
+ import org.apache.hadoop.io.LongWritable
+
+## read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
+
+ val pathToData = "/tmp/mahout-work-wiki/"
+ val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
+
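+As a quick, optional sanity check (not part of ```spark-document-classifier.mscala```), ```nrow``` and ```ncol``` on the DRM give the number of labeled documents and the number of dictionary terms, respectively:
+
+    // number of documents (rows) and dictionary terms (columns) in the TF-IDF matrix
+    println("documents: " + fullData.nrow + ", terms: " + fullData.ncol)
+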
+## extract the category of each observation and aggregate those observations by category
+
+ val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
+
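+Optionally, print the label index to see the mapping from category name to the integer label used to key the aggregated observations; for the BinaryCBayes task it should contain the two Wikipedia categories extracted above:
+
+    // category name -> integer label (row of aggregatedObservations)
+    println(labelIndex)
+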
+## build a Multinomial Naive Bayes model and self-test on the training set
+
+ val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
+ val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
+ println(resAnalyzer)
+
+Printing the result analyzer will display the confusion matrix.
+
+## read in the dictionary and document frequency count from HDFS
+
+ val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
+ classOf[Text],
+ classOf[IntWritable])
+ val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count",
+ classOf[IntWritable],
+ classOf[LongWritable])
+
+ // set up the dictionary and document frequency count as maps
+ val dictionaryRDD = dictionary.map {
+ case (wKey, wVal) => wKey.asInstanceOf[Text]
+ .toString() -> wVal.get()
+ }
+
+ val documentFrequencyCountRDD = documentFrequencyCount.map {
+ case (wKey, wVal) => wKey.asInstanceOf[IntWritable]
+ .get() -> wVal.get()
+ }
+
+ val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
+ val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
+
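+As another quick check (not in the example script), the dictionary size should match the number of columns of ```fullData```. Note that ```seq2sparse``` stores the total document count under the key -1 in the df-count file, which is why ```vectorizeDocument``` below reads ```dfMap(-1)```:
+
+    // the dictionary maps each term to its column index; the -1 key of the
+    // document frequency map holds the total number of documents
+    println("dictionary size: " + dictionaryMap.size)
+    println("total documents: " + dfCountMap(-1))
+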
+## define a function to tokenize and vectorize new text using our current dictionary
+
+For this simple example, our function ```vectorizeDocument(...)``` will tokenize a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.); a sketch of the Lucene approach is given after the function below. Here, however, we will keep it simple and split our text using regexes and native String methods.
+
+ def vectorizeDocument(document: String,
+ dictionaryMap: Map[String,Int],
+ dfMap: Map[Int,Long]): Vector = {
+ val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ")
+ .toLowerCase
+ .split(" ")
+ .groupBy(identity)
+ .mapValues(_.length)
+ val vec = new RandomAccessSparseVector(dictionaryMap.size)
+ val totalDFSize = dfMap(-1)
+ val docSize = wordCounts.size
+ for (word <- wordCounts) {
+ val term = word._1
+ if (dictionaryMap.contains(term)) {
+ val tfidf: TermWeight = new TFIDF()
+ val termFreq = word._2
+ val dictIndex = dictionaryMap(term)
+ val docFreq = dfMap(dictIndex)
+ val currentTfIdf = tfidf.calculate(termFreq,
+ docFreq.toInt,
+ docSize,
+ totalDFSize.toInt)
+ vec.setQuick(dictIndex, currentTfIdf)
+ }
+ }
+ vec
+ }
+
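+As mentioned above, a [Lucene](https://lucene.apache.org/core/) analyzer could replace the regex-based tokenization. The sketch below is not part of the example script and assumes a recent Lucene version is on the classpath; it only shows how unigram tokens could be produced with Lucene's ```StandardAnalyzer```, with the resulting terms then counted and weighted exactly as in ```vectorizeDocument``` above.
+
+    import org.apache.lucene.analysis.standard.StandardAnalyzer
+    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
+    import scala.collection.mutable.ArrayBuffer
+
+    // hypothetical helper (not in the example script): tokenize a document with
+    // a Lucene analyzer instead of String.split; assumes a Lucene version with
+    // the no-arg StandardAnalyzer constructor
+    def luceneTokenize(document: String): Seq[String] = {
+      val analyzer = new StandardAnalyzer()
+      val stream = analyzer.tokenStream("body", document)
+      val termAttr = stream.addAttribute(classOf[CharTermAttribute])
+      val tokens = new ArrayBuffer[String]()
+      stream.reset()
+      while (stream.incrementToken()) {
+        tokens += termAttr.toString
+      }
+      stream.end()
+      stream.close()
+      analyzer.close()
+      tokens
+    }
+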
+## set up our classifier
+
+ val labelMap = model.labelIndex
+ val numLabels = model.numLabels
+ val reverseLabelMap = labelMap.map(x => x._2 -> x._1)
+
+ // instantiate the correct type of classifier
+ val classifier = model.isComplementary match {
+ case true => new ComplementaryNBClassifier(model)
+ case _ => new StandardNBClassifier(model)
+ }
+
+## define an argmax function
+
+The label with the highest score wins the classification for a given document.
+
+ def argmax(v: Vector): (Int, Double) = {
+ var bestIdx: Int = Integer.MIN_VALUE
+ var bestScore: Double = Integer.MIN_VALUE.toDouble
+ for(i <- 0 until v.size) {
+ if(v(i) > bestScore){
+ bestScore = v(i)
+ bestIdx = i
+ }
+ }
+ (bestIdx, bestScore)
+ }
+
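+For example, with Mahout's ```dvec``` helper (this call is not part of the example script and assumes the shell's default Mahout math imports):
+
+    // the largest element is at index 1 with value 0.7, so argmax returns (1, 0.7)
+    argmax(dvec(0.1, 0.7, 0.2))
+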
+## define our final TF(-IDF) vector classifier
+
+ def classifyDocument(clvec: Vector) : String = {
+ val cvec = classifier.classifyFull(clvec)
+ val (bestIdx, bestScore) = argmax(cvec)
+ reverseLabelMap(bestIdx)
+ }
+
+## Two sample news articles: United States Football and United Kingdom Football
+
+ // A random United States football article
+ // http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
+ val UStextToClassify = new String("(Reuters) - Super Bowl security officials acknowledge" +
+ " the NFL championship game represents a high profile target on a world stage but are" +
+ " unaware of any specific credible threats against Sunday's showcase. In advance of" +
+ " one of the world's biggest single day sporting events, Homeland Security Secretary" +
+ " Jeh Johnson was in Glendale on Wednesday to review security preparations and tour" +
+ " University of Phoenix Stadium where the Seattle Seahawks and New England Patriots" +
+ " will battle. Deadly shootings in Paris and arrest of suspects in Belgium, Greece and" +
+ " Germany heightened fears of more attacks around the world and social media accounts" +
+ " linked to Middle East militant groups have carried a number of threats to attack" +
+ " high-profile U.S. events. There is no specific credible threat, said Johnson, who" +
+ " has appointed a federal coordination team to work with local, state and federal" +
+ " agencies to ensure safety of fans, players and other workers associated with the" +
+ " Super Bowl. I'm confident we will have a safe and secure and successful event." +
+ " Sunday's game has been given a Special Event Assessment Rating (SEAR) 1 rating, the" +
+ " same as in previous years, except for the year after the Sept. 11, 2001 attacks, when" +
+ " a higher level was declared. But security will be tight and visible around Super" +
+ " Bowl-related events as well as during the game itself. All fans will pass through" +
+ " metal detectors and pat downs. Over 4,000 private security personnel will be deployed" +
+ " and the almost 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear" +
+ " device sniffing teams will be deployed and a network of Bio-Watch detectors will be" +
+ " set up to provide a warning in the event of a biological attack. The Department of" +
+ " Homeland Security (DHS) said in a press release it had held special cyber-security" +
+ " and anti-sniper training sessions. A U.S. official said the Transportation Security" +
+ " Administration, which is responsible for screening airline passengers, will add" +
+ " screeners and checkpoint lanes at airports. Federal air marshals, behavior detection" +
+ " officers and dog teams will help to secure transportation systems in the area. We" +
+ " will be ramping it (security) up on Sunday, there is no doubt about that, said Federal"+
+ " Coordinator Matthew Allen, the DHS point of contact for planning and support. I have" +
+ " every confidence the public safety agencies that represented in the planning process" +
+ " are going to have their best and brightest out there this weekend and we will have" +
+ " a very safe Super Bowl.")
+
+ // A random United Kingdom football article
+ // http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
+ val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship" +
+ " deal with online financial trading company Swissquote, expanding the commercial" +
+ " partnerships that have helped to make the English club one of the richest teams in" +
+ " world soccer. United did not give a value for the deal, the club's first in the sector," +
+ " but said on Monday it was a multi-year agreement. The Premier League club, 20 times" +
+ " English champions, claim to have 659 million followers around the globe, making the" +
+ " United name attractive to major brands like Chevrolet cars and sportswear group Adidas." +
+ " Swissquote said the global deal would allow it to use United's popularity in Asia to" +
+ " help it meet its targets for expansion in China. Among benefits from the deal," +
+ " Swissquote's clients will have a chance to meet United players and get behind the scenes" +
+ " at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that" +
+ " allows retail investors to buy and sell foreign exchange, equities, bonds and other asset" +
+ " classes. Like other retail FX brokers, Swissquote was left nursing losses on the Swiss" +
+ " franc after Switzerland's central bank stunned markets this month by abandoning its cap" +
+ " on the currency. The fallout from the abrupt move put rival and West Ham United shirt" +
+ " sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 million" +
+ " Swiss francs ($28 million) provision for its clients who were left out of pocket" +
+ " following the franc's surge. United's ability to grow revenues off the pitch has made" +
+ " them the second richest club in the world behind Spain's Real Madrid, despite a" +
+ " downturn in their playing fortunes. United Managing Director Richard Arnold said" +
+ " there was still lots of scope for United to develop sponsorships in other areas of" +
+ " business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
+ " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a" +
+ " huge potential still for the club, and the other thing we have seen is there is very" +
+ " significant growth even within categories. United have endured a tricky transition" +
+ " following the retirement of manager Alex Ferguson in 2013, finishing seventh in the" +
+ " Premier League last season and missing out on a place in the lucrative Champions League." +
+ " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" +
+ " Kelly; editing by Keith Weir)")
+
+## vectorize and classify our documents
+
+ val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
+ val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
+
+ println("Classifying the news article about superbowl security (united states)")
+ classifyDocument(usVec)
+
+ println("Classifying the news article about Manchester United (united kingdom)")
+ classifyDocument(ukVec)
+
+## tie everything together in a new method to classify new text
+
+    def classifyText(txt: String): String = {
+      val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
+      classifyDocument(v)
+    }
+
+## now we can simply call our classifyText method on any string
+
+ classifyText("Hello world from Queens")
+ classifyText("Hello world from London")
\ No newline at end of file