You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/23 02:57:50 UTC

svn commit: r1675527 - /mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext

Author: apalumbo
Date: Thu Apr 23 00:57:50 2015
New Revision: 1675527

URL: http://svn.apache.org/r1675527
Log:
add a classify document from the shell tutorial

Added:
    mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext

Added: mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675527&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext Thu Apr 23 00:57:50 2015
@@ -0,0 +1,236 @@
+#Classifying a Document with the Mahout Shell
+
+This tutorial assumes that you have Spark configured for the ```spark-shell```; see [Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html).  We also assume that Mahout is running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable unset) so that the output is placed in HDFS.
+
+## Downloading and Vectorizing the Wikipedia Dataset
+*As of Mahout v0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to extract and vectorize our text.  A* [*Spark implementation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is in the works for Mahout v0.11.*  However, to download the Wikipedia dataset, extract the bodies of the documents, label each document and vectorize the text into TF-IDF vectors, we can simply run the [classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example, which will prompt you with the following menu:
+
+    Please select a number to choose the corresponding task to run
+    1. CBayes (may require increased heap space on yarn)
+    2. BinaryCBayes
+    3. clean -- cleans up the work area in /tmp/mahout-work-wiki
+    Enter your choice :
+
+Enter (2). This will download a large recent XML dump of the Wikipedia database into a ```/tmp/mahout-work-wiki``` directory, unzip it and place it into HDFS.  It will run a [MapReduce job to parse the Wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html), extracting and labeling only pages with category tags for [United States] and [United Kingdom]. It will then run ```mahout seq2sparse``` to convert the documents into TF-IDF vectors.  The script will also build and test a [Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html).  When it completes, you should see a confusion matrix on your screen.  For this tutorial, we will ignore the MapReduce model and build a new model using Spark, based on the vectorized data created by ```seq2sparse```.
+
+## Getting Started
+
+Launch the Mahout ```spark-shell```.  There is an example script: ```spark-document-classifier.mscala``` (`.mscala` denotes a Mahout-Scala script which can be run similarly to an R script).  We will walk through this script in this tutorial, but if you want to simply run it, you can issue the command:
+
+    mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala
+
+For now, let's take the script apart piece by piece.
+
+## Imports
+
+Our Mahout Naive Bayes imports:
+
+    import org.apache.mahout.classifier.naivebayes._
+    import org.apache.mahout.classifier.stats._
+    import org.apache.mahout.nlp.tfidf._
+
+Hadoop imports needed to read our dictionary:
+
+    import org.apache.hadoop.io.Text
+    import org.apache.hadoop.io.IntWritable
+    import org.apache.hadoop.io.LongWritable
+
+## read in our full set from HDFS as vectorized by seq2sparse in classify-wikipedia.sh
+
+    val pathToData = "/tmp/mahout-work-wiki/"
+    val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
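+
+As a quick (and optional) sanity check, a DRM exposes its dimensions, so we can confirm that the TF-IDF matrix was loaded; this check is illustrative and not part of the original script:
+
+    // number of documents (rows) and terms (columns) in the TF-IDF matrix
+    println("documents: " + fullData.nrow + ", terms: " + fullData.ncol)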
+
+## extract the category of each observation and aggregate those observations by category
+
+    val (labelIndex, aggregatedObservations) = SparkNaiveBayes.extractLabelsAndAggregateObservations(fullData)
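+
+If you are curious, you can (optionally) peek at what was extracted; this is not part of the original script.  The label index maps each category name to a row of the aggregated matrix, so there is one aggregated row per label:
+
+    // print the label-to-row mapping and the number of aggregated rows
+    println(labelIndex)
+    println("aggregated rows: " + aggregatedObservations.nrow)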
+
+## build a Multinomial Naive Bayes model and self-test on the training set
+
+    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, false)
+    val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
+    println(resAnalyzer)
+    
+Printing the result analyzer will display the confusion matrix.
+
+## read in the dictionary and document frequency count from HDFS
+    
+    val dictionary = sdc.sequenceFile(pathToData + "wikipediaVecs/dictionary.file-0",
+                                      classOf[Text],
+                                      classOf[IntWritable])
+    val documentFrequencyCount = sdc.sequenceFile(pathToData + "wikipediaVecs/df-count",
+                                                  classOf[IntWritable],
+                                                  classOf[LongWritable])
+
+    // set up the dictionary and document frequency count as maps
+    val dictionaryRDD = dictionary.map { 
+                                    case (wKey, wVal) => wKey.asInstanceOf[Text]
+                                                             .toString() -> wVal.get() 
+                                       }
+                                       
+    val documentFrequencyCountRDD = documentFrequencyCount.map {
+                                            case (wKey, wVal) => wKey.asInstanceOf[IntWritable]
+                                                                     .get() -> wVal.get() 
+                                                               }
+    
+    val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> x._2.toInt).toMap
+    val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> x._2.toLong).toMap
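+
+As another optional check, we can look up a single term in these maps; the term used here is hypothetical and may not be present in your dictionary, so the lookup is guarded.  This snippet is not part of the original script:
+
+    // hypothetical spot check: dictionary index and document frequency of "london", if present
+    dictionaryMap.get("london") match {
+        case Some(idx) => println("index: " + idx + ", document frequency: " + dfCountMap(idx))
+        case None => println("term not in dictionary")
+    }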
+
+## define a function to tokenize and vectorize new text using our current dictionary
+
+For this simple example, our function ```vectorizeDocument(...)``` will tokenize a new document into unigrams using native Java String methods and vectorize it using our dictionary and document frequencies. You could also use a [Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., and integrate Apache [Tika](https://tika.apache.org/) to extract text from different document types (PDF, PPT, XLS, etc.).  Here, however, we will keep it simple and split our text using regexes and native String methods.
+
+    def vectorizeDocument(document: String,
+                            dictionaryMap: Map[String,Int],
+                            dfMap: Map[Int,Long]): Vector = {
+        val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ")
+                                    .toLowerCase
+                                    .split(" ")
+                                    .groupBy(identity)
+                                    .mapValues(_.length)         
+        val vec = new RandomAccessSparseVector(dictionaryMap.size)
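+        // by seq2sparse convention, the total document count is stored in the df map under the key -1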
+        val totalDFSize = dfMap(-1)
+        val docSize = wordCounts.size
+        for (word <- wordCounts) {
+            val term = word._1
+            if (dictionaryMap.contains(term)) {
+                val tfidf: TermWeight = new TFIDF()
+                val termFreq = word._2
+                val dictIndex = dictionaryMap(term)
+                val docFreq = dfMap(dictIndex)
+                val currentTfIdf = tfidf.calculate(termFreq,
+                                                   docFreq.toInt,
+                                                   docSize,
+                                                   totalDFSize.toInt)
+                vec.setQuick(dictIndex, currentTfIdf)
+            }
+        }
+        vec
+    }
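+
+As mentioned above, a [Lucene](https://lucene.apache.org/core/) analyzer could replace the regex-based tokenization.  The following is only a rough sketch, not part of the original script: it assumes a Lucene version whose ```StandardAnalyzer``` has a no-argument constructor and that the Lucene analyzer jars are on the shell's classpath:
+
+    import org.apache.lucene.analysis.standard.StandardAnalyzer
+    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
+    import scala.collection.mutable.ArrayBuffer
+
+    // rough sketch (hypothetical helper): tokenize a document into unigrams with Lucene's StandardAnalyzer
+    def luceneTokenize(text: String): Seq[String] = {
+        val analyzer = new StandardAnalyzer()
+        val stream = analyzer.tokenStream("body", text)
+        val termAttribute = stream.addAttribute(classOf[CharTermAttribute])
+        val tokens = new ArrayBuffer[String]()
+        stream.reset()
+        while (stream.incrementToken()) {
+            tokens += termAttribute.toString
+        }
+        stream.end()
+        stream.close()
+        tokens
+    }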
+
+## set up our classifier
+
+    val labelMap = model.labelIndex
+    val numLabels = model.numLabels
+    val reverseLabelMap = labelMap.map(x => x._2 -> x._1)
+    
+    // instantiate the correct type of classifier
+    val classifier = model.isComplementary match {
+        case true => new ComplementaryNBClassifier(model)
+        case _ => new StandardNBClassifier(model)
+    }
+
+## define an argmax function 
+
+The label with the highest score wins the classification for a given document.
+    
+    def argmax(v: Vector): (Int, Double) = {
+        var bestIdx: Int = Integer.MIN_VALUE
+        var bestScore: Double = Integer.MIN_VALUE.toDouble
+        for(i <- 0 until v.size) {
+            if(v(i) > bestScore){
+                bestScore = v(i)
+                bestIdx = i
+            }
+        }
+        (bestIdx, bestScore)
+    }
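+
+For example, using the ```dvec``` helper from Mahout's Scala bindings (already imported in the shell) to build a small dense vector, ```argmax``` returns the index and value of the largest element:
+
+    // returns (1, 0.7): index 1 holds the largest value
+    argmax(dvec(0.1, 0.7, 0.2))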
+
+## define our final TF(-IDF) vector classifier
+
+    def classifyDocument(clvec: Vector) : String = {
+        val cvec = classifier.classifyFull(clvec)
+        val (bestIdx, bestScore) = argmax(cvec)
+        reverseLabelMap(bestIdx)
+    }
+
+## Two sample news articles: United States Football and United Kingdom Football
+    
+    // A random United States football article
+    // http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
+    val UStextToClassify = new String("(Reuters) - Super Bowl security officials acknowledge" +
+        " the NFL championship game represents a high profile target on a world stage but are" +
+        " unaware of any specific credible threats against Sunday's showcase. In advance of" +
+        " one of the world's biggest single day sporting events, Homeland Security Secretary" +
+        " Jeh Johnson was in Glendale on Wednesday to review security preparations and tour" +
+        " University of Phoenix Stadium where the Seattle Seahawks and New England Patriots" +
+        " will battle. Deadly shootings in Paris and arrest of suspects in Belgium, Greece and" +
+        " Germany heightened fears of more attacks around the world and social media accounts" +
+        " linked to Middle East militant groups have carried a number of threats to attack" +
+        " high-profile U.S. events. There is no specific credible threat, said Johnson, who" + 
+        " has appointed a federal coordination team to work with local, state and federal" +
+        " agencies to ensure safety of fans, players and other workers associated with the" + 
+        " Super Bowl. I'm confident we will have a safe and secure and successful event." +
+        " Sunday's game has been given a Special Event Assessment Rating (SEAR) 1 rating, the" +
+        " same as in previous years, except for the year after the Sept. 11, 2001 attacks, when" +
+        " a higher level was declared. But security will be tight and visible around Super" +
+        " Bowl-related events as well as during the game itself. All fans will pass through" +
+        " metal detectors and pat downs. Over 4,000 private security personnel will be deployed" +
+        " and the almost 3,000 member Phoenix police force will be on Super Bowl duty. Nuclear" +
+        " device sniffing teams will be deployed and a network of Bio-Watch detectors will be" +
+        " set up to provide a warning in the event of a biological attack. The Department of" +
+        " Homeland Security (DHS) said in a press release it had held special cyber-security" +
+        " and anti-sniper training sessions. A U.S. official said the Transportation Security" +
+        " Administration, which is responsible for screening airline passengers, will add" +
+        " screeners and checkpoint lanes at airports. Federal air marshals, behavior detection" +
+        " officers and dog teams will help to secure transportation systems in the area. We" +
+        " will be ramping it (security) up on Sunday, there is no doubt about that, said Federal"+
+        " Coordinator Matthew Allen, the DHS point of contact for planning and support. I have" +
+        " every confidence the public safety agencies that represented in the planning process" +
+        " are going to have their best and brightest out there this weekend and we will have" +
+        " a very safe Super Bowl.")
+    
+    // A random United Kingdom football article
+    // http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
+    val UKtextToClassify = new String("(Reuters) - Manchester United have signed a sponsorship" +
+        " deal with online financial trading company Swissquote, expanding the commercial" +
+        " partnerships that have helped to make the English club one of the richest teams in" +
+        " world soccer. United did not give a value for the deal, the club's first in the sector," +
+        " but said on Monday it was a multi-year agreement. The Premier League club, 20 times" +
+        " English champions, claim to have 659 million followers around the globe, making the" +
+        " United name attractive to major brands like Chevrolet cars and sportswear group Adidas." +
+        " Swissquote said the global deal would allow it to use United's popularity in Asia to" +
+        " help it meet its targets for expansion in China. Among benefits from the deal," +
+        " Swissquote's clients will have a chance to meet United players and get behind the scenes" +
+        " at the Old Trafford stadium. Swissquote is a Geneva-based online trading company that" +
+        " allows retail investors to buy and sell foreign exchange, equities, bonds and other asset" +
+        " classes. Like other retail FX brokers, Swissquote was left nursing losses on the Swiss" +
+        " franc after Switzerland's central bank stunned markets this month by abandoning its cap" +
+        " on the currency. The fallout from the abrupt move put rival and West Ham United shirt" +
+        " sponsor Alpari UK into administration. Swissquote itself was forced to book a 25 million" +
+        " Swiss francs ($28 million) provision for its clients who were left out of pocket" +
+        " following the franc's surge. United's ability to grow revenues off the pitch has made" +
+        " them the second richest club in the world behind Spain's Real Madrid, despite a" +
+        " downturn in their playing fortunes. United Managing Director Richard Arnold said" +
+        " there was still lots of scope for United to develop sponsorships in other areas of" +
+        " business. The last quoted statistics that we had showed that of the top 25 sponsorship" +
+        " categories, we were only active in 15 of those, Arnold told Reuters. I think there is a" +
+        " huge potential still for the club, and the other thing we have seen is there is very" +
+        " significant growth even within categories. United have endured a tricky transition" +
+        " following the retirement of manager Alex Ferguson in 2013, finishing seventh in the" +
+        " Premier League last season and missing out on a place in the lucrative Champions League." +
+        " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional reporting by Jemima" + 
+        " Kelly; editing by Keith Weir)")
+
+## vectorize and classify our documents
+
+    val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
+    val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
+    
+    println("Classifying the news article about superbowl security (united states)")
+    classifyDocument(usVec)
+    
+    println("Classifying the news article about Manchester United (united kingdom)")
+    classifyDocument(ukVec)
+
+## tie everything together in a new method to classify new text 
+    
+    def classifyText(txt: String): String = {
+        val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
+        classifyDocument(v)
+    }
+
+## now we can simply call our classifyText method on any string
+
+    classifyText("Hello world from Queens")
+    classifyText("Hello world from London")
\ No newline at end of file