Posted to commits@mahout.apache.org by pa...@apache.org on 2015/04/21 02:22:43 UTC

svn commit: r1675009 - /mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext

Author: pat
Date: Tue Apr 21 00:22:43 2015
New Revision: 1675009

URL: http://svn.apache.org/r1675009
Log:
CMS commit to mahout by pat

Modified:
    mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext?rev=1675009&r1=1675008&r2=1675009&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext Tue Apr 21 00:22:43 2015
@@ -14,144 +14,135 @@ In order to build and run the Cooccurren
 ##Application
 Using Mahout as a library in an application requires a little Scala code. Scala has an ```App``` trait, so we'll create an object that inherits from ```App```.
 
-```
-object CooccurrenceDriver extends App {
-}
-```
+
+    object CooccurrenceDriver extends App {
+    }
+    
+
 This looks a little different from Java because ```App``` uses delayed initialization: the main body of the object is executed when the App is launched, just as in Java you would put that code in a CooccurrenceDriver.main method.
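
For comparison, here is a minimal sketch of the same driver written with an explicit ```main``` method instead of the ```App``` trait (the body shown is only a placeholder):

    // hypothetical Java-style entry point; the statements that live in the
    // object body when extending App would go inside main instead
    object CooccurrenceDriver {
      def main(args: Array[String]): Unit = {
        // driver code goes here
      }
    }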
 
 Before we can execute something on Spark we'll need to create a context. We could use raw Spark calls here, but default values are set up for a Mahout context.
 
-```
-implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "2-input-cooc")
-```
+
+    implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "2-input-cooc")
+    
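
The ```masterUrl = "local"``` setting runs Spark in-process, which is convenient for the example. To run against a real cluster you would pass that cluster's master URL instead; a sketch with placeholder host and port:

    // hypothetical cluster configuration; substitute your Spark master's host and port
    implicit val mc = mahoutSparkContext(
      masterUrl = "spark://your-spark-master:7077",
      appName = "2-input-cooc")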
 We need to read in three files containing different interaction types. The files will each be read into a Mahout IndexedDataset. This allows us to preserve application-specific user and item IDs throughout the calculations.
 
 For example, here is data/purchase.csv:
 
-```
-u1,iphone
-u1,ipad
-u2,nexus
-u2,galaxy
-u3,surface
-u4,iphone
-u4,galaxy
 
-```
+    u1,iphone
+    u1,ipad
+    u2,nexus
+    u2,galaxy
+    u3,surface
+    u4,iphone
+    u4,galaxy
+
 Mahout has a helper function, SparkEngine.indexedDatasetDFSReadElements, that reads text-delimited element files. The function reads individual elements in a distributed way to create the IndexedDataset.
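
As a standalone sketch, reading a single file looks like this (the empty dictionary is illustrative; ```readActions``` below threads a shared user dictionary through instead):

    // minimal single-file read, assuming the implicit Mahout context created above
    val purchases: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
      "data/purchase.csv",
      schema = DefaultIndexedDatasetElementReadSchema,
      existingRowIDs = HashBiMap.create[String, Int]())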
 
 Notice we read in all datasets before we adjust the number of rows in them to match the total number of users in the data. This is so the math works out even if some users took one action but not another.
 
-```
-/**
- * Read files of element tuples and create IndexedDatasets one per action. These share a userID BiMap but have
- * their own itemID BiMaps
- */
-def readActions(actionInput: Array[(String, String)]): Array[(String, IndexedDataset)] = {
-  var actions = Array[(String, IndexedDataset)]()
-
-  val userDictionary: BiMap[String, Int] = HashBiMap.create()
-
-  // The first action named in the sequence is the "primary" action and 
-  // begins to fill up the user dictionary
-  for ( actionDescription <- actionInput ) {// grab the path to actions
-    val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-      actionDescription._2,
-      schema = DefaultIndexedDatasetElementReadSchema,
-      existingRowIDs = userDictionary)
-    userDictionary.putAll(action.rowIDs)
-    // put the name in the tuple with the indexedDataset
-    actions = actions :+ (actionDescription._1, action) 
-  }
-
-  // After all actions are read in the userDictonary will contain every user seen, 
-  // even if they may not have taken all actions . Now we adjust the row rank of 
-  // all IndxedDataset's to have this number of rows
-  // Note: this is very important or the cooccurrence calc may fail
-  val numUsers = userDictionary.size() // one more than the cardinality
-
-  val resizedNameActionPairs = actions.map { a =>
-    //resize the matrix by, in effect by adding empty rows
-    val resizedMatrix = a._2.create(a._2.matrix, userDictionary, a._2.columnIDs).newRowCardinality(numUsers)
-    (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
-  }
-  resizedNameActionPairs // return the array of Tuples
-}
+    /**
+     * Read files of element tuples and create IndexedDatasets, one per action. These share a userID BiMap but have
+     * their own itemID BiMaps
+     */
+    def readActions(actionInput: Array[(String, String)]): Array[(String, IndexedDataset)] = {
+      var actions = Array[(String, IndexedDataset)]()
+
+      val userDictionary: BiMap[String, Int] = HashBiMap.create()
+
+      // The first action named in the sequence is the "primary" action and 
+      // begins to fill up the user dictionary
+      for ( actionDescription <- actionInput ) { // each actionDescription is a (name, path) pair
+        val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
+          actionDescription._2,
+          schema = DefaultIndexedDatasetElementReadSchema,
+          existingRowIDs = userDictionary)
+        userDictionary.putAll(action.rowIDs)
+        // put the name in the tuple with the indexedDataset
+        actions = actions :+ (actionDescription._1, action) 
+      }
+
+      // After all actions are read in, the userDictionary will contain every user seen,
+      // even if they did not take every action. Now we adjust the row rank of
+      // all IndexedDatasets to have this number of rows.
+      // Note: this is very important or the cooccurrence calc may fail
+      val numUsers = userDictionary.size() // one more than the cardinality
+
+      val resizedNameActionPairs = actions.map { a =>
+        // resize the matrix, in effect, by adding empty rows
+        val resizedMatrix = a._2.create(a._2.matrix, userDictionary, a._2.columnIDs).newRowCardinality(numUsers)
+        (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
+      }
+      resizedNameActionPairs // return the array of Tuples
+    }
 
-```
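
Calling ```readActions``` might look like the following; the action names and the file paths other than data/purchase.csv are assumptions, and the first entry is treated as the primary action:

    // hypothetical (name, path) pairs; the primary action comes first
    val actionInput = Array(
      ("purchase", "data/purchase.csv"),
      ("view", "data/view.csv"),
      ("category", "data/category.csv"))
    val actions = readActions(actionInput)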
 
 Now that we have the data read in we can perform the cooccurrence calculation.
 
-```
-// strip off names, which only takes and array of IndexedDatasets
-val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a => a._2))
 
-```
+    // strip off the names; cooccurrencesIDSs takes only an array of IndexedDatasets
+    val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a => a._2))
+
 
 All we need to do now is write the indicators.
 
-```
-// zip a pair of arrays into an array of pairs, reattaching the action names
-val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
+    // zip a pair of arrays into an array of pairs, reattaching the action names
+    val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
    writeIndicators(indicatorDescriptions)
-```
+
 
 The ```writeIndicators``` method uses the default write function ```dfsWrite```.
 
-```
-/**
- * Write indicatorMatrices to the output dir in the default format
- */
-def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
-  for (indicator <- indicators ) {
-    val indicatorDir = OutputPath + indicator._1
-    indicator._2.dfsWrite(
-      indicatorDir, // do we have to remove the last $ char?
-      // omit LLR strengths and format for search engine indexing
-      IndexedDatasetWriteBooleanSchema) 
-  }
-}
+    /**
+     * Write indicatorMatrices to the output dir in the default format
+     */
+    def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
+      for (indicator <- indicators ) {
+        val indicatorDir = OutputPath + indicator._1
+        indicator._2.dfsWrite(
+          indicatorDir, // do we have to remove the last $ char?
+          // omit LLR strengths and format for search engine indexing
+          IndexedDatasetWriteBooleanSchema) 
+      }
+    }
  
-```
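
The ```writeIndicators``` method refers to an ```OutputPath``` value that is defined elsewhere in the driver; a minimal sketch (the directory is an assumption):

    // hypothetical output location for the indicator matrices
    val OutputPath = "data/indicators/"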
 
 See the GitHub project for the full source. Now we create a ```build.sbt``` to build the example.
 
-```
-name := "cooccurrence-driver"
+    name := "cooccurrence-driver"
 
-organization := "com.finderbots"
+    organization := "com.finderbots"
 
-version := "0.1"
+    version := "0.1"
 
-scalaVersion := "2.10.4"
+    scalaVersion := "2.10.4"
 
-val sparkVersion = "1.1.1"
+    val sparkVersion = "1.1.1"
 
-libraryDependencies ++= Seq(
-  "log4j" % "log4j" % "1.2.17",
-  // Mahout's Spark code
-  "commons-io" % "commons-io" % "2.4",
-  "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-math" % "0.10.0",
-  "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
-  // Google collections, AKA Guava
-  "com.google.guava" % "guava" % "16.0")
+    libraryDependencies ++= Seq(
+      "log4j" % "log4j" % "1.2.17",
+      // Mahout's Spark code
+      "commons-io" % "commons-io" % "2.4",
+      "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
+      "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
+      "org.apache.mahout" % "mahout-math" % "0.10.0",
+      "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
+      // Google collections, AKA Guava
+      "com.google.guava" % "guava" % "16.0")
 
-resolvers += "typesafe repo" at " http://repo.typesafe.com/typesafe/releases/"
+    resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
 
-resolvers += Resolver.mavenLocal
+    resolvers += Resolver.mavenLocal
 
-packSettings
+    packSettings
 
-packMain := Map(
-  "cooc" -> "CooccurrenceDriver"
-)
+    packMain := Map(
+      "cooc" -> "CooccurrenceDriver")
 
-```
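
The ```packSettings``` and ```packMain``` keys come from the sbt-pack plugin, so the plugin must also be enabled for the project. A sketch of project/plugins.sbt (the plugin version is an assumption):

    // project/plugins.sbt -- provides the pack* settings used in build.sbt above
    addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.9")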
 
 ##Build
 Building the examples from the project's root folder: