Posted to commits@mahout.apache.org by pa...@apache.org on 2015/04/26 17:49:24 UTC

svn commit: r1676117 - /mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext

Author: pat
Date: Sun Apr 26 15:49:23 2015
New Revision: 1676117

URL: http://svn.apache.org/r1676117
Log:
CMS commit to mahout by pat

Modified:
    mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext?rev=1676117&r1=1676116&r2=1676117&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext Sun Apr 26 15:49:23 2015
@@ -1,27 +1,32 @@
 #How to create an App using Mahout
 
-This is an example of how to create a simple app using Mahout as a Library. The source is available on Github in the [3-input-cooc project](https://github.com/pferrel/3-input-cooc) with more explanation about what it does. For this tutorial we'll concentrate on how to create an app.
+This is an example of how to create a simple app using Mahout as a library. The source is available on GitHub in the [3-input-cooc project](https://github.com/pferrel/3-input-cooc) with more explanation about what it does (it has to do with collaborative filtering). For this tutorial we'll concentrate on the app rather than the data science.
 
-This example is for reading three interactions types and creating indicators for them using cooccurrence and cross-cooccurrence. The indicators will be written to text files in a format ready for search engine indexing in recommender.
+The app reads in three user-item interaction types and creates indicators for them using cooccurrence and cross-cooccurrence. The indicators will be written to text files in a format ready for search engine indexing in a search-engine-based recommender.
 
 ##Setup
 In order to build and run the CooccurrenceDriver you need to install the following:
 
 * Install the Java 7 JDK from Oracle. Mac users look here: [Java SE Development Kit 7u72](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html).
-* Install sbt (simple build tool) 0.13.x for [Mac](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Mac.html),[Linux](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Linux.html) or [manual instalation](http://www.scala-sbt.org/release/tutorial/Manual-Installation.html).
-* Install [Mahout](http://mahout.apache.org/general/downloads.html). Don't forget to setup MAHOUT_HOME and MAHOUT_LOCAL
+* Install sbt (simple build tool) 0.13.x for [Mac](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Mac.html), [Linux](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Linux.html) or [manual installation](http://www.scala-sbt.org/release/tutorial/Manual-Installation.html).
+* Install [Spark 1.1.1](https://spark.apache.org/docs/1.1.1/spark-standalone.html). Don't forget to set up SPARK_HOME.
+* Install [Mahout 0.10.0](http://mahout.apache.org/general/downloads.html). Don't forget to set up MAHOUT_HOME and MAHOUT_LOCAL.
+
+Why install them if you are only using them as libraries? Certain binaries and scripts are required by the libraries to get information about the environment, such as discovering where jars are located.
+
+Spark requires one set of jars on the classpath for the client-side part of an app and another set of jars that must be passed to the Spark Context for running distributed code. The example should discover all the necessary classes automatically.
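+
+A minimal sbt build for such an app might declare the Mahout and Spark artifacts directly as library dependencies. The artifact names, Scala binary version, and version numbers below are assumptions based on the releases listed above, not taken from the project's own build:
+
+    // build.sbt (sketch): Mahout's Scala DSL, its Spark bindings, and Spark itself
+    name := "3-input-cooc"
+
+    scalaVersion := "2.10.4"
+
+    libraryDependencies ++= Seq(
+      "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
+      "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
+      "org.apache.spark" %% "spark-core" % "1.1.1"
+    )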
 
 ##Application
-Using Mahout as a library in an application will require a little Scala code. We have an App trait in Scala so we'll create an object, which inherits from ```App```
+Using Mahout as a library in an application will require a little Scala code. Scala has an App trait, so we'll create an object that inherits from ```App```.
 
 
     object CooccurrenceDriver extends App {
     }
     
 
-This will look a little different than Java since ```App``` does delayed initialization, which causes the main body to be executed when the App is launched, just as in Java you would create a CooccurrenceDriver.main.
+This will look a little different from Java since ```App``` does delayed initialization, which causes the body of the object to be executed when the App is launched, much as a main method would be in Java.
 
-Before we can execute something on Spark we'll need to create a context. We could use raw Spark calls here but default values are setup for a Mahout context.
+Before we can execute anything on Spark we'll need to create a context. We could use raw Spark calls here, but default values are set up for a Mahout context by using the Mahout helper function.
 
     implicit val mc = mahoutSparkContext(masterUrl = "local", 
       appName = "CooccurrenceDriver")
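+
+In case you are assembling this from scratch, the helper comes from Mahout's Spark bindings. A minimal sketch of the imports such a driver needs, assuming the Mahout 0.10.0 package layout:
+
+    // Mahout's Scala cooccurrence math and the IndexedDataset type
+    import org.apache.mahout.math.cf.SimilarityAnalysis
+    import org.apache.mahout.math.indexeddataset.IndexedDataset
+    // Spark bindings package object: provides mahoutSparkContext, SparkEngine and implicits
+    import org.apache.mahout.sparkbindings._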
@@ -30,7 +35,6 @@ We need to read in three files containin
 
 For example, here is data/purchase.csv:
 
-
     u1,iphone
     u1,ipad
     u2,nexus
@@ -39,9 +43,9 @@ For example, here is data/purchase.csv:
     u4,iphone
     u4,galaxy
 
-Mahout has a helper function that reads the text delimited in SparkEngine.indexedDatasetDFSReadElements. The function reads single elements in a distributed way to create the IndexedDataset. 
+Mahout has a helper function that reads the text-delimited files, SparkEngine.indexedDatasetDFSReadElements. The function reads single element tuples (user-id,item-id) in a distributed way to create the IndexedDataset. Distributed Row Matrices (DRM) and Vectors are important data types supplied by Mahout; an IndexedDataset is like a very lightweight Dataframe in R, wrapping a DRM with HashBiMaps for row and column IDs.
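+
+As a minimal, stand-alone sketch (not the project's full reading code, which also keeps each action's name and aligns the user dictionaries across actions), a single file could be read like this:
+
+    // sketch: read one comma-delimited (user-id,item-id) file into an IndexedDataset,
+    // relying on the helper's default element read schema and the implicit Mahout context
+    val purchases = SparkEngine.indexedDatasetDFSReadElements("data/purchase.csv")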
 
-Notice we read in all datasets before we adjust the number of rows in them to match the total number of users in the data. This is so the math works out even if some users took one action but not another.
+One important thing to note about this example is that we read in all datasets before we adjust the number of rows in them to match the total number of users in the data. This is so the math works out [(A'A, A'B, A'C)](http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) even if some users took one action but not another; there must be the same number of rows in all matrices.
 
     /**
      * Read files of element tuples and create IndexedDatasets one per action. These 
@@ -81,10 +85,9 @@ Notice we read in all datasets before we
 
 Now that we have the data read in we can perform the cooccurrence calculation.
 
-
-    // strip off names, method takes an array of IndexedDatasets
-    val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a => a._2))
-
+    // actions.map creates an array of just the IndexedDatasets
+    val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(
+      actions.map(a => a._2)) 
 
 All we need to do now is write the indicators.
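+
+In rough outline the write step pairs each indicator matrix back up with its source action and writes it as text with IndexedDataset's dfsWrite. The output paths and the zip with action names below are illustrative assumptions, not the project's exact code:
+
+    // sketch: label each indicator matrix with its action name and write it with the
+    // default text-delimited write schema; 'mc' is the implicit Mahout context
+    actions.map(_._1).zip(indicatorMatrices).foreach { case (actionName, indicators) =>
+      indicators.dfsWrite("data/indicators-" + actionName)
+    }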
 
@@ -163,11 +166,11 @@ Open IDEA and go to the menu File->New->
 
 At this point you may create a "Debug Configuration" to run. In the menu choose Run->Edit Configurations. Under "Default" choose "Application". In the dialog hit the ellipsis button "..." to the right of "Environment Variables" and fill in your versions of JAVA_HOME, SPARK_HOME, and MAHOUT_HOME. In the configuration editor under "Use classpath from" choose the root-3-input-cooc module.
 
-![image](http://mahout.apache.org/images/debug-config.png =400x)
+![image](http://mahout.apache.org/images/debug-config.png)
 
 Now choose "Application" in the left pane and hit the plus sign "+". Give the config a name and hit the ellipsis button to the right of the "Main class" field as shown.
 
-![image](http://mahout.apache.org/images/debug-config-2.png =600x)
+![image](http://mahout.apache.org/images/debug-config-2.png)
 
 
 After setting breakpoints you are now ready to debug the configuration. Go to the Run->Debug... menu and pick your configuration. This will execute using a local standalone instance of Spark.