Posted to commits@mahout.apache.org by ak...@apache.org on 2017/03/21 15:34:58 UTC

[1/2] mahout git commit: [MAHOUT-1956] Update README.md to include build instructions and an example for GPU/GPU/JVM

Repository: mahout
Updated Branches:
  refs/heads/master 2367fde82 -> 1199e5117


[MAHOUT-1956] Update README.md to include build instructions and an example for GPU/GPU/JVM


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/9eb68e53
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/9eb68e53
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/9eb68e53

Branch: refs/heads/master
Commit: 9eb68e5319ad6f7692779e566206a159f98be8b9
Parents: 0496194
Author: Andrew Palumbo <ap...@apache.org>
Authored: Sun Mar 19 14:33:04 2017 -0700
Committer: Andrew Palumbo <ap...@apache.org>
Committed: Sun Mar 19 14:33:04 2017 -0700

----------------------------------------------------------------------
 README.md | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 196 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mahout/blob/9eb68e53/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 1b058ae..c51db40 100644
--- a/README.md
+++ b/README.md
@@ -30,11 +30,6 @@ export MAHOUT_LOCAL=true # for running standalone on your dev machine,
 ```
 You will need a `$JAVA_HOME`, and if you are running on Spark, you will also need `$SPARK_HOME`
 
-Note when running the spark-shell job it can help to set some JVM options so you don't run out of memory:
-```
-$MAHOUT_OPTS="-Xmx6g -XX:MaxPermSize=512m" mahout spark-shell
-```
-
 ####Using Mahout as a Library
 Running any application that uses Mahout will require installing a binary or source version and setting the environment.
 To compile from source:
@@ -61,25 +56,214 @@ and a dependency for back end engine translation, e.g:
     <version>${mahout.version}</version>
 </dependency>
 ```
+####Building From Source
+
+######Prerequisites:
+
+Linux environment (preferably Ubuntu 16.04.x). Note: currently only the JVM-only build will work on a Mac.
+gcc > 4.x
+NVIDIA Card (installed with OpenCL drivers alongside usual GPU drivers)
+
+######Downloads
+
+Install Java 1.7+ in an easily accessible directory (for this example, ~/java/)
+http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
+    
+Create a directory ~/apache/ .
+    
+Download Apache Maven 3.3.9 and un-tar/gunzip to ~/apache/apache-maven-3.3.9/ .
+https://maven.apache.org/download.cgi
+        
+Download and un-tar/gunzip Hadoop 2.4.1 to ~/apache/hadoop-2.4.1/ .
+https://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/    
+
+Download and un-tar/gunzip spark-1.6.3-bin-hadoop2.4 to  ~/apache/ .
+http://spark.apache.org/downloads.html
+Choose release: Spark-1.6.3 (Nov 07 2016)
+Choose package type: Pre-Built for Hadoop 2.4
+
+Install ViennaCL 1.7.0+
+If running Ubuntu 16.04+
+
 ```
-<dependency>
-    <groupId>org.apache.mahout</groupId>
-    <artifactId>mahout-flink_2.10</artifactId>
-    <version>${mahout.version}</version>
-</dependency>
+sudo apt-get install libviennacl-dev
+```
+
+Otherwise, if your distribution's package manager does not have a viennacl-dev package at version 1.7.0 or later, clone it directly into the directory which will be included when Mahout is compiled:
+
+```
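+mkdir -p ~/tmp
+cd ~/tmp && git clone https://github.com/viennacl/viennacl-dev.git
+# the viennacl/ and CL/ header directories live inside the cloned viennacl-dev directory
+cd viennacl-dev
+cp -r viennacl/ /usr/local/
+cp -r CL/ /usr/local/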
+```
+
+Ensure that the OpenCL 1.2+ drivers are installed (packaged with most consumer-grade NVIDIA drivers). Whether higher-end cards ship with them has not been verified.
+
+Clone the Mahout repository into `~/apache`.
+
 ```
+git clone https://github.com/apache/mahout.git
+```
+
+######CONFIGURATION
+
+When building Mahout for a Spark backend, we need four system environment variables set:
+```
+    export MAHOUT_HOME=/home/<user>/apache/mahout
+    export HADOOP_HOME=/home/<user>/apache/hadoop-2.4.1
+    export SPARK_HOME=/home/<user>/apache/spark-1.6.3-bin-hadoop2.4    
+    export JAVA_HOME=/home/<user>/java/jdk-1.8.121
+```
+
+Mahout on Spark regularly uses one more environment variable, `MASTER`: the address of the Spark cluster's master node (usually the node one is logged into) or a local-mode setting.
+
+To use 4 local cores (Spark master need not be running)
+```
+export MASTER=local[4]
+```
+To use all available local cores (again, Spark master need not be running)
+```
+export MASTER=local[*]
+```
+To point to a cluster with spark running: 
+```
+export MASTER=spark://master.ip.address:7077
+```
+
+We then add these to the path:
+
+```
+   PATH=$PATH:$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin:$JAVA_HOME/bin
+```
+
+These should be added to your ~/.bashrc file.
+
+
+######BUILDING MAHOUT WITH APACHE MAVEN
+
+Currently Mahout has 3 builds. From the $MAHOUT_HOME directory we may issue the commands to build each using mvn profiles.
+
+JVM only:
+```
+mvn clean install -DskipTests
+```
+
+JVM with native OpenMP for level 2 and level 3 matrix/vector multiplication:
+```
+mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests
+```
+JVM with native OpenMP and OpenCL for level 2 and level 3 matrix/vector multiplication (GPU errors fall back to OpenMP; currently only a single GPU per node is supported):
+```
+mvn clean install -Pviennacl -Phadoop2 -DskipTests
+```
+
+####TESTING THE MAHOUT ENVIRONMENT
+
+Mahout provides an extension to the spark-shell, which is good for getting to know the language, testing partition loads, prototyping algorithms, etc.
+
+To launch the shell in local mode with 2 threads, run the following:
+```
+$ MASTER=local[2] mahout spark-shell
+```
+
+After a very verbose startup, a Mahout welcome screen will appear:
+
+```
+Loading /home/andy/sandbox/apache-mahout-distribution-0.13.0/bin/load-shell.scala...
+import org.apache.mahout.math._
+import org.apache.mahout.math.scalabindings._
+import org.apache.mahout.math.drm._
+import org.apache.mahout.math.scalabindings.RLikeOps._
+import org.apache.mahout.math.drm.RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = org.apache.mahout.sparkbindings.SparkDistributedContext@3ca1f0a4
+
+                _                 _
+_ __ ___   __ _| |__   ___  _   _| |_
+ '_ ` _ \ / _` | '_ \ / _ \| | | | __|
+ | | | | (_| | | | | (_) | |_| | |_
+_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.13.0
+
+
+That file does not exist
+
+
+scala>
+```
+
+At the `scala>` prompt, enter:
+
+```
+scala> :load /home/<andy>/apache/mahout/examples/bin/SparseSparseDrmTimer.mscala
+```
+
+This loads a matrix multiplication timer function definition. To run the matrix timer:
+
+```
+scala> timeSparseDRMMMul(1000,1000,1000,1,.02,1234L)
+    {...} res3: Long = 16321
+```
+
+We can see that the JVM-only version is rather slow, hence our motivation for GPU and native multithreading support.
+
+To get an idea of what's going on under the hood of the timer, we may examine the .mscala (Mahout Scala) code, which is both fully functional Scala and the Mahout R-like DSL for tensor algebra:
+
+```
+def timeSparseDRMMMul(m: Int, n: Int, s: Int, para: Int, pctDense: Double = .20, seed: Long = 1234L): Long = {
+  val drmA = drmParallelizeEmpty(m , s, para).mapBlock(){
+       case (keys,block:Matrix) =>
+           val R =  scala.util.Random
+           R.setSeed(seed)
+           val blockB = new SparseRowMatrix(block.nrow, block.ncol)
+           blockB := {x => if (R.nextDouble > pctDense) R.nextDouble else x }
+       (keys -> blockB)
+  }
+
+  val drmB = drmParallelizeEmpty(s , n, para).mapBlock(){
+       case (keys,block:Matrix) =>
+           val R =  scala.util.Random
+           R.setSeed(seed + 1)
+           val blockB = new SparseRowMatrix(block.nrow, block.ncol)
+           blockB := {x => if (R.nextDouble > pctDense) R.nextDouble else x }
+       (keys -> blockB)
+  }
+
+  var time = System.currentTimeMillis()
+
+  val drmC = drmA %*% drmB
+ 
+  // trigger computation
+  drmC.numRows()
+
+  time = System.currentTimeMillis() - time
+
+  time  
+ 
+}
+```
+
+For more information please see the following references:
+
+http://mahout.apache.org/users/environment/in-core-reference.html
+http://mahout.apache.org/users/environment/out-of-core-reference.html
+http://mahout.apache.org/users/sparkbindings/play-with-shell.html
+http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
+
+
+
 
 Note that due to an intermittent out-of-memory bug in a Flink test we have disabled it from the binary releases. To use Flink please uncomment the line in the root pom.xml in the `<modules>` block so it reads `<module>flink</module>`.
 
-####Examples
+####EXAMPLES
 For examples of how to use Mahout, see the examples directory located in `examples/bin`
 
 
 For information on how to contribute, visit the [How to Contribute Page](https://mahout.apache.org/developers/how-to-contribute.html)
   
 
-####Legal
+####LEGAL
 Please see the `NOTICE.txt` included in this directory for more information.
 
 [![Build Status](https://api.travis-ci.org/apache/mahout.svg?branch=master)](https://travis-ci.org/apache/mahout)
+<!--
 [![Coverage Status](https://coveralls.io/repos/github/apache/mahout/badge.svg?branch=master)](https://coveralls.io/github/apache/mahout?branch=master)
+-->
\ No newline at end of file
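The CONFIGURATION section added in the README diff above asks for four environment variables before building against Spark. A minimal shell sketch of a pre-build check might look like the following. The function name `check_mahout_env` is hypothetical and not part of Mahout; it only tests that each variable is set and names an existing directory.

```shell
# Hypothetical helper, not shipped with Mahout: report which of the four
# variables named in the README are unset or do not point at a directory.
check_mahout_env() {
  missing=0
  for name in MAHOUT_HOME HADOOP_HOME SPARK_HOME JAVA_HOME; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      missing=1
    elif [ ! -d "$value" ]; then
      echo "not a directory: $name=$value"
      missing=1
    fi
  done
  # Returns 0 when every variable is set to an existing directory, 1 otherwise.
  return "$missing"
}
```

Calling `check_mahout_env` after sourcing `~/.bashrc` prints one line per problem and returns a non-zero status if anything is missing, so it can gate a build script.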

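The three Maven invocations listed under BUILDING MAHOUT WITH APACHE MAVEN in the diff above can be collected behind one helper. This is a sketch only: `mahout_build_cmd` and the labels `jvm`, `omp`, and `opencl` are made-up names for illustration, while the printed commands are copied from the README text. The function prints the command rather than running it.

```shell
# Hypothetical helper: map a build label to the Maven command given in the README.
# The labels jvm / omp / opencl are illustrative names, not Mahout terminology.
mahout_build_cmd() {
  case "$1" in
    jvm)    echo "mvn clean install -DskipTests" ;;
    omp)    echo "mvn clean install -Pviennacl-omp -Phadoop2 -DskipTests" ;;
    opencl) echo "mvn clean install -Pviennacl -Phadoop2 -DskipTests" ;;
    *)      echo "unknown build: $1 (expected jvm, omp, or opencl)" >&2
            return 1 ;;
  esac
}
```

One way to use it is to print the command first, inspect it, and then run it by hand from the `$MAHOUT_HOME` directory.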

[2/2] mahout git commit: Merge branch 'master' into MAHOUT-1956

Posted by ak...@apache.org.
Merge branch 'master' into MAHOUT-1956


Project: http://git-wip-us.apache.org/repos/asf/mahout/repo
Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/1199e511
Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/1199e511
Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/1199e511

Branch: refs/heads/master
Commit: 1199e51179bf858ccc3c809b332a0cc9cd950adb
Parents: 9eb68e5 2367fde
Author: Andrew Musselman <ak...@apache.org>
Authored: Tue Mar 21 08:34:21 2017 -0700
Committer: Andrew Musselman <ak...@apache.org>
Committed: Tue Mar 21 08:34:21 2017 -0700

----------------------------------------------------------------------

----------------------------------------------------------------------