Posted to commits@mahout.apache.org by dl...@apache.org on 2014/05/20 00:03:25 UTC

svn commit: r1596073 - /mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext

Author: dlyubimov
Date: Mon May 19 22:03:24 2014
New Revision: 1596073

URL: http://svn.apache.org/r1596073
Log:
CMS commit to mahout by dlyubimov

Modified:
    mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext?rev=1596073&r1=1596072&r2=1596073&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/sparkbindings/play-with-shell.mdtext Mon May 19 22:03:24 2014
@@ -67,7 +67,7 @@ val drmData = drmParallelize(dense(
 
 Have a look at this matrix. The first four columns represent the ingredients (our features) and the last column (the rating) is the target variable for our regression. [Linear regression](https://en.wikipedia.org/wiki/Linear_regression) assumes that the **target variable y** is generated by the linear combination of **the feature matrix X** with the **parameter vector β** plus the **noise ε**, summarized in the formula **y = Xβ + ε**. Our goal is to find an estimate of the parameter vector *β* that explains the data very well.
 
-As a first step, we extract `\(\mathbf{X}\)' and *y* from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet, it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
+As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which have the ingredients in milligrams as content. Note that the result is again a DRM. The shell will not execute this code yet, it saves the history of operations and defers the execution until we really access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
 
 <div class="codehilite"><pre>
 val drmX = drmData(::, 0 until 4)
@@ -135,7 +135,7 @@ def goodnessOfFit(drmX: DrmLike[Int], be
 
 So far we have left out an important aspect of a standard linear regression model. Usually there is a constant bias term added to the model. Without that, our model always crosses through the origin and we only learn the right angle. An easy way to add such a bias term to our model is to add a column of ones to the feature matrix *X*. The corresponding weight in the parameter vector will then be the bias term.
 
-Mahout's DSL offers a ```mapBlock()``` method for custom modifications of a DRM. All the rows in a partition are merged to a block of the matrix which is given to custom code in a closure. For our example, we invoke ```mapBlock``` with ```ncol = drmX.ncol + 1``` to let the system know that we change the number of columns of the matrix. The input to our closure is a ```block``` of the DRM and an array of ```keys``` for the rows contained in the block. In order to add a column, we first create a new block with an additional column, then copy the data from the current block into the new block and finally set the last column to ones and return the new block.
+Mahout's DSL offers a ```mapBlock()``` method for custom modifications of a DRM. All the rows in a partition are merged to a block of the matrix which is given to custom code in a closure. For our example, we invoke ```mapBlock``` with ```ncol = drmX.ncol + 1``` to let the system know that change the number of columns of the matrix. The input to our closure is a ```block``` of the DRM and an array of ```keys``` for the rows contained in the block. In order to add a column, we first create a new block with an additional column, then copy the data from the current block into the new block and finally set the last column to ones and return the new block.
 
 <div class="codehilite"><pre>
 val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
@@ -158,7 +158,7 @@ val betaWithBiasTerm = ols(drmXwithBiasC
 goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
 </pre></div>
 
-As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly  as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with ```uncache```:
+As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly  as input to a computation, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. In the end, we remove it from the cache with uncache:
 
 <div class="codehilite"><pre>
 val cachedDrmX = drmXwithBiasColumn.checkpoint()
@@ -172,4 +172,4 @@ goodness
 </pre></div>
 
 
-Liked what you saw? Checkout Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
+Liked what you saw? Checkout Mahout's overview for the [Scala and Spark bindings](https://mahout.apache.org/users/sparkbindings/home.html).
\ No newline at end of file