Posted to commits@mahout.apache.org by bu...@apache.org on 2014/05/18 17:44:37 UTC

svn commit: r909185 - in /websites/staging/mahout/trunk/content: ./ users/sparkbindings/play-with-shell.html

Author: buildbot
Date: Sun May 18 15:44:36 2014
New Revision: 909185

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sun May 18 15:44:36 2014
@@ -1 +1 @@
-1595618
+1595629

Modified: websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html (original)
+++ websites/staging/mahout/trunk/content/users/sparkbindings/play-with-shell.html Sun May 18 15:44:36 2014
@@ -353,7 +353,7 @@ export MASTER=[url of the Spark master]
 <h2 id="implementation">Implementation</h2>
 <p>We'll use the shell to interactively play with the data and incrementally implement a simple <a href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a> algorithm. Let's first load the dataset. Usually, we wouldn't need Mahout unless we processed a large dataset stored in a distributed filesystem. But for the sake of this example, we'll use our tiny toy dataset and "pretend" it was too big to fit onto a single machine.</p>
 <p><em>Note: You can incrementally follow the example by copy-and-pasting the code into your running Mahout shell.</em></p>
-<p>Mahout's linear algebra DSL has an abstraction called <em>DistributedRowMatrix (DRM)</em> which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use <code>dense()</code> to create a dense in-core matrix from our toy dataset and use <code>drmParallelize</code> to load it into the cluster, "mimicking" a large, partitioned dataset.</p>
+<p>Mahout's linear algebra DSL has an abstraction called <em>DistributedRowMatrix (DRM)</em> which models a matrix that is partitioned by rows and stored in the memory of a cluster of machines. We use <code>dense()</code> to create a dense in-memory matrix from our toy dataset and use <code>drmParallelize</code> to load it into the cluster, "mimicking" a large, partitioned dataset.</p>
 <div class="codehilite"><pre>
 val drmData = drmParallelize(dense(
   (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
@@ -374,12 +374,12 @@ val drmData = drmParallelize(dense(
 val drmX = drmData(::, 0 until 4)
 </pre></div>
 
-<p>Next, we extract the target variable vector <em>y</em>, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it in-core using <code>collect</code>:</p>
+<p>Next, we extract the target variable vector <em>y</em>, the fifth column of the data matrix. We assume this one fits into our driver machine, so we fetch it into memory using <code>collect</code>:</p>
 <div class="codehilite"><pre>
 val y = drmData.collect(::, 4)
 </pre></div>
 
-<p>Now we are ready to think about a mathematical way to estimate the parameter vector <em>β</em>. A simple textbook approach is <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares (OLS)</a>, which minimizes the sum of residual squares. In OLS, there is even a closed form expression for estimating <em>ß</em> as <strong><em>(X<sup>T</sup>X)<sup>-1</sup> X<sup>T</sup>y</em></strong>.</p>
+<p>Now we are ready to think about a mathematical way to estimate the parameter vector <em>β</em>. A simple textbook approach is <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">ordinary least squares (OLS)</a>, which minimizes the sum of squared residuals between the true target values and the predicted ones. In OLS, there is even a closed-form expression for estimating <em>β</em> as <strong><em>(X<sup>T</sup>X)<sup>-1</sup> X<sup>T</sup>y</em></strong>.</p>
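 <p><em>For orientation, here is a compact sketch of how this closed-form estimate maps onto the DSL, assuming the <code>drmX</code> and <code>y</code> values defined above and Mahout's in-memory <code>solve()</code> introduced further down. The individual steps are developed one by one below.</em></p>
 <div class="codehilite"><pre>
 // roadmap sketch: beta = (X^T X)^-1 X^T y
 val drmXtX = drmX.t %*% drmX                              // distributed X^T X
 val drmXty = drmX.t %*% y                                 // distributed X^T y
 val beta   = solve(drmXtX.collect, drmXty.collect(::, 0)) // solved in memory on the driver
 </pre></div>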
 <p>The first thing we compute for this is <strong><em>X<sup>T</sup>X</em></strong>. The code for doing this in Mahout's Scala DSL maps directly to the mathematical formula: the operation <code>.t()</code> transposes a matrix and, analogous to R, <code>%*%</code> denotes matrix multiplication.</p>
 <div class="codehilite"><pre>
 val drmXtX = drmX.t %*% drmX
@@ -389,7 +389,7 @@ val drmXtX = drmX.t %*% drmX
 <div class="codehilite"><pre>
 val drmXty = drmX.t %*% y
 </pre></div></p>
-<p>We're nearly done. The next step we take is to fetch <em>X<sup>T</sup>X</em> and <em>X<sup>T</sup>y</em> into the memory of our driver machine (we are targeting features matrices that are tall and skinny , so we can assume that <em>X<sup>T</sup>X</em> is small enough to fit in). Then, we provide them to an in-core solver (Mahout provides the an analogon to R's <code>solve()</code> for that) which computes <code>beta</code>, our OLS estimate of the parameter vector <em>β</em>.</p>
+<p>We're nearly done. The next step is to fetch <em>X<sup>T</sup>X</em> and <em>X<sup>T</sup>y</em> into the memory of our driver machine (we are targeting feature matrices that are tall and skinny, so we can assume that <em>X<sup>T</sup>X</em> is small enough to fit). Then, we provide them to an in-memory solver (Mahout provides an analog of R's <code>solve()</code> for that), which computes <code>beta</code>, our OLS estimate of the parameter vector <em>β</em>.</p>
 <div class="codehilite"><pre>
 val XtX = drmXtX.collect
 val Xty = drmXty.collect(::, 0)
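 // Sketch of the final step (assuming Mahout's in-memory solve(), the analog of
 // R's solve() mentioned above):
 // val beta = solve(XtX, Xty)   // OLS estimate of the parameter vector β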