Posted to commits@mahout.apache.org by bu...@apache.org on 2017/02/04 00:18:06 UTC

svn commit: r1006174 - in /websites/staging/mahout/trunk/content: ./ users/algorithms/d-spca.html

Author: buildbot
Date: Sat Feb  4 00:18:06 2017
New Revision: 1006174

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/algorithms/d-spca.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Feb  4 00:18:06 2017
@@ -1 +1 @@
-1781627
+1781628

Modified: websites/staging/mahout/trunk/content/users/algorithms/d-spca.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/algorithms/d-spca.html (original)
+++ websites/staging/mahout/trunk/content/users/algorithms/d-spca.html Sat Feb  4 00:18:06 2017
@@ -281,9 +281,9 @@
 h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
 <h1 id="distributed-stochastic-pca">Distributed Stochastic PCA<a class="headerlink" href="#distributed-stochastic-pca" title="Permanent link">&para;</a></h1>
 <h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent link">&para;</a></h2>
-<p>Mahout has a distributed implementation of Stochastic PCA[1]. this algorithm computes the exact equivalent of Mahout's <code>dssvd(\(\mathbf{A-1\mu}\))</code> by modifying the <code>dssvd</code> algorithm so as to avoid forming <code>\(\mathbf{A-1\mu}\)</code>, which would densify a sparse input. Thus, it is suitable for work with both dense and sparse inputs.</p>
+<p>Mahout has a distributed implementation of Stochastic PCA[1]. This algorithm computes the exact equivalent of Mahout's <code>dssvd(\(\mathbf{A-1\mu}\))</code> by modifying the <code>dssvd</code> algorithm so as to avoid forming <code>\(\mathbf{A-1\mu}\)</code>, which would densify a sparse input. Thus, it is suitable for both dense and sparse inputs.</p>
 <h2 id="algorithm">Algorithm<a class="headerlink" href="#algorithm" title="Permanent link">&para;</a></h2>
-<p>Given an <em>m</em> <code>\(\times\)</code> <em>n</em> matrix <code>\(\mathbf{A}\)</code>, a target rank <em>k</em>, and an oversampling parameter <em>p</em>, this procedure computes a <em>k</em>-rank PCA by finding the unknowns in <code>\(\mathbf{A−1\mu^\top \approx U\Sigma V}\)</code>:</p>
+<p>Given an <em>m</em> <code>\(\times\)</code> <em>n</em> matrix <code>\(\mathbf{A}\)</code>, a target rank <em>k</em>, and an oversampling parameter <em>p</em>, this procedure computes a <em>k</em>-rank PCA by finding the unknowns in <code>\(\mathbf{A−1\mu^\top \approx U\Sigma V^\top}\)</code>:</p>
 <ol>
 <li>Create seed for random <em>n</em> <code>\(\times\)</code> <em>(k+p)</em> matrix <code>\(\Omega\)</code>.</li>
 <li><code>\(\mathbf{s_\Omega \leftarrow \Omega^\top \mu}\)</code>.</li>
@@ -305,130 +305,131 @@ h2:hover > .headerlink, h3:hover > .head
 <li>Compute an eigensolution of the small symmetric <code>\(\mathbf{M = \hat{U} \Lambda \hat{U}^\top: M \in \mathbb{R}^{(k+p)\times(k+p)}}\)</code>.</li>
 <li>The singular values <code>\(\Sigma = \Lambda^{\circ 0.5}\)</code>, or, in other words, <code>\(\mathbf{\sigma_i= \sqrt{\lambda_i}}\)</code>.</li>
 <li>If needed, compute <code>\(\mathbf{U = Q\hat{U}}\)</code>.</li>
-<li>If needed, compute <code>\(\mathbf{V = B^\top \hat{U} \Sigma^{−1}}\)</code>. Another way is <code>\(\mathbf{V = A^\top U\Sigma^{−1}}\)</code>.</li>
+<li>If needed, compute <code>\(\mathbf{V = B^\top \hat{U} \Sigma^{−1}}\)</code>.</li>
 <li>If needed, items converted to the PCA space can be computed as <code>\(\mathbf{U\Sigma}\)</code>.</li>
 </ol>
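The steps above can be checked numerically. Below is an illustrative NumPy sketch (not Mahout code; the variable names and the low-rank test matrix are assumptions made for the demonstration) that follows the same trick: it never materializes \(\mathbf{A-1\mu^\top}\), yet recovers the singular values of the explicitly centered matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, p = 60, 40, 5, 10
r = k + p

# Low-rank data plus a constant offset, so the rank-(k+p) projection is exact
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) + 5.0
mu = A.mean(axis=0)                     # dataset mean

# Steps 1-2: random Omega and s_Omega = Omega^T mu
Omega = rng.standard_normal((n, r))
s_o = Omega.T @ mu

# Step 3: Y = A Omega - 1 s_Omega^T -- A - 1 mu^T is never formed
Y = A @ Omega - s_o
Q, _ = np.linalg.qr(Y)                  # thin QR, as in dqrThin

# B of the *uncentered* A, plus the small correction vectors
B = Q.T @ A
s_q = Q.sum(axis=0)                     # Q^T 1
s_b = B @ mu

# Small symmetric M = BB^T - s_q s_b^T - s_b s_q^T + (mu . mu) s_q s_q^T
M = (B @ B.T - np.outer(s_q, s_b) - np.outer(s_b, s_q)
     + (mu @ mu) * np.outer(s_q, s_q))
lam, Uhat = np.linalg.eigh(M)
sigma = np.sqrt(np.maximum(lam[::-1][:k], 0.0))   # sigma_i = sqrt(lambda_i), top k

# Reference: exact singular values of the explicitly centered matrix
exact = np.linalg.svd(A - mu, compute_uv=False)[:k]
print(np.allclose(sigma, exact))
```

Because the test matrix has rank at most *k* after centering, the rank-*(k+p)* random projection captures its range exactly and the recovered singular values match the reference.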
 <h2 id="implementation">Implementation<a class="headerlink" href="#implementation" title="Permanent link">&para;</a></h2>
 <p>Mahout <code>dspca(...)</code> is implemented in the mahout <code>math-scala</code> algebraic optimizer which translates Mahout's R-like linear algebra operators into a physical plan for both Spark and H2O distributed engines.</p>
-<p>def dspca<a href="drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0">K</a>:
-  (DrmLike[K], DrmLike[Int], Vector) = {</p>
-<div class="codehilite"><pre>// Some mapBlock() calls need it
-implicit val ktag =  drmA.keyClassTag
-
-val drmAcp = drmA.checkpoint()
-implicit val ctx = drmAcp.context
-
-val m = drmAcp.nrow
-val n = drmAcp.ncol
-assert(k <span class="err">&lt;</span>= (m min n), &quot;k cannot be greater than smaller of m, n.&quot;)
-val pfxed = safeToNonNegInt((m min n) - k min p)
-
-// Actual decomposition rank
-val r = k + pfxed
-
-// Dataset mean
-val mu = drmAcp.colMeans
-
-val mtm = mu dot mu
-
-// We represent Omega by its seed.
-val omegaSeed = RandomUtils.getRandom().nextInt()
-val omega = Matrices.symmetricUniformView(n, r, omegaSeed)
-
-// This done in front in a single-threaded fashion for now. Even though it doesn&#39;t require any
-// memory beyond that is required to keep xi around, it still might be parallelized to backs
-// for significantly big n and r. TODO
-val s_o = omega.t %*% mu
-
-val bcastS_o = drmBroadcast(s_o)
-val bcastMu = drmBroadcast(mu)
-
-var drmY = drmAcp.mapBlock(ncol = r) {
-  case (keys, blockA) ⇒
-    val s_o:Vector = bcastS_o
-    val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
-    for (row ← 0 until blockY.nrow) blockY(row, ::) -= s_o
-    keys → blockY
-}
-    // Checkpoint Y
-    .checkpoint()
+<div class="codehilite"><pre>def dspca[K](drmA: DrmLike[K], k: Int, p: Int = 15, q: Int = 0): 
+(DrmLike[K], DrmLike[Int], Vector) = {
 
-var drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
+    // Some mapBlock() calls need it
+    implicit val ktag =  drmA.keyClassTag
 
-var s_q = drmQ.colSums()
-var bcastVarS_q = drmBroadcast(s_q)
+    val drmAcp = drmA.checkpoint()
+    implicit val ctx = drmAcp.context
 
-// This actually should be optimized as identically partitioned map-side A&#39;B since A and Q should
-// still be identically partitioned.
-var drmBt = (drmAcp.t %*% drmQ).checkpoint()
-
-var s_b = (drmBt.t %*% mu).collect(::, 0)
-var bcastVarS_b = drmBroadcast(s_b)
-
-for (i ← 0 until q) {
-
-  // These closures don&#39;t seem to live well with outside-scope vars. This doesn&#39;t record closure
-  // attributes correctly. So we create additional set of vals for broadcast vars to properly
-  // create readonly closure attributes in this very scope.
-  val bcastS_q = bcastVarS_q
-  val bcastMuInner = bcastMu
-
-  // Fix Bt as B&#39; -= xi cross s_q
-  drmBt = drmBt.mapBlock() {
-    case (keys, block) ⇒
-      val s_q: Vector = bcastS_q
-      val mu: Vector = bcastMuInner
-      keys.zipWithIndex.foreach {
-        case (key, idx) ⇒ block(idx, ::) -= s_q * mu(key)
-      }
-      keys → block
-  }
-
-  drmY.uncache()
-  drmQ.uncache()
-
-  val bCastSt_b = drmBroadcast(s_b -=: mtm * s_q)
-
-  drmY = (drmAcp %*% drmBt)
-      // Fix Y by subtracting st_b from each row of the AB&#39;
-      .mapBlock() {
-    case (keys, block) ⇒
-      val st_b: Vector = bCastSt_b
-      block := { (_, c, v) ⇒ v - st_b(c) }
-      keys → block
-  }
-      // Checkpoint Y
-      .checkpoint()
-
-  drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
-
-  s_q = drmQ.colSums()
-  bcastVarS_q = drmBroadcast(s_q)
-
-  // This on the other hand should be inner-join-and-map A&#39;B optimization since A and Q_i are not
-  // identically partitioned anymore.
-  drmBt = (drmAcp.t %*% drmQ).checkpoint()
+    val m = drmAcp.nrow
+    val n = drmAcp.ncol
+    assert(k <span class="err">&lt;</span>= (m min n), &quot;k cannot be greater than smaller of m, n.&quot;)
+    val pfxed = safeToNonNegInt((m min n) - k min p)
+
+    // Actual decomposition rank
+    val r = k + pfxed
+
+    // Dataset mean
+    val mu = drmAcp.colMeans
+
+    val mtm = mu dot mu
+
+    // We represent Omega by its seed.
+    val omegaSeed = RandomUtils.getRandom().nextInt()
+    val omega = Matrices.symmetricUniformView(n, r, omegaSeed)
+
+    // This is done up front in a single-threaded fashion for now. Even though it doesn&#39;t require any
+    // memory beyond what is required to keep xi around, it still might be parallelized to backends
+    // for significantly big n and r. TODO
+    val s_o = omega.t %*% mu
+
+    val bcastS_o = drmBroadcast(s_o)
+    val bcastMu = drmBroadcast(mu)
+
+    var drmY = drmAcp.mapBlock(ncol = r) {
+        case (keys, blockA) ⇒
+            val s_o:Vector = bcastS_o
+            val blockY = blockA %*% Matrices.symmetricUniformView(n, r, omegaSeed)
+            for (row ← 0 until blockY.nrow) blockY(row, ::) -= s_o
+            keys → blockY
+    }
+            // Checkpoint Y
+            .checkpoint()
+
+    var drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
+
+    var s_q = drmQ.colSums()
+    var bcastVarS_q = drmBroadcast(s_q)
+
+    // This actually should be optimized as identically partitioned map-side A&#39;B since A and Q should
+    // still be identically partitioned.
+    var drmBt = (drmAcp.t %*% drmQ).checkpoint()
+
+    var s_b = (drmBt.t %*% mu).collect(::, 0)
+    var bcastVarS_b = drmBroadcast(s_b)
+
+    for (i ← 0 until q) {
+
+        // These closures don&#39;t seem to live well with outside-scope vars. This doesn&#39;t record closure
+        // attributes correctly. So we create additional set of vals for broadcast vars to properly
+        // create readonly closure attributes in this very scope.
+        val bcastS_q = bcastVarS_q
+        val bcastMuInner = bcastMu
+
+        // Fix Bt as B&#39; -= xi cross s_q
+        drmBt = drmBt.mapBlock() {
+            case (keys, block) ⇒
+                val s_q: Vector = bcastS_q
+                val mu: Vector = bcastMuInner
+                keys.zipWithIndex.foreach {
+                    case (key, idx) ⇒ block(idx, ::) -= s_q * mu(key)
+                }
+                keys → block
+        }
+
+        drmY.uncache()
+        drmQ.uncache()
+
+        val bCastSt_b = drmBroadcast(s_b -=: mtm * s_q)
+
+        drmY = (drmAcp %*% drmBt)
+            // Fix Y by subtracting st_b from each row of the AB&#39;
+            .mapBlock() {
+            case (keys, block) ⇒
+                val st_b: Vector = bCastSt_b
+                block := { (_, c, v) ⇒ v - st_b(c) }
+                keys → block
+        }
+        // Checkpoint Y
+        .checkpoint()
+
+        drmQ = dqrThin(drmY, checkRankDeficiency = false)._1.checkpoint()
+
+        s_q = drmQ.colSums()
+        bcastVarS_q = drmBroadcast(s_q)
+
+        // This on the other hand should be inner-join-and-map A&#39;B optimization since A and Q_i are not
+        // identically partitioned anymore.
+        drmBt = (drmAcp.t %*% drmQ).checkpoint()
+
+        s_b = (drmBt.t %*% mu).collect(::, 0)
+        bcastVarS_b = drmBroadcast(s_b)
+    }
+
+    val c = s_q cross s_b
+    val inCoreBBt = (drmBt.t %*% drmBt).checkpoint(CacheHint.NONE).collect -=:
+        c -=: c.t +=: mtm *=: (s_q cross s_q)
+    val (inCoreUHat, d) = eigen(inCoreBBt)
+    val s = d.sqrt
+
+    // Since neither drmU nor drmV are actually computed until actually used, we don&#39;t need the flags
+    // instructing compute (or not compute) either of the U,V outputs anymore. Neat, isn&#39;t it?
+    val drmU = drmQ %*% inCoreUHat
+    val drmV = drmBt %*% (inCoreUHat %*% diagv(1 / s))
 
-  s_b = (drmBt.t %*% mu).collect(::, 0)
-  bcastVarS_b = drmBroadcast(s_b)
+    (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
 }
-
-val c = s_q cross s_b
-val inCoreBBt = (drmBt.t %*% drmBt).checkpoint(CacheHint.NONE).collect -=:
-    c -=: c.t +=: mtm *=: (s_q cross s_q)
-val (inCoreUHat, d) = eigen(inCoreBBt)
-val s = d.sqrt
-
-// Since neither drmU nor drmV are actually computed until actually used, we don&#39;t need the flags
-// instructing compute (or not compute) either of the U,V outputs anymore. Neat, isn&#39;t it?
-val drmU = drmQ %*% inCoreUHat
-val drmV = drmBt %*% (inCoreUHat %*% diagv(1 / s))
-
-(drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
 </pre></div>
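A detail worth noting in the tail of the listing: <code>inCoreBBt</code> assembles \(\mathbf{\hat{B}\hat{B}^\top}\) of the <em>centered</em> data from the uncentered \(\mathbf{B = Q^\top A}\) using only the small vectors <code>s_q</code>, <code>s_b</code> and the scalar <code>mtm</code>. That identity can be verified with a short NumPy sketch (illustrative only; the random matrices here are stand-ins, not Mahout structures).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 30, 20, 8
A = rng.standard_normal((m, n))
Q, _ = np.linalg.qr(rng.standard_normal((m, r)))  # any orthonormal Q works here

mu = A.mean(axis=0)
mtm = mu @ mu                 # mu dot mu
B = Q.T @ A                   # uncentered B
s_q = Q.sum(axis=0)           # colSums of Q, i.e. Q^T 1
s_b = B @ mu                  # B mu

# inCoreBBt = BB^T - s_q s_b^T - s_b s_q^T + mtm * s_q s_q^T
lhs = (B @ B.T - np.outer(s_q, s_b) - np.outer(s_b, s_q)
       + mtm * np.outer(s_q, s_q))

# It must equal Q^T (A - 1 mu^T)(A - 1 mu^T)^T Q formed explicitly
Ac = A - mu
rhs = Q.T @ Ac @ Ac.T @ Q
print(np.allclose(lhs, rhs))   # True
```

Expanding \(\mathbf{(B - s_q\mu^\top)(B - s_q\mu^\top)^\top}\) gives exactly the four terms in the `inCoreBBt` expression, which is why the dense centered matrix never needs to exist.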
 
 
-<p>}</p>
 <h2 id="usage">Usage<a class="headerlink" href="#usage" title="Permanent link">&para;</a></h2>
 <p>The scala <code>dspca(...)</code> method can easily be called in any Spark or H2O application built with the <code>math-scala</code> library and the corresponding <code>Spark</code> or <code>H2O</code> engine module as follows:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">mahout</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">_</span>